MutualFormer: Multi-Modality Representation Learning via Mutual Transformer

Xixi Wang, Bo Jiang, Xiao Wang, Bin Luo

My Comments and Inspiration

To let tokens from different modalities communicate efficiently, the authors, inspired by the Regularized Diffusion Process (RDP) [26], propose MutualFormer as a replacement for the Transformer block to perform multi-modality fusion, which they argue is more flexible and efficient.

However, a few points remain puzzling:

  1. Why draw inspiration from RDP? What are its advantages? The paper claims that applying RDP remedies the two earlier kinds of cross attention, but how?
  2. It is not explained why RDP operates in a metric space; one probably needs to read the original RDP paper.
  3. The motivation for adopting RDP is unclear, and its advantages are not explained clearly enough.
  4. The whole module feels somewhat redundant.

Preface

  • The task of this work is multi-modality fusion, i.e., obtaining reliable and representative features of RGB and depth images.
  • As the authors say: the fundamental challenge of multi-modality fusion is how to exploit the useful cues of both intra-modality and cross-modality simultaneously for the final representation.
  • Currently, most Transformer-based multi-modality fusion works fuse in one of two ways: simple concatenation [20], [22], [24] or intuitive cross-modality fusion [19], [21], [23], [25]. These two approaches have the following drawbacks:
    • The former cannot fully exploit the relationship between the two modalities.
    • The latter captures the connection between modalities only via dot products; such an overly simple scheme may lead to unreliable learning results caused by the modality gap.

To address this, the authors propose Cross-diffusion Attention (CDA) to carry out the communication between different modalities.

Methods

In fact, MutualFormer is not a network but a module, analogous to a Transformer block. The authors modify the original Transformer block so that it can perform multi-modality fusion, yielding MutualFormer, as shown in the figure below.

MutualFormer contains three modules: ① Self-attention (SA) as the intra-modality token mixer, ② Cross-diffusion Attention (CDA) as the inter-modality token mixer, and ③ an Aggregation block for the final output. They are introduced one by one below.

Self-attention (SA)

Used to capture intra-modality information, i.e., the standard SA module is executed.

Given the token sequences of the two modalities, $X_r, X_d \in \mathbb{R}^{n \times d}$, where $n$ is the number of tokens and $d$ is the feature dimension of tokens, SA is calculated as follows:

$$M_{r}=\operatorname{softmax}\left(\frac{Q_{r} K_{r}^{\top}}{\sqrt{d}}\right) V_{r}, \qquad M_{d}=\operatorname{softmax}\left(\frac{Q_{d} K_{d}^{\top}}{\sqrt{d}}\right) V_{d}$$

where $Q_{r/d}, K_{r/d}, V_{r/d}$ are all obtained by conducting linear transformations on $X_{r/d}$, and $M_r, M_d$ are the outputs.
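A minimal sketch of the per-modality self-attention above, assuming single-head attention; the projection weights `Wq`, `Wk`, `Wv` are hypothetical stand-ins for the learned linear transformations:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Standard single-head self-attention over one modality's tokens.

    X: (n, d) token sequence; Wq/Wk/Wv: (d, d) projection weights.
    Returns M: (n, d) intra-modality token mixer output.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))  # (n, n) attention weights
    return A @ V

# usage: one SA branch per modality (RGB shown; depth is identical)
rng = np.random.default_rng(0)
n, d = 4, 8
X_r = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
M_r = self_attention(X_r, *W)  # shape (n, d)
```

The same function is applied independently to $X_r$ and $X_d$ (with separate weights) to produce $M_r$ and $M_d$.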

Cross-diffusion Attention (CDA)

Conducts the information communication among tokens belonging to different modalities.

[! Note]- In the paper, the authors list several common ways to compute cross attention. Most of them are based on the cross-similarities $S_{rd}, S_{dr}$, for example $S_{rd} = Q_r K_d^{\top}$.

The cross-attention output can then be calculated as $M_{rd} = \operatorname{softmax}(S_{rd}) V_d$. However, the authors argue that interacting between modalities merely via $S_{rd}$ is too simplistic: the intrinsic modality/domain gap makes the results unreliable.
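For comparison, the naive cross-attention the authors criticize can be sketched as follows (single head; `Wq`, `Wk`, `Wv` are hypothetical projection weights):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def naive_cross_attention(X_r, X_d, Wq, Wk, Wv):
    """Plain cross-modality attention: queries from one modality,
    keys/values from the other. S_rd is the dot-product
    cross-similarity that the paper argues is unreliable under a
    large modality gap."""
    Q_r = X_r @ Wq
    K_d, V_d = X_d @ Wk, X_d @ Wv
    S_rd = Q_r @ K_d.T / np.sqrt(X_r.shape[1])  # (n, n) cross-similarities
    return softmax(S_rd) @ V_d                  # M_rd, shape (n, d)

rng = np.random.default_rng(1)
n, d = 4, 8
X_r, X_d = rng.standard_normal((2, n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
M_rd = naive_cross_attention(X_r, X_d, Wq, Wk, Wv)
```

Swapping the roles of the two inputs gives the symmetric output $M_{dr}$.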

Inspired by the Regularized Diffusion Process (RDP) [26], the authors propose CDA.
Instead of defining cross-similarities in the feature space, CDA is defined in a metric space.


  • Get normalized similarity matrices

$$\begin{aligned} \hat{S}_{r}&=D_{r}^{-\frac{1}{2}} S_{r} D_{r}^{-\frac{1}{2}} \\ \hat{S}_{d}&=D_{d}^{-\frac{1}{2}} S_{d} D_{d}^{-\frac{1}{2}} \end{aligned}$$
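The symmetric normalization, together with a diffusion iteration, can be sketched as below. The update rule `S = eps * S_hat_r @ S @ S_hat_d + (1 - eps) * A` is my reading of the general RDP form [26], not the paper's exact CDA equation (which is given as an image), so treat it as an assumption:

```python
import numpy as np

def sym_normalize(S):
    """S_hat = D^(-1/2) S D^(-1/2), where D is the diagonal matrix
    holding the row sums of S."""
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    return S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def cross_diffusion(S_r, S_d, A, eps=0.5, n_iter=20):
    """Assumed RDP-style cross-diffusion: start from S_rd^(0) = I and
    iterate S^(t+1) = eps * S_hat_r @ S^(t) @ S_hat_d + (1-eps) * A,
    with eps in (0, 1) the balancing hyper-parameter."""
    S_hat_r, S_hat_d = sym_normalize(S_r), sym_normalize(S_d)
    S = np.eye(A.shape[0])  # S_rd^(0) = I, as in the paper
    for _ in range(n_iter):
        S = eps * S_hat_r @ S @ S_hat_d + (1.0 - eps) * A
    return S

# toy positive similarity matrices for the two modalities
rng = np.random.default_rng(2)
n = 5
S_r = np.abs(rng.standard_normal((n, n))) + 1e-3
S_d = np.abs(rng.standard_normal((n, n))) + 1e-3
A = 0.5 * (S_r + S_d)            # affinity matrix from the paper
S_rd = cross_diffusion(S_r, S_d, A)  # diffused cross-similarities
```

Because the normalized matrices have spectral radius at most 1 and `eps < 1`, the iteration converges to a finite fixed point that blends the diffused structure with the affinity prior $A$.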

where $D_r, D_d$ are diagonal matrices whose elements are the row sums of $S_r, S_d$ (in the paper's words: where $D_r, D_d$ is a diagonal matrix with elements defined by the row-addition of $S_r, S_d$).

  • Calculate CDA as

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031720524.png)

where $A=\frac{1}{2}(S_r+S_d)$, the balancing hyper-parameter $\epsilon\in (0,1)$, and $S_{rd}^{(0)}, S_{dr}^{(0)}$ can be initialized as the identity matrix $I$ or as some other initial affinity matrices obtained by other approaches, such as the affinity matrix $A$. In this paper, $S_{rd}^{(0)}=S_{dr}^{(0)}=I$. We have

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031725730.png)

> [! Question] What is the meaning of this? Does one need to refer to RDP [26]? Why does the diagonal matrix used in the normalization take the sum of each row?

> [! Discussion]-
> ![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031727954.png)![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031727050.png)

Now, we have four components (two per modality), i.e., $\{M_r, M_{rd}\}$ and $\{ M_d, M_{dr}\}$ (obtained by the SA and CDA modules). Then a more reliable representation $H$ is obtained for each modality:

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031735075.png)![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031735300.png)

That is, the two components are concatenated and passed through two convolutional networks ($f_{r/d}(\cdot)$, with separate weights per modality and per term).

**Aggregation block**

The final output of MutualFormer is the modality-invariant and context-aware representation $P$ of the tokens.

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031736158.png)

Here $\|$ denotes concatenation, $g, h$ denote two-layer convolutions (weights not shared), FFN denotes two FC layers (with GELU activation), and LN denotes LayerNorm. Note the residual-like connection here.

### Network Architecture

Since MutualFormer is a module, the authors apply it to tasks involving the RGB and depth modalities. The instantiated network architecture is:

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031642620.png)

## Experiments

## Some Descriptions