MutualFormer: Multi-Modality Representation Learning via Mutual Transformer

Xixi Wang, Bo Jiang, Xiao Wang, Bin Luo

Links

PDF Attachments: 2021’MutualFormer_Wang et al.pdf

Zotero Links: Local library

My Comments and Inspiration

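To let tokens from different modalities communicate efficiently, the authors draw on the Regularized Diffusion Process (RDP) [26] and propose MutualFormer, a replacement for the Transformer block that performs multi-modality fusion. They describe it as more flexible and more efficient.

Still, several points remain puzzling:

- Why is the design inspired by RDP, and what are RDP's advantages? The paper claims that reworking the two earlier kinds of cross attention in an RDP fashion makes up for their shortcomings.
- The paper does not explain why RDP operates in metric space. Does one need to read the original RDP paper?
- The motivation for adopting RDP is unclear, and its advantages are not explained clearly enough.
- The module as a whole feels somewhat redundant.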

Preface

The task of this work is multi-modality fusion, i.e., obtaining reliable and representative features from RGB and depth images.
As the authors put it: "The fundamental challenge of multi-modality fusion is how to exploit the useful cues of both intra-modality and cross-modality simultaneously for final representation."
At present, most Transformer-based multi-modality fusion methods fuse in one of two ways: simple concatenation [20], [22], [24] or intuitive cross-modality fusion [19], [21], [23], [25]. Each has a drawback:
- The former cannot fully exploit the relationships between the two modalities.
- The latter relates the modalities only through a dot product. This overly simple interaction may lead to "unreliable learning results caused by the modality gap."

To address this, the authors propose Cross-diffusion Attention (CDA) to handle communication between modalities.

Methods

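MutualFormer is in fact not a network but a module, similar to a Transformer block. The authors modify the original Transformer block so that it can perform multi-modality fusion, and the result is MutualFormer.

MutualFormer contains three parts: ① Self-attention (SA), the intra-modality token mixer; ② Cross-diffusion Attention (CDA), the inter-modality token mixer; ③ an aggregation block that produces the final output. Each is described in turn below.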

Self-attention (SA)

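Captures information within each modality.

That is, a standard SA module is applied to each modality.

Given the token sequences of the two modalities, $X_{r}, X_{d} \in \mathbb{R}^{n \times d}$, where $n$ is the number of tokens and $d$ is the feature dimension of each token, SA is computed separately for each modality.

$Q_{r/d}$, $K_{r/d}$, and $V_{r/d}$ are all obtained by applying linear transformations to $X_{r}$ and $X_{d}$; $M_{r}$ and $M_{d}$ are the SA outputs.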
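The SA formula itself is not written out in these notes. Assuming the standard scaled dot-product form (the $\sqrt{d}$ scaling in particular is an assumption here, not copied from the paper), it would read roughly as follows, with $S_{r}$ and $S_{d}$ being the intra-modality similarity matrices that CDA reuses later:

```latex
% Assumed standard scaled dot-product self-attention, one copy per modality.
\begin{aligned}
S_{r} &= \mathrm{Softmax}\!\left(\frac{Q_{r} K_{r}^{\top}}{\sqrt{d}}\right), & M_{r} &= S_{r} V_{r} \\
S_{d} &= \mathrm{Softmax}\!\left(\frac{Q_{d} K_{d}^{\top}}{\sqrt{d}}\right), & M_{d} &= S_{d} V_{d}
\end{aligned}
```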

Cross-diffusion Attention (CDA)

Handles the information communication among tokens belonging to different modalities.

[! Note]- The paper lists several common ways of computing cross attention. Most of them are based on the cross-similarities $S_{rd}, S_{dr}$, for example

$S_{rd} = S_{dr} = S_{d} + S_{r}$

$S_{rd} = \mathrm{Softmax}(Q_{r} K_{d}^{\top})$, $S_{dr} = \mathrm{Softmax}(Q_{d} K_{r}^{\top})$

The cross-attention outputs are then $M_{rd} = S_{rd} V_{d}$ and $M_{dr} = S_{dr} V_{r}$. However, the authors argue that interacting across modalities through $Q_{r} K_{d}^{\top}$ alone is too simple: the intrinsic modality/domain gap makes the results unreliable.
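As a rough illustration of the plain dot-product cross attention described in the note (the baseline the authors criticize, not CDA itself), here is a minimal pure-Python sketch. The helper names and the toy matrices are assumptions made for this example only.

```python
import math


def softmax_rows(m):
    """Row-wise softmax of a matrix stored as a list of lists."""
    out = []
    for row in m:
        mx = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cross_attention(q_r, k_d, v_d):
    """Plain cross attention: S_rd = Softmax(Q_r K_d^T), M_rd = S_rd V_d."""
    s_rd = softmax_rows(matmul(q_r, transpose(k_d)))
    return matmul(s_rd, v_d), s_rd


# Toy inputs (hypothetical): 2 tokens per modality, feature dimension 2.
q_r = [[1.0, 0.0], [0.0, 1.0]]
k_d = [[1.0, 0.0], [0.0, 1.0]]
v_d = [[1.0, 2.0], [3.0, 4.0]]
m_rd, s_rd = cross_attention(q_r, k_d, v_d)
```

Each row of `s_rd` is a probability distribution over the depth tokens, so each row of `m_rd` is a convex combination of the rows of `v_d`. Swapping the roles of the two modalities gives $M_{dr}$.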

Inspired by the Regularized Diffusion Process (RDP) [26], the authors propose CDA.
Instead of defining the cross similarities in feature space, CDA defines them in metric space.


Get normalized similarity matrices

$$
\begin{aligned}
\hat{S}_{r} &= D_{r}^{-\frac{1}{2}} S_{r} D_{r}^{-\frac{1}{2}} \\
\hat{S}_{d} &= D_{d}^{-\frac{1}{2}} S_{d} D_{d}^{-\frac{1}{2}}
\end{aligned}
$$
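The normalization step can be sketched in plain Python, taking $D$ to be the diagonal matrix whose entries are the row sums of $S$. The function name and the toy matrices are illustrative assumptions, not taken from the paper.

```python
import math


def symmetric_normalize(s):
    """Return D^{-1/2} S D^{-1/2}, where D is diagonal with the row sums of S.

    Assumes S is square and every row sum is strictly positive.
    """
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in s]
    n = len(s)
    return [[inv_sqrt[i] * s[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


# Toy similarity matrix (hypothetical) with row sums 3 and 4.
s_toy = [[2.0, 1.0], [1.0, 3.0]]
s_hat = symmetric_normalize(s_toy)

# A row-stochastic matrix (every row sums to 1) is left unchanged.
s_stochastic = [[0.5, 0.5], [0.2, 0.8]]
s_hat_stochastic = symmetric_normalize(s_stochastic)
```

One observation that follows directly from the formula: if $S$ comes from a row-wise softmax, every row sum is 1, so $D = I$ and $\hat{S} = S$. Whether the paper's $S_r, S_d$ are exactly row-stochastic at this point is not stated in these notes.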

Here, $D_r$ and $D_d$ are diagonal matrices whose diagonal entries are the row sums of $S_r$ and $S_d$, respectively (the paper's wording: "where $D_r, D_d$ is a diagonal matrix with elements defined by the row-addition of $S_r, S_d$").

Calculate CDA as

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031720524.png)

where $A=\frac{1}{2}(S_r+S_d)$, $\epsilon\in (0,1)$ is a balancing hyper-parameter, and $S_{rd}^{(0)}, S_{dr}^{(0)}$ can be initialized as the identity matrix $I$ or as some other initial affinity matrix obtained by other approaches, such as the affinity matrix $A$. The paper sets $S_{rd}^{(0)}=S_{dr}^{(0)}=I$. We have

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031725730.png)

> [! Question] What does this mean? Does one need to consult RDP [26]? Why is the diagonal matrix used for normalization built from the row sums?

> [! Discussion]-
> ![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031727954.png)![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031727050.png)

Now we have four components (two per modality), i.e., $\{M_r, M_{rd}\}$ and $\{M_d, M_{dr}\}$, obtained by the SA and CDA modules. From these, a more reliable representation $H$ is obtained for each modality:

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031735075.png)![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031735300.png)

That is, the two components are concatenated and passed through a convolutional network ($f_{r/d}(\cdot)$, with separate weights for each modality).

**Aggregation block**

The final output of MutualFormer is the modality-invariant and context-aware representation $P$ of the tokens.

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031736158.png)

$\|$ denotes concatenation, $g$ and $h$ denote two convolutional layers (weights not shared), FFN denotes two FC layers (with GELU activation), and LN denotes LayerNorm. Note that there is something like a residual connection here.

### Network architecture

Since MutualFormer is a module, the authors apply it to a task with two modalities, RGB and depth. The instantiated network is

![](https://perrin-cos-1302722167.cos.ap-beijing.myqcloud.com/images/202212031642620.png)

## Experiments

## Some Descriptions
