Unifying Vision-Language Representation Space with
Single-tower Transformer
AAAI 2023 (Oral)
- Jiho Jang* Seoul National University
- Chaerin Kong* Seoul National University
- Donghyeon Jeon NAVER
- Seonhoon Kim NAVER
- Nojun Kwak Seoul National University
Abstract
Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can be simply regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from the previous works that learn modality-specific representation spaces such as zero-shot object localization, text-guided visual reasoning and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework.
One-tower Contrastive Learning
Vision-language models can typically be assigned to one of four architecture prototypes. In this paper, we explore V-L representation learning with a single-tower transformer, i.e., a modality-agnostic transformer paired with a generic projection layer. However, single-tower models struggle to embed information from two distant modalities (i.e., image and text) due to the innate modality gap, as shown by the t-SNE plot below.
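For concreteness, here is a minimal PyTorch-style sketch of what such a single-tower encoder could look like: modality-specific input embedders feed one shared transformer and one generic projection head. All class and module names (e.g., `OneTowerEncoder`) are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of a single-tower (modality-agnostic) encoder, assuming a
# ViT-style patch embedding for images and a token embedding for text.
import torch
import torch.nn as nn

class OneTowerEncoder(nn.Module):
    def __init__(self, dim=512, depth=12, heads=8, vocab_size=30522,
                 patch_size=16):
        super().__init__()
        # Modality-specific input embedders (the only non-shared parts).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.token_embed = nn.Embedding(vocab_size, dim)
        # A single transformer shared by both modalities.
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        # One generic projection head, regardless of input modality.
        self.proj = nn.Linear(dim, dim)

    def encode_image(self, images):               # images: (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = self.transformer(x)
        return self.proj(x.mean(dim=1))            # pooled image embedding

    def encode_text(self, token_ids):              # token_ids: (B, L)
        x = self.token_embed(token_ids)
        x = self.transformer(x)
        return self.proj(x.mean(dim=1))            # pooled text embedding
```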
We observe that integrating the cross-modal mixup contrastive (XMC) objective into the single-tower framework successfully mitigates the modality gap and enables unified representation learning. We take a step further and devise the Contextual Invariance Contrastive (CIC) objective, which enforces the final representations formed using either the image or the text as the context (K, V in the attention layers) to be similar to each other. By combining the two, XMC and CIC, we stabilize single-tower unified representation learning and improve the representation quality over two-tower baselines (e.g., CLIP) with significantly fewer parameters.
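As a rough illustration of the XMC idea, the sketch below mixes paired image and text embeddings into a single anchor and contrasts it against both constituents with an InfoNCE-style loss. The mixing scheme, hyperparameters, and function names here are assumptions for exposition, not the exact objective used in the paper.

```python
# Hedged sketch of a cross-modal mixup contrastive (XMC) style loss, assuming
# pooled image and text embeddings from the shared tower.
import torch
import torch.nn.functional as F

def xmc_loss(img_emb, txt_emb, lam=0.5, temperature=0.07):
    """img_emb, txt_emb: (B, D) paired embeddings; lam: mixup coefficient."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    mix = F.normalize(lam * img + (1 - lam) * txt, dim=-1)   # mixed anchors

    logits_i = mix @ img.t() / temperature   # anchor-vs-image similarities
    logits_t = mix @ txt.t() / temperature   # anchor-vs-text similarities
    targets = torch.arange(img.size(0), device=img.device)

    # The matching image/text of each mixed anchor is treated as the positive.
    return 0.5 * (F.cross_entropy(logits_i, targets) +
                  F.cross_entropy(logits_t, targets))
```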
Intriguing Properties of One Representation (OneR)
As OneR treats individual image patch and text token embeddings as generic units of information and learns to attend across them indifferently, we discover many interesting properties such as excellent zero-shot object localization, text-guided visual reasoning and multi-modal retrieval. For object localization, we do not rely on any dedicated algorithm like Grad-CAM, but simply compute patch-text embedding similarities and visualize the highlighted regions. The figure below presents a more vivid comparison with competitive baselines, which have to resort to Grad-CAM since the patch-level similarity operation yields highly unreasonable results for them.
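For intuition, the sketch below shows how such a patch-text similarity heatmap could be computed, assuming access to per-patch image features (before pooling) and a pooled text embedding from the shared tower. The helper `localize` and its arguments are hypothetical names for illustration.

```python
# Minimal sketch of zero-shot localization via patch-text cosine similarity.
import torch
import torch.nn.functional as F

@torch.no_grad()
def localize(patch_feats, text_emb, grid_hw, patch_size=16):
    """patch_feats: (N, D) per-patch embeddings; text_emb: (D,) text query;
    grid_hw: (H, W) patch grid. Returns an image-resolution heatmap."""
    patches = F.normalize(patch_feats, dim=-1)
    query = F.normalize(text_emb, dim=-1)
    sim = patches @ query                          # (N,) cosine similarities
    heatmap = sim.view(*grid_hw)                   # reshape to the patch grid
    # Upsample to roughly image resolution for visualization.
    heatmap = F.interpolate(heatmap[None, None], scale_factor=patch_size,
                            mode="bilinear", align_corners=False)[0, 0]
    return heatmap
```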
Moreover, since OneR learns a single representation space for both the vision and language modalities, it shows better cross-modal transfer (i.e., performance gains in one modality after additional training on the other) than the baseline.
Above all, OneR records highly competitive results on traditional benchmarks such as COCO retrieval, even with no initialization prior and a significantly smaller number of parameters.