Analyzing Multimodal Objectives Through
the Lens of Generative Diffusion Guidance
ICLR 2023 Workshop on Multimodal Representation Learning (Spotlight)

Abstract

Recent years have witnessed astonishing advances in the field of multimodal representation learning, with contrastive learning serving as the cornerstone for major breakthroughs. The latest works have delivered further improvements by incorporating additional objectives such as masked modeling and captioning into these frameworks, but our understanding of how these objectives facilitate learning remains vastly incomplete. In this paper, we leverage the fact that classifier-guided diffusion models generate images that reflect the semantic signals provided by the classifier to study the characteristics of multimodal learning objectives. Specifically, we compare the contrastive, matching, and captioning losses in terms of the semantic signals they provide, and introduce a simple baseline that not only supports our analyses but also improves the quality of generative guidance in a straightforward manner.

Multimodal Objectives as Generative Guidance

Our goal is to study the properties of multimodal objectives by visualizing their guidance with a diffusion model. To that end, we employ two pretrained models. For the generative backbone, we use a generic unconditional diffusion model trained on ImageNet (Dhariwal & Nichol, 2021). This model suits our purpose because 1) it is unconditional, meaning that the capacity for class-conditioned synthesis depends solely on the classifier guidance, and 2) it has just the right level of generative capacity to faithfully visualize semantic signals without going further and compensating for their blind spots. For the guidance model, we use BLIP (Li et al., 2022a), which is pretrained on 129M image-text pairs and supports image-text contrastive (ITC), image-text matching (ITM), and captioning (CAP) objectives. Relying on a single guidance model minimizes unwanted compounding effects from using multiple models with differing specifications. We use the classifier-guided diffusion introduced in Dhariwal & Nichol (2021) with the compute-efficient modifications of Avrahami et al. (2022).
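For concreteness, below is a minimal sketch of one guided denoising step under this setup. It assumes the `p_mean_variance` interface of the guided-diffusion codebase accompanying Dhariwal & Nichol (2021); `itc_score` is a hypothetical wrapper around BLIP that returns a per-sample image-text similarity (an ITM or CAP score could be swapped in), and scoring the predicted clean image follows the spirit of Avrahami et al. (2022).

```python
import torch

def guided_step(diffusion, model, x_t, t, text, itc_score, scale=100.0):
    """One denoising step guided by a multimodal score (illustrative sketch).

    Assumes `diffusion.p_mean_variance` returns a dict with "mean", "variance",
    "log_variance", and "pred_xstart", as in the guided-diffusion codebase.
    `itc_score(images, text)` is a hypothetical BLIP wrapper; any of the
    ITC / ITM / CAP scores could take its place.
    """
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        out = diffusion.p_mean_variance(model, x_in, t)
        # Score the predicted clean image rather than the noisy sample,
        # in the spirit of the compute-efficient variant of Avrahami et al.
        score = itc_score(out["pred_xstart"], text).sum()
        grad = torch.autograd.grad(score, x_in)[0]
    # Classifier-guidance update: shift the posterior mean along the gradient
    # of the multimodal score, scaled by the posterior variance.
    mean = out["mean"] + scale * out["variance"] * grad
    noise = torch.randn_like(x_t)
    nonzero = (t != 0).float().view(-1, *([1] * (x_t.dim() - 1)))
    return mean + nonzero * torch.exp(0.5 * out["log_variance"]) * noise
```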

Findings

1. While ITC focuses on the fine details of the salient object, CAP tends to reason about the global scene composition.

2. ITC commonly lumps visual semantics together, forcefully forming a single global semantic.

3. Patch-token cross-attention plays a key role in fine-grained visual understanding.

4. Dense supervision makes the representations more robust to noise perturbations.

5. CAP is a more indirect, if not more challenging, form of supervision than ITC or ITM.

New Baseline: Guidance Shift

Based on the above findings, we propose a simple yet effective modification to the previous regime that takes advantage of both objectives. Specifically, we start from CAP guidance to outline the overall scene structure and gradually shift towards ITC guidance for refined details. We compare this with naive baselines: ITC alone, CAP alone, and BLEND, which mixes the two signals without the gradual transition. We report both qualitative and quantitative human evaluations. Our approach not only supports our empirical insights but also improves complex scene generation in a remarkably straightforward manner.
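As a rough illustration, the guidance shift can be realized as a time-dependent weighting of the two guidance gradients. The linear schedule below and the wrappers `cap_grad` and `itc_grad` are our own illustrative assumptions, not necessarily the exact schedule used for the reported results.

```python
import torch

def shifted_guidance(x_t, t, text, cap_grad, itc_grad, num_steps):
    """Blend CAP and ITC guidance with a time-dependent weight (sketch).

    `cap_grad` and `itc_grad` are hypothetical wrappers returning the gradient
    of the respective BLIP objective w.r.t. the current sample. `t` counts down
    from num_steps - 1 (pure noise) to 0 (clean image), so `progress` runs 0 -> 1.
    """
    progress = 1.0 - t.float() / (num_steps - 1)
    w_itc = progress.view(-1, *([1] * (x_t.dim() - 1)))  # ramps 0 -> 1
    w_cap = 1.0 - w_itc                                  # ramps 1 -> 0
    # Early steps: CAP outlines the global scene structure.
    # Late steps:  ITC refines the fine details of the salient objects.
    return w_cap * cap_grad(x_t, t, text) + w_itc * itc_grad(x_t, t, text)
```

Fixing both weights at 0.5 for every step recovers the BLEND baseline, which mixes the two signals without the gradual transition.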

Additional Results

We present additional examples that further clarify our analysis.

In contrast to ITC, which generates realistic samples but whose outputs oscillate severely even under minor typos, dense supervision shows much better robustness. This can come in handy in typical V-L settings, where we rely on massive, noisy, web-crawled image-text databases.