CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

1Goethe University Frankfurt, 2MIT, 3IBM Research, 4MIT-IBM Watson AI Lab, 5Tuebingen AI Center/University of Tuebingen
Teaser figure

By representing audio with multiple finer-grained representations aligned with individual video frames, CAV-MAE Sync improves the precision of audio-visual alignment.

Abstract

Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment.

In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges:

  • First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations.
  • Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens.
  • Third, we improve spatial localization by introducing learnable register tokens that reduce the semantic load on patch tokens (a token-layout sketch follows below).
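
To make these three design choices concrete, here is a minimal PyTorch-style sketch of the token layout they imply: per-frame audio segments, dedicated global tokens for the contrastive and reconstruction objectives, and learnable register tokens. Class and parameter names (`AudioTokenizer`, `num_registers`, the dimensions) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AudioTokenizer(nn.Module):
    """Splits a spectrogram into per-video-frame segments and patch-embeds each one."""
    def __init__(self, segment_frames=16, patch=16, embed_dim=768, num_registers=8):
        super().__init__()
        self.segment_frames = segment_frames
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=patch)
        # Dedicated global tokens: one feeds the contrastive objective, the other
        # the reconstruction objective, so the two losses do not compete for the
        # same representation.
        self.cls_contrastive = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.cls_reconstruction = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable register tokens that absorb global "bookkeeping" so patch
        # tokens can stay spatially local.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))

    def forward(self, spectrogram):  # spectrogram: (B, 1, n_mels, T)
        B = spectrogram.size(0)
        # One audio segment per video frame instead of a single global clip.
        segments = spectrogram.split(self.segment_frames, dim=-1)
        out = []
        for seg in segments:
            x = self.patch_embed(seg).flatten(2).transpose(1, 2)   # (B, N_patches, D)
            x = torch.cat([self.cls_contrastive.expand(B, -1, -1),
                           self.cls_reconstruction.expand(B, -1, -1),
                           self.registers.expand(B, -1, -1),
                           x], dim=1)
            out.append(x)
        return out  # list aligned one-to-one with video frames
```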

We evaluate the proposed approach on the AudioSet, VGGSound, and ADE20K Sound datasets on zero-shot retrieval, classification, and localization tasks, demonstrating state-of-the-art performance and outperforming more complex architectures.

Method Overview

Method overview

Our model processes video frames and audio segments in parallel through separate encoders $E_a$ and $E_v$, with the audio encoder $E_a$ operating at a finer temporal granularity so that each audio segment lines up with an individual video frame. The two modalities interact through the Joint Layer $L$ and the Joint Decoder $D$.
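
As a rough illustration of how such per-frame alignment can be trained, the sketch below shows a symmetric InfoNCE-style contrastive loss in which each video frame embedding is matched against the audio segment embedding covering the same time span. The function name, tensor shapes, and temperature value are assumptions for illustration, not the paper's released training code.

```python
import torch
import torch.nn.functional as F

def fine_grained_contrastive_loss(video_frames, audio_segments, temperature=0.07):
    """
    video_frames:   (B, T, D) per-frame global embeddings from the visual encoder
    audio_segments: (B, T, D) per-segment global embeddings from the audio encoder
    """
    B, T, D = video_frames.shape
    v = F.normalize(video_frames.reshape(B * T, D), dim=-1)
    a = F.normalize(audio_segments.reshape(B * T, D), dim=-1)
    logits = v @ a.t() / temperature                  # (B*T, B*T) similarity matrix
    targets = torch.arange(B * T, device=logits.device)
    # Symmetric InfoNCE: temporally aligned pairs are positives, all others negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```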

Results

Quantitative Results

Visual → Audio (V→A), AudioSet and VGGSound eval subsets

| Method | AudioSet R@1 | AudioSet R@5 | AudioSet R@10 | VGGSound R@1 | VGGSound R@5 | VGGSound R@10 |
|---|---|---|---|---|---|---|
| VAB-Encodec | 39.5 | 65.4 | 74.6 | 33.5 | 63.3 | 74.3 |
| CAV-MAE | 16.1 | 38.6 | 49.3 | 14.7 | 35.3 | 45.9 |
| CAV-MAE Scale+ | 18.8 | 39.5 | 50.1 | 14.8 | 34.2 | 44.0 |
| LanguageBind | 6.4 | 20.2 | 28.3 | 10.3 | 30.1 | 39.7 |
| AVSiam | 19.7 | - | - | 19.0 | - | - |
| ImageBind | 22.1 | 43.2 | 52.6 | 21.6 | 43.4 | 52.9 |
| Ours | 35.2 | 58.3 | 67.6 | 27.9 | 51.7 | 61.8 |

Audio → Visual (A→V), AudioSet and VGGSound eval subsets

| Method | AudioSet R@1 | AudioSet R@5 | AudioSet R@10 | VGGSound R@1 | VGGSound R@5 | VGGSound R@10 |
|---|---|---|---|---|---|---|
| VAB-Encodec | 37.5 | 64.0 | 73.7 | 34.9 | 62.7 | 73.1 |
| CAV-MAE | 9.5 | 22.6 | 32.4 | 8.3 | 23.8 | 32.4 |
| CAV-MAE Scale+ | 15.1 | 34.0 | 43.0 | 12.8 | 30.4 | 40.3 |
| LanguageBind | 4.4 | 15.0 | 22.5 | 6.5 | 22.7 | 33.5 |
| AVSiam | 17.6 | - | - | 20.4 | - | - |
| ImageBind | 20.8 | 42.6 | 51.6 | 20.7 | 43.2 | 53.4 |
| Ours | 27.9 | 52.4 | 62.2 | 23.2 | 46.2 | 58.1 |

Zero-shot retrieval results on AudioSet and VGGSound for Visual to Audio (V→A) and Audio to Visual (A→V) tasks. Our model achieves state-of-the-art zero-shot performance across all retrieval metrics (R@1, R@5, R@10) on both datasets, surpassing zero-shot baselines such as ImageBind and AVSiam. Scores for the fine-tuned VAB-Encodec are included as an upper-bound reference.
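
For reference, R@K is the fraction of queries whose ground-truth match appears among the top-K retrieved items. A minimal sketch of this metric, independent of the paper's evaluation code and with assumed variable names, is:

```python
import torch

def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """query_emb, gallery_emb: (N, D) L2-normalized embeddings; gallery item i
    is the ground-truth match for query i."""
    sims = query_emb @ gallery_emb.t()                    # (N, N) cosine similarities
    ranks = sims.argsort(dim=1, descending=True)          # gallery indices sorted per query
    gt = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    hit_rank = (ranks == gt).float().argmax(dim=1)        # position of the correct item
    return {f"R@{k}": (hit_rank < k).float().mean().item() for k in ks}
```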

🚧

Webpage under construction. Keep an eye out for updates on GitHub!

BibTeX

@inproceedings{araujo2025cavmae,
  title={CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment},
  author={Edson Araujo and Andrew Rouditchenko and Yuan Gong and Saurabhchand Bhati and Samuel Thomas and Brian Kingsbury and Leonid Karlinsky and Rogerio Feris and James R. Glass and Hilde Kuehne},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}