
By representing audio with multiple finer-grained representations aligned with individual video frames, CAV-MAE Sync improves the precision of audio-visual alignment.
Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondence with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when jointly learning reconstruction and cross-modal alignment.
In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges:

1. We treat audio as a temporal sequence aligned with individual video frames, rather than as a single global representation, resolving the granularity mismatch between the two modalities (see the sketch below).
2. We decouple the conflicting contrastive and reconstruction objectives by introducing dedicated global tokens for each.
3. We introduce learnable register tokens that reduce the semantic load on patch tokens, improving spatial localization.
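The first point is easiest to see in code. The minimal sketch below slices a clip-level mel-spectrogram into one short window per sampled video frame, so that each contrastive pair is (frame, audio window) rather than (video, whole-clip audio). The function name, window width, and tensor layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def align_audio_to_frames(spectrogram, frame_times, clip_duration, win_bins=96):
    """Slice a mel-spectrogram (n_mels, n_time_bins) into one window per video frame.

    spectrogram   : log-mel features for the whole clip
    frame_times   : timestamps (seconds) of the sampled video frames
    clip_duration : total clip length in seconds
    win_bins      : width of each audio window in spectrogram bins (illustrative)
    """
    n_mels, n_bins = spectrogram.shape
    windows = []
    for t in frame_times:
        # Map the frame timestamp to a spectrogram bin and center a window on it.
        center = int(t / clip_duration * n_bins)
        start = max(0, min(center - win_bins // 2, n_bins - win_bins))
        windows.append(spectrogram[:, start:start + win_bins])
    # (num_frames, n_mels, win_bins): one fine-grained audio view per frame
    return torch.stack(windows)
```

Each window would then be patchified and encoded on its own, giving the audio encoder the same temporal granularity as the sampled frames.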
We evaluate the proposed approach on AudioSet, VGGSound, and the ADE20K Sound dataset on zero-shot retrieval, classification, and localization tasks, demonstrating state-of-the-art performance while outperforming more complex architectures.
Our model processes video frames and audio segments in parallel through separate encoders $E_v$ and $E_a$, with the audio encoder $E_a$ operating at a finer temporal granularity to better align with the visual frames. The two modalities interact through the Joint Layer $L$ and the Joint Decoder $D$.
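The two-stream design above can be summarized in pseudo-PyTorch. The module below is a schematic sketch only: it assumes ViT-style transformer encoders, uses illustrative layer counts and dimensions, and omits the global and register tokens; it is not the released implementation.

```python
import torch
import torch.nn as nn

class CAVMAESyncSketch(nn.Module):
    """Schematic two-stream model; a sketch, not the official implementation."""

    def __init__(self, dim=768):
        super().__init__()
        self.audio_encoder = nn.TransformerEncoder(     # E_a: per-window audio patches
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=11)
        self.visual_encoder = nn.TransformerEncoder(    # E_v: per-frame visual patches
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=11)
        self.joint_layer = nn.TransformerEncoderLayer(  # L: shared cross-modal layer
            dim, nhead=12, batch_first=True)
        self.joint_decoder = nn.TransformerEncoder(     # D: reconstructs masked patches
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=2)

    def forward(self, audio_tokens, visual_tokens):
        # Encode each modality separately; audio runs at frame-level granularity.
        a = self.audio_encoder(audio_tokens)
        v = self.visual_encoder(visual_tokens)
        # The joint layer sees the concatenated sequence so modalities interact.
        joint = self.joint_layer(torch.cat([a, v], dim=1))
        # The joint decoder serves the masked-autoencoding reconstruction loss;
        # the contrastive loss would be computed on pooled audio/visual features.
        recon = self.joint_decoder(joint)
        return a, v, recon
```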
| | Visual → Audio: AudioSet Eval Subset | | | Visual → Audio: VGGSound Eval Subset | | | Audio → Visual: AudioSet Eval Subset | | | Audio → Visual: VGGSound Eval Subset | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| VAB-Encodec | 39.5 | 65.4 | 74.6 | 33.5 | 63.3 | 74.3 | 37.5 | 64.0 | 73.7 | 34.9 | 62.7 | 73.1 |
| CAV-MAE | 16.1 | 38.6 | 49.3 | 14.7 | 35.3 | 45.9 | 9.5 | 22.6 | 32.4 | 8.3 | 23.8 | 32.4 |
| CAV-MAE Scale+ | 18.8 | 39.5 | 50.1 | 14.8 | 34.2 | 44.0 | 15.1 | 34.0 | 43.0 | 12.8 | 30.4 | 40.3 |
| LanguageBind | 6.4 | 20.2 | 28.3 | 10.3 | 30.1 | 39.7 | 4.4 | 15.0 | 22.5 | 6.5 | 22.7 | 33.5 |
| AVSiam | 19.7 | - | - | 19.0 | - | - | 17.6 | - | - | 20.4 | - | - |
| ImageBind | 22.1 | 43.2 | 52.6 | 21.6 | 43.4 | 52.9 | 20.8 | 42.6 | 51.6 | 20.7 | 43.2 | 53.4 |
| Ours | 35.2 | 58.3 | 67.6 | 27.9 | 51.7 | 61.8 | 27.9 | 52.4 | 62.2 | 23.2 | 46.2 | 58.1 |
Zero-shot retrieval results on AudioSet and VGGSound for Visual to Audio (V→A) and Audio to Visual (A→V) tasks. Our model achieves state-of-the-art zero-shot performance across all retrieval metrics (R@1, R@5, R@10) on both datasets, surpassing baselines like ImageBind and AVSiam. Fine-tuned VAB-Encodec scores are provided as an upper bound for comparison.
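For reference, R@K counts a query as correct when its ground-truth pair appears among the top-K retrieved items. A minimal sketch, assuming L2-normalized embeddings where row $i$ of the query and gallery matrices belongs to the same clip:

```python
import torch

def recall_at_k(query_emb, gallery_emb, ks=(1, 5, 10)):
    """Recall@K for retrieval: row i of query_emb matches row i of gallery_emb.

    Embeddings are assumed L2-normalized, so dot product equals cosine similarity.
    """
    sim = query_emb @ gallery_emb.T                   # (N, N) similarity matrix
    ranks = sim.argsort(dim=1, descending=True)       # retrieved indices per query
    target = torch.arange(sim.size(0)).unsqueeze(1)   # ground-truth index per row
    return {f"R@{k}": (ranks[:, :k] == target).any(dim=1).float().mean().item()
            for k in ks}
```

Visual-to-audio retrieval passes the visual embeddings as queries against an audio gallery; audio-to-visual simply swaps the two.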
🚧
Webpage under construction. Keep an eye out for updates on GitHub!
@inproceedings{araujo2025cavmae,
  title={CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment},
  author={Edson Araujo and Andrew Rouditchenko and Yuan Gong and Saurabhchand Bhati and Samuel Thomas and Brian Kingsbury and Leonid Karlinsky and Rogerio Feris and James R. Glass and Hilde Kuehne},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}