Edson Araujo

I'm a PhD student at the University of Tübingen, working with Prof. Hilde Kuehne. Our work is part of the MIT-IBM Watson AI Sight and Sound Project, where I focus on audio-visual reasoning, multimodal large language models, and test-time adaptation.

I did my Master's in Computer Science at UFMG under the supervision of Prof. Erickson Nascimento, during which I collaborated on different research topics such as video summarization and image descriptors.

Email:

Portrait photo of Edson Araujo

News

12.2025 We are organizing the fifth edition of the "What is Next in Multimodal Foundation Models?" Workshop (CVPR 2026).

08.2025 Omni-R1 was accepted to ASRU 2025! (shortlisted for Best Student Paper!)

05.2025 Omni-R1, our latest work from the MIT-IBM Watson AI Sight and Sound Project, is out on arXiv!

05.2025 CAV-MAE Sync will also be presented at the LatinX, MMFM, and Sight and Sound workshops at CVPR 2025!

02.2025 CAV-MAE Sync was accepted to CVPR 2025 as a poster presentation. The paper is on arXiv.

Research

I'm interested in audio-visual reasoning, multimodal large language models, self-supervised learning, and test-time adaptation. Some papers are highlighted.

Selected Publications

AVRT architecture diagram showing audio-visual reasoning transfer from single-modality teachers
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
Edson Araujo, Saurabhchand Bhati, M. Jehanzeb Mirza, Brian Kingsbury, Samuel Thomas, Rogerio Feris, James R. Glass, Hilde Kuehne
Under review

Generates high-quality audio-visual reasoning traces from single-modality teachers. Achieves superior performance on OmniBench, DailyOmni, and MMAR benchmarks.

TTA-Vid framework diagram for test-time adaptation on instructional videos
TTA-Vid: Test-Time Adaptation for Long Instructional Videos
Soumya Shamarao Jahagirdar*, Edson Araujo*, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R. Glass, Hilde Kuehne
Under review

Adapts video-language models at test time using step-by-step frame reasoning and multi-armed bandit frame selection. No labels required.

Omni-R1 model architecture for audio LLM fine-tuning
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
ASRU 2025

Fine-tunes Qwen2.5-Omni using GRPO, achieving SOTA on MMAU. Surprisingly, text-only fine-tuning also improves audio performance.

CAV-MAE Sync architecture showing fine-grained audio-visual alignment
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
CVPR 2025

Fine-grained audio-visual alignment using temporal sequences instead of global representations. Outperforms complex architectures on retrieval, classification, and localization.

VDAN+ architecture for text-driven video acceleration using reinforcement learning
Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method
Washington Ramos, Michel Silva, Edson Araujo, Victor Moura, Keller Oliveira, Leandro Soriano Marcolino
TPAMI 2023

Weakly-supervised RL method for text-driven video acceleration using the VDAN+ architecture.

VDAN network diagram for video fast-forwarding via reinforcement learning
Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data
Washington Ramos, Michel Silva, Edson Araujo, Leandro Soriano Marcolino, Erickson Nascimento
CVPR 2020

RL-based video acceleration using textual guidance and the VDAN architecture. SOTA on F1 and segment coverage.

System diagram for personalized video fast-forwarding using social network data
Personalizing Fast-Forward Videos Based on Visual and Textual Features from Social Network
Washington Ramos, Michel Silva, Edson Araujo, Alan Neves, Erickson Nascimento
WACV 2020

Personalized first-person video (FPV) fast-forwarding using social network data to infer user interests.

Experimental setup showing driving simulation with auditory annoyance measurement
On Modeling the Effects of Auditory Annoyance on Driving Style and Passenger Comfort
Edson Araujo, Michal Gregor, Isabella Huang, Erickson R. Nascimento, Ruzena Bajcsy
IROS 2019

Detects driver annoyance from inertial measurements with 77% accuracy. Studies the effect of auditory annoyance on driving style.