Edson Araujo

I'm a PhD student at the University of Tübingen, working with Prof. Hilde Kuehne and co-advised by Dr. Jim Glass (MIT CSAIL). Our work is part of the MIT-IBM Watson AI Sight and Sound Project, where I focus on audio-visual reasoning, multimodal large language models, and test-time adaptation.

I'm currently a Research Scientist Intern at Meta on the RL Audio team.

I did my Master's in Computer Science at UFMG under the supervision of Prof. Erickson Nascimento. During that time I collaborated on research topics such as video summarization and image descriptors.

Email: [first_name]@[last_name].info

News

06.2026 Started as a Research Scientist Intern at Meta on the RL Audio team.

05.2026 Accepted to the CVPR 2026 Doctoral Consortium!

04.2026 Recognized as a 'Top 200' reviewer at ICLR 2026.

12.2025 We are organizing the fifth edition of the "What is Next in Multimodal Foundation Models?" Workshop (CVPR 2026).

08.2025 Omni-R1 was accepted to ASRU 2025! (Shortlisted for Best Student Paper!)

05.2025 Omni-R1, our latest work from the MIT-IBM Watson AI Sight and Sound Project, is out on arXiv!

05.2025 CAV-MAE Sync will also be presented at the LatinX, MMFM, and Sight and Sound Workshops at CVPR 2025!

02.2025 CAV-MAE Sync was accepted to CVPR 2025 as a poster presentation. The paper is on arXiv.

Research

My research asks how models can reason jointly over what they see and what they hear. I study audio-visual representation learning, building self-supervised encoders that align sound and vision at a fine temporal scale. I extend this to audio-visual reasoning with multimodal large language models, so that a model can combine evidence from both modalities to answer questions. I also work on test-time adaptation for video reasoning, adapting models to new videos without labels. Some papers are highlighted.

Service

Organizer, "What is Next in Multimodal Foundation Models?" Workshop at CVPR 2026 (fifth edition).

Reviewer for CVPR, ICLR (recognized as a 'Top 200' reviewer at ICLR 2026), and related venues.

Selected Publications

[Figure: AVRT architecture diagram showing audio-visual reasoning transfer from single-modality teachers]

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

Edson Araujo, Saurabhchand Bhati, M. Jehanzeb Mirza, Brian Kingsbury, Samuel Thomas, Rogerio Feris, James R. Glass, Hilde Kuehne

arXiv preprint, 2026

Project · arXiv · BibTeX

Composes multimodal reasoning traces from single-modality teachers via LLM merging. Achieves an average improvement of +7.8 on audio-visual benchmarks and +6.3 on audio benchmarks.

[Figure: TTA-Vid framework diagram for test-time adaptation on instructional videos]

TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

Soumya Shamarao Jahagirdar*, Edson Araujo*, Anna Kukleva, M. Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R. Glass, Hilde Kuehne

arXiv preprint, 2026

Project · arXiv · BibTeX

Adapts video-language models at test time using step-by-step frame reasoning and multi-armed bandit frame selection. No labels required.

[Figure: Omni-R1 model architecture for audio LLM fine-tuning]

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

ASRU 2025

arXiv · BibTeX

Fine-tunes Qwen2.5-Omni using GRPO, achieving SOTA on MMAU. Surprisingly, text-only fine-tuning also improves audio performance.

[Figure: CAV-MAE Sync architecture showing fine-grained audio-visual alignment]

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne

CVPR 2025

Project · Code · arXiv · BibTeX

Fine-grained audio-visual alignment using temporal sequences instead of global representations. Outperforms complex architectures on retrieval, classification, and localization.

[Figure: VDAN+ architecture for text-driven video acceleration using reinforcement learning]

Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method

Washington Ramos, Michel Silva, Edson Araujo, Victor Moura, Keller Oliveira, Leandro Soriano Marcolino

TPAMI 2023

Project · Code · arXiv · BibTeX

Weakly-supervised RL method for text-driven video acceleration using the VDAN+ architecture.

[Figure: VDAN network diagram for video fast-forwarding via reinforcement learning]

Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data

Washington Ramos, Michel Silva, Edson Araujo, Leandro Soriano Marcolino, Erickson Nascimento

CVPR 2020

Project · Code · arXiv · BibTeX

RL-based video acceleration using textual guidance and the VDAN architecture. SOTA on F1 and segment coverage.

[Figure: System diagram for personalized video fast-forwarding using social network data]

Personalizing Fast-Forward Videos Based on Visual and Textual Features from Social Network

Washington Ramos, Michel Silva, Edson Araujo, Alan Neves, Erickson Nascimento

WACV 2020

Project · arXiv · BibTeX

Personalized fast-forwarding of first-person videos (FPV), using social network data to infer user interests.

[Figure: Experimental setup showing driving simulation with auditory annoyance measurement]

On Modeling the Effects of Auditory Annoyance on Driving Style and Passenger Comfort

Edson Araujo, Michal Gregor, Isabella Huang, Erickson R. Nascimento, Ruzena Bajcsy

IROS 2019

Paper · BibTeX

Detects driver annoyance from inertial measurements with 77% accuracy. Studies acoustic impact on driving style.

For a full publication list, see my Google Scholar.