Edson Araujo

I'm a PhD student at Goethe University Frankfurt, advised by Prof. Hilde Kuehne. Our work is part of the MIT-IBM Watson AI Sight and Sound Project, through which we collaborate with several researchers on multi-modal learning.

I did my Master's in Computer Science at UFMG under the supervision of Prof. Erickson Nascimento, during which I collaborated on research topics such as video summarization and image descriptors.

CV  /  Scholar  /  Twitter  /  Bluesky  /  Github

Email: [last_name] at uni-frankfurt.de

profile photo

🔥 News

05.2025  Omni-R1, our latest work from the MIT-IBM Watson AI Sight and Sound Project, is out on arXiv!

05.2025  CAV-MAE Sync will also be presented at the LatinX, MMFM, and Sight and Sound Workshops at CVPR 2025!

02.2025  CAV-MAE Sync was accepted to CVPR 2025 as a poster presentation. The paper is on arXiv.


Research

I'm interested in multimodal learning, self-supervised methods, and audio-visual representation learning. Some papers are highlighted.

Selected Publications

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
arXiv, 2025
arXiv

Omni-R1 fine-tunes the Qwen2.5-Omni multi-modal LLM on an audio QA dataset using GRPO, achieving SOTA on the MMAU benchmark. Surprisingly, much of this improvement stems from enhanced text-based reasoning, and fine-tuning on a text-only dataset also boosted audio-based performance.

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
CVPR, 2025
Project Page / Code / arXiv

We improved audio-visual learning by treating audio as a temporal sequence aligned with video frames instead of global representations, and by separating competing objectives with dedicated tokens. Our approach outperforms more complex architectures on retrieval, classification, and localization tasks across multiple datasets.

Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method
Washington Ramos, Michel Silva, Edson Araujo, Victor Moura, Keller Oliveira, Leandro Soriano Marcolino
TPAMI, 2023
Project Page / Code / arXiv

This paper introduces a novel weakly-supervised reinforcement learning method to accelerate instructional videos using text, addressing limitations of current summarization techniques. The proposed approach uses a new joint reward function and the Extended Visually-guided Document Attention Network (VDAN+) to effectively control video length while achieving state-of-the-art performance.

Straight to the Point: Fast-Forwarding Videos via Reinforcement Learning Using Textual Data
Washington Ramos, Michel Silva, Edson Araujo, Leandro Soriano Marcolino, Erickson Nascimento
CVPR, 2020
Project Page / Code / arXiv

This paper presents a novel reinforcement learning methodology to accelerate instructional videos by adaptively selecting and removing irrelevant frames, addressing visual gaps in traditional summarization. The approach utilizes a textually and visually oriented agent with a new Visually-guided Document Attention Network (VDAN) for creating shorter, coherent videos and achieves state-of-the-art F1 score and segment coverage.

Personalizing Fast-Forward Videos Based on Visual and Textual Features from Social Network
Washington Ramos, Michel Silva, Edson Araujo, Alan Neves, Erickson Nascimento
WACV, 2020
Project Page / arXiv

This paper introduces an approach for automatically creating personalized fast-forward videos from First-Person Videos (FPVs) by leveraging text-centric data from a user's social networks to infer their topics of interest. The method assigns scores to input frames based on user preferences and significantly outperforms competitors in F1 score, with its effectiveness demonstrated through extensive experiments and a user study.

On Modeling the Effects of Auditory Annoyance on Driving Style and Passenger Comfort
Edson Araujo, Michal Gregor, Isabella Huang, Erickson R. Nascimento, Ruzena Bajcsy
IROS, 2019
Paper

This study investigates the impact of acoustic annoyance on drivers in real-world scenarios, revealing significant differences in driving styles and presenting an online classifier that detects driver annoyance from inertial measurements with 77% accuracy. While the study empirically confirmed a passenger dynamics model using pressure-mat data, it could not establish a change in self-reported passenger comfort, as the acoustically induced driving styles were not sufficiently distinct.

Design and source code from Jon Barron's website