I'm a PhD student at the University of Tübingen, working with Prof. Hilde Kuehne and co-advised by Dr. Jim Glass (MIT CSAIL). Our work is part of the MIT-IBM Watson AI Sight and Sound Project, where I focus on audio-visual reasoning, multimodal large language models, and test-time adaptation.
I did my Master's in Computer Science at UFMG under the supervision of Prof. Erickson Nascimento, during which I collaborated on research topics such as video summarization and image descriptors.
Email: [first_name]@[last_name].info
05.2026 Accepted to the CVPR 2026 Doctoral Consortium!
04.2026 Recognized as a 'Top 200' reviewer at ICLR 2026.
12.2025 We are organizing the fifth edition of the "What is Next in Multimodal Foundation Models?" Workshop (CVPR 2026).
08.2025 Omni-R1 was accepted to ASRU 2025! (shortlisted for Best Student Paper!)
05.2025 Omni-R1, our latest work from the MIT-IBM Watson AI Sight and Sound Project, is out on arXiv!
05.2025 CAV-MAE Sync will also be presented at the LatinX, MMFM, and Sight and Sound Workshops at CVPR 2025!
02.2025 CAV-MAE Sync was accepted to CVPR 2025 as a poster presentation. The paper is on arXiv.
10.2023 I started my PhD program under the supervision of Prof. Hilde Kuehne to work on multimodal learning.
05.2023 I defended my Master's thesis, "An Audiovisual Approach for Video Summarization Using Psychoacoustic Features".
I'm interested in audio-visual reasoning, multimodal large language models, self-supervised learning, and test-time adaptation. Some papers are highlighted.
Composes multimodal reasoning traces from single-modality teachers via LLM merging. Achieves +7.8 avg. improvement on audio-visual benchmarks and +6.3 on audio benchmarks.
Adapts video-language models at test time using step-by-step frame reasoning and multi-armed bandit frame selection. No labels required.
Fine-tunes Qwen2.5-Omni using GRPO, achieving SOTA on MMAU. Surprisingly, text-only fine-tuning also improves audio performance.
Fine-grained audio-visual alignment using temporal sequences instead of global representations. Outperforms complex architectures on retrieval, classification, and localization.
Weakly-supervised RL method for text-driven video acceleration using the VDAN+ architecture.
RL-based video acceleration using textual guidance and the VDAN architecture. SOTA on F1 and segment coverage.
Personalized FPV fast-forwarding using social network data to infer user interests.
Detects driver annoyance from inertial measurements with 77% accuracy. Studies acoustic impact on driving style.