AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

1University of Tübingen, 2MIT, 3IBM Research, 4MIT-IBM Watson AI Lab, 5Tübingen AI Center

Abstract

Recent reasoning models have made remarkable progress in text-based and single-modality settings, yet transferring these capabilities to multimodal domains such as audio-visual understanding remains challenging, primarily due to the scarcity of high-quality multimodal reasoning data.

We introduce AVRT (Audio-Visual Reasoning Transfer), a framework that addresses this data bottleneck by composing multimodal reasoning traces from single-modality teacher models. Our key insight is that cross-modal reasoning can emerge from integrating independently generated modality-specific chains of thought: a visual reasoning teacher and an audio reasoning teacher each produce modality-specific reasoning traces, which are then merged by a text-only LLM into unified audio-visual reasoning chains that explicitly correlate cross-modal evidence.

These composed traces serve as supervision for a two-stage training pipeline: supervised fine-tuning (SFT) provides a reasoning cold start, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) on larger-scale data. Across seven benchmarks spanning audio-visual and audio-only tasks, our 3B-parameter model achieves an average improvement of +7.8 points on audio-visual benchmarks and +6.3 points on audio benchmarks. Our analysis further reveals that audio-visual training improves even single-modality reasoning, providing evidence of cross-modal reasoning transfer.
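The GRPO stage scores groups of sampled completions and standardizes each reward against the group, so no learned value network is needed. A minimal sketch of that advantage computation (the standard GRPO formulation, not the authors' exact implementation):

```python
def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each sampled completion's
    reward against its group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:  # all completions scored identically -> no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

For example, with verifiable answer-correctness rewards `[1, 0, 1, 0]`, the two correct completions receive advantage +1.0 and the incorrect ones -1.0.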

Method Overview

AVRT pipeline overview

Overview of the AVRT pipeline: We first generate reasoning chains from single-modality teacher models prompted in the format they were optimized for, then leverage an LLM merger to aggregate the information into cross-modal reasoning traces. The resulting audio-visual traces are used to train a student model via supervised fine-tuning (SFT) followed by GRPO fine-tuning.
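The merging step above can be sketched as building a single text-only prompt from the two teacher traces. The function name and prompt wording below are hypothetical illustrations of the merger input, not the paper's actual prompts:

```python
def build_merge_prompt(question: str, visual_trace: str, audio_trace: str) -> str:
    """Assemble the input for the text-only LLM merger from two independently
    generated single-modality reasoning traces.
    (Hypothetical prompt wording, for illustration only.)"""
    return (
        "You are given two independent chains of thought about the same video clip.\n"
        f"Question: {question}\n\n"
        f"[Visual reasoning]\n{visual_trace}\n\n"
        f"[Audio reasoning]\n{audio_trace}\n\n"
        "Merge them into a single audio-visual reasoning chain that explicitly "
        "correlates evidence across the two modalities, then state the final answer."
    )
```

In the full pipeline, the merger LLM's response to such a prompt would become the unified audio-visual trace used as an SFT target for the student model.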

Results

Comparison to State-of-the-Art

| Model | AVQA | DailyOmni | OmniBench | Video-MME | Riva-S | Riva-A | MMAR | MMAU |
|---|---|---|---|---|---|---|---|---|
| **3B Audio-Visual Models** | | | | | | | | |
| AVATAR | - | 44.7 | 45.8 | - | - | - | - | - |
| Qwen2.5 Omni* | 88.3 | 43.1 | 50.2 | 55.4 | 62.4 | 38.4 | 53.7 | 61.1 |
| AVRT 3B (Ours) | 91.1 (+2.8) | 49.2 (+6.1) | 56.3 (+6.1) | 62.6 (+7.2) | 71.3 (+8.9) | 49.3 (+10.9) | 57.3 (+3.6) | 70.0 (+8.9) |
| **7B Audio-Visual Models** | | | | | | | | |
| EchoInk | - | 46.2 | 46.5 | - | - | - | - | - |
| Omni-R1 | - | 46.8 | 46.9 | 60.7 | - | - | - | - |
| HumanOmni | - | 47.6 | 44.9 | - | - | - | - | - |
| Ola-7B | - | 52.3 | 45.3 | 68.4 | - | - | - | - |
| AV-Reasoner | - | 53.8 | 48.3 | - | - | - | - | - |
| AVATAR | - | 47.0 | 49.1 | - | - | - | - | - |
| V-RTS* | 86.6 | 47.8 | 35.3 | 63.0 | 70.3 | 48.9 | N/A | N/A |
| v-SALMONN-o1* | 84.8 | 64.0 | 40.5 | 41.3 | 76.7 | 48.3 | 51.2 | 53.8 |
| Qwen2.5 Omni* | 84.9 | 51.5 | 50.7 | 58.1 | 74.9 | 41.3 | 56.5 | 71.9 |
| AVRT 7B (Ours) | 90.4 (+5.5) | 54.4 (+2.9) | 57.1 (+6.4) | 64.1 (+6.0) | 75.3 (+0.4) | 50.7 (+9.4) | 59.1 (+2.6) | 75.4 (+3.5) |

Comparison across seven benchmarks: AVQA through Riva-A are audio-visual; MMAR and MMAU are audio-only. AVQA is in-domain (our training data is derived from its training set); all other benchmarks are zero-shot. Baselines reproduced by us are marked with *. AVRT 3B achieves an average improvement of +7.8 on audio-visual benchmarks and +6.3 on audio benchmarks over the Qwen2.5-Omni baseline.
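The headline averages can be reproduced from the 3B per-benchmark gains in the table. The arithmetic suggests the audio-visual average is computed over the zero-shot audio-visual benchmarks only, excluding in-domain AVQA; that exclusion is an inference from the numbers, not an explicit statement by the authors:

```python
# 3B gains over the Qwen2.5-Omni baseline, copied from the table above.
av_gains = [6.1, 6.1, 7.2, 8.9, 10.9]  # DailyOmni, OmniBench, Video-MME, Riva-S, Riva-A
audio_gains = [3.6, 8.9]               # MMAR, MMAU

av_avg = sum(av_gains) / len(av_gains)           # 7.84 -> reported as +7.8
audio_avg = sum(audio_gains) / len(audio_gains)  # 6.25 -> reported as +6.3
```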

Qualitative Results

Qualitative results showing audio-visual reasoning traces

Qualitative results of the AVRT-trained model on OmniBench. The model retrieves the audio and visual information relevant to the question, integrates the two sources, and produces high-quality reasoning chains grounded in cues from both modalities.

BibTeX

@article{araujo2025avrt,
  title={AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers},
  author={Araujo, Edson and Bhati, Saurabhchand and Mirza, M Jehanzeb and Kingsbury, Brian and Thomas, Samuel and Feris, Rogerio and Glass, James R and Kuehne, Hilde},
  year={2025}
}