AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

1University of Tübingen, 2MIT, 3IBM Research, 4MIT-IBM Watson AI Lab, 5Tübingen AI Center

Abstract

Recent reasoning models have made remarkable progress in text-based and single-modality settings, yet transferring these capabilities to multimodal domains such as audio-visual understanding remains challenging, primarily due to the scarcity of high-quality multimodal reasoning data.

We introduce AVRT (Audio-Visual Reasoning Transfer), a framework that addresses this data bottleneck by composing multimodal reasoning traces from single-modality teacher models. Our key insight is that cross-modal reasoning can emerge from integrating independently generated modality-specific chains of thought: a visual reasoning teacher and an audio reasoning teacher each produce modality-specific reasoning traces, which are then merged by a text-only LLM into unified audio-visual reasoning chains that explicitly correlate cross-modal evidence.

These composed traces serve as supervision for a two-stage training pipeline: supervised fine-tuning (SFT) provides a reasoning cold start, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) on larger-scale data. Across seven benchmarks spanning audio-visual and audio-only tasks, our 3B-parameter model achieves an average improvement of +7.8 points on audio-visual benchmarks and +6.3 points on audio benchmarks. Our analysis further reveals that audio-visual training improves even single-modality reasoning, providing evidence of cross-modal reasoning transfer.
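The GRPO stage scores groups of sampled completions and standardizes each reward against the group, so no learned value network is needed. A minimal sketch of that advantage computation (the standard GRPO formulation, not the authors' exact implementation):

```python
def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each sampled completion's
    reward against its group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:  # all completions scored identically -> no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

For example, with verifiable answer-correctness rewards `[1, 0, 1, 0]`, the two correct completions receive advantage +1.0 and the incorrect ones -1.0.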

Method Overview

AVRT pipeline overview

Overview of the AVRT pipeline: We first generate reasoning chains from single-modality teacher models prompted in the format they were optimized for, then leverage an LLM merger to aggregate the information into cross-modal reasoning traces. The resulting audio-visual traces are used to train a student model via supervised fine-tuning (SFT) followed by GRPO fine-tuning.
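The merging step above can be sketched as building a single text-only prompt from the two teacher traces. The function name and prompt wording below are hypothetical illustrations of the merger input, not the paper's actual prompts:

```python
def build_merge_prompt(question: str, visual_trace: str, audio_trace: str) -> str:
    """Assemble the input for the text-only LLM merger from two independently
    generated single-modality reasoning traces.
    (Hypothetical prompt wording, for illustration only.)"""
    return (
        "You are given two independent chains of thought about the same video clip.\n"
        f"Question: {question}\n\n"
        f"[Visual reasoning]\n{visual_trace}\n\n"
        f"[Audio reasoning]\n{audio_trace}\n\n"
        "Merge them into a single audio-visual reasoning chain that explicitly "
        "correlates evidence across the two modalities, then state the final answer."
    )
```

In the full pipeline, the merger LLM's response to such a prompt would become the unified audio-visual trace used as an SFT target for the student model.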

Results

Comparison to State-of-the-Art

| Model | AVQA | DailyOmni | OmniBench | Video-MME | Riva-S | Riva-A | MMAR | MMAU |
|---|---|---|---|---|---|---|---|---|
| **3B Audio-Visual Models** | | | | | | | | |
| AVATAR | - | 44.7 | 45.8 | - | - | - | - | - |
| Qwen2.5 Omni* | 88.3 | 43.1 | 50.2 | 55.4 | 62.4 | 38.4 | 53.7 | 61.1 |
| AVRT 3B (Ours) | 91.1 (+2.8) | 49.2 (+6.1) | 56.3 (+6.1) | 62.6 (+7.2) | 71.3 (+8.9) | 49.3 (+10.9) | 57.3 (+3.6) | 70.0 (+8.9) |
| **7B Audio-Visual Models** | | | | | | | | |
| EchoInk | - | 46.2 | 46.5 | - | - | - | - | - |
| Omni-R1 | - | 46.8 | 46.9 | 60.7 | - | - | - | - |
| HumanOmni | - | 47.6 | 44.9 | - | - | - | - | - |
| Ola-7B | - | 52.3 | 45.3 | 68.4 | - | - | - | - |
| AV-Reasoner | - | 53.8 | 48.3 | - | - | - | - | - |
| AVATAR | - | 47.0 | 49.1 | - | - | - | - | - |
| V-RTS* | 86.6 | 47.8 | 35.3 | 63.0 | 70.3 | 48.9 | N/A | N/A |
| v-SALMONN-o1* | 84.8 | 64.0 | 40.5 | 41.3 | 76.7 | 48.3 | 51.2 | 53.8 |
| Qwen2.5 Omni* | 84.9 | 51.5 | 50.7 | 58.1 | 74.9 | 41.3 | 56.5 | 71.9 |
| AVRT 7B (Ours) | 90.4 (+5.5) | 54.4 (+2.9) | 57.1 (+6.4) | 64.1 (+6.0) | 75.3 (+0.4) | 50.7 (+9.4) | 59.1 (+2.6) | 75.4 (+3.5) |

Comparison across seven benchmarks: AVQA through Riva-A are audio-visual; MMAR and MMAU are audio-only. AVQA is in-domain (our training data is derived from its training set); all other benchmarks are zero-shot. Baselines reproduced by us are marked with *. AVRT 3B achieves an average improvement of +7.8 on audio-visual benchmarks and +6.3 on audio benchmarks over the Qwen2.5-Omni baseline.
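The headline averages can be reproduced from the 3B per-benchmark gains in the table. The arithmetic suggests the audio-visual average is computed over the zero-shot audio-visual benchmarks only, excluding in-domain AVQA; that exclusion is an inference from the numbers, not an explicit statement by the authors:

```python
# 3B gains over the Qwen2.5-Omni baseline, copied from the table above.
av_gains = [6.1, 6.1, 7.2, 8.9, 10.9]  # DailyOmni, OmniBench, Video-MME, Riva-S, Riva-A
audio_gains = [3.6, 8.9]               # MMAR, MMAU

av_avg = sum(av_gains) / len(av_gains)           # 7.84 -> reported as +7.8
audio_avg = sum(audio_gains) / len(audio_gains)  # 6.25 -> reported as +6.3
```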

Qualitative Results

Qualitative results showing audio-visual reasoning traces

Qualitative results of the AVRT-trained model on OmniBench. The model retrieves the audio and visual information relevant to the question, integrates the two sources, and produces high-quality reasoning chains grounded in cues from both modalities.

BibTeX

@article{araujo2025avrt,
  title={AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers},
  author={Araujo, Edson and Bhati, Saurabhchand and Mirza, M Jehanzeb and Kingsbury, Brian and Thomas, Samuel and Feris, Rogerio and Glass, James R and Kuehne, Hilde},
  year={2025}
}