Reasoning models have recently achieved remarkable progress in text-based and other single-modality settings, yet transferring these capabilities to multimodal domains such as audio-visual understanding remains challenging, primarily because high-quality multimodal reasoning data is scarce.
We introduce AVRT (Audio-Visual Reasoning Transfer), a framework that addresses this data bottleneck by composing multimodal reasoning traces from single-modality teacher models. Our key insight is that cross-modal reasoning can emerge from integrating independently generated modality-specific chains of thought: a visual reasoning teacher and an audio reasoning teacher each produce modality-specific reasoning traces, which are then merged by a text-only LLM into unified audio-visual reasoning chains that explicitly correlate cross-modal evidence.
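The composition step described above can be sketched as follows. This is an illustrative mock-up, not the paper's implementation: the teacher and merger functions stand in for a visual reasoning model, an audio reasoning model, and a text-only LLM, and all names are hypothetical.

```python
# Hypothetical sketch of AVRT-style trace composition: two single-modality
# teachers each emit a modality-specific chain of thought, and a text-only
# merger fuses them into one audio-visual reasoning trace.

def visual_teacher(frame_caption: str) -> str:
    # Stand-in for a visual reasoning teacher's chain of thought.
    return f"Visual trace: the frames show {frame_caption}."

def audio_teacher(audio_caption: str) -> str:
    # Stand-in for an audio reasoning teacher's chain of thought.
    return f"Audio trace: the soundtrack contains {audio_caption}."

def merge_traces(visual_trace: str, audio_trace: str) -> str:
    # Stand-in for the text-only LLM that merges the two traces into a
    # unified chain that explicitly correlates cross-modal evidence.
    return (
        f"{visual_trace} {audio_trace} "
        "Correlating both: the sight and sound describe the same event."
    )

composed = merge_traces(
    visual_teacher("a dog running toward a door"),
    audio_teacher("barking followed by a knock"),
)
print(composed)
```

In an actual pipeline each function would be a model call; the point is only the data flow: two independent modality-specific traces in, one unified cross-modal trace out.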
These composed traces supervise a two-stage training pipeline: supervised fine-tuning (SFT) provides a reasoning cold start, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) on larger-scale data. Across seven benchmarks spanning audio-visual and audio-only tasks, our 3B-parameter model improves by an average of +7.8 points on audio-visual benchmarks and +6.3 points on audio benchmarks. Our analysis further shows that audio-visual training improves even single-modality reasoning, providing evidence of cross-modal reasoning transfer.
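The two-stage pipeline can be outlined in miniature. The sketch below is a simplification under stated assumptions: `sft_step` is a placeholder for gradient-based fine-tuning on composed traces, and `grpo_advantages` shows only the group-relative normalization at the core of GRPO (reward minus group mean, divided by group standard deviation over responses sampled for the same prompt); rewards and all names here are illustrative.

```python
# Hedged sketch of the two-stage pipeline: SFT cold start on composed
# audio-visual traces, then GRPO-style group-relative advantages.
from statistics import mean, pstdev

def sft_step(model_params: dict, trace: str) -> dict:
    # Placeholder for one supervised fine-tuning update on a composed trace.
    model_params = dict(model_params)
    model_params["sft_steps"] = model_params.get("sft_steps", 0) + 1
    return model_params

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Core GRPO idea: normalize each response's reward within its group,
    # so no learned value model is needed.
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Stage 1: SFT cold start on composed traces.
params: dict = {}
for trace in ["composed trace 1", "composed trace 2"]:
    params = sft_step(params, trace)

# Stage 2: group-relative advantages for sampled responses to one prompt.
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
print(params["sft_steps"], [round(a, 2) for a in advs])
```

In the real method the advantages would weight a policy-gradient update on the multimodal model; the sketch only makes the staging and the group-normalization explicit.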