TTA-Vid adapts vision-language models at test time by sampling multiple frame subsets, enforcing majority-consistency among generated answers, and updating a frame-importance distribution via a multi-armed bandit.
Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains.
In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) performs step-by-step reasoning at inference time on multiple frame subsets. We then use a batch-aware frequency-based reward, computed across the different frame subsets, as pseudo ground truth to update the model.
We show that the resulting model, adapted on a single batch or even a single sample from a dataset, generalizes at test time to the whole dataset and even across datasets. Additionally, we propose a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation.
TTA-Vid performs test-time adaptation of model parameters through a batch-aware reinforcement learning objective and adaptively selects the most informative frames using a multi-armed bandit approach. Both components leverage a shared reward signal computed across multiple video frame subsets, enabling the model to jointly learn what to predict and which frames to attend to.
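The frequency-based reward described above can be sketched in a few lines: answers generated from different frame subsets vote, and each answer is rewarded by how often it appears, so the majority answer serves as pseudo ground truth. This is a minimal illustration under our own naming (`frequency_reward` is not from the paper), not the exact implementation:

```python
from collections import Counter

def frequency_reward(answers):
    """Batch-aware frequency reward: each answer generated from a frame
    subset is rewarded in proportion to how often it appears across all
    subsets, so the majority-consistent answer acts as an unsupervised
    pseudo label (no ground-truth annotation needed)."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

# Toy example: answers produced from 5 frame subsets for one question.
rewards = frequency_reward(["B", "B", "A", "B", "C"])
# The majority answer "B" receives the highest reward.
```

The same scalar reward can then drive both the reinforcement-learning update of the model parameters and the frame-importance distribution.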
| Model | #params | #frames | VideoMMMU | MMVU | SciVideoBench | VideoMME | LongVideoBench | Avg. |
|---|---|---|---|---|---|---|---|---|
| Proprietary Models | ||||||||
| GPT-4o | - | - | 61.22 | 75.40 | 24.90 | 71.90 | 66.70 | 60.02 |
| Gemini 1.5 Flash | - | - | 49.78 | 58.80 | - | 70.30 | 61.60 | - |
| Non-Reasoning Models | ||||||||
| LLaVA-OneVision | 7B | 64 | 33.89 | 49.20 | 18.80 | 58.20 | 56.30 | 43.27 |
| Qwen-2.5-VL | 7B | ≤768 | 47.44 | 59.20 | 16.40 | 56.00 | 65.10 | 48.82 |
| InternVL-3 | 8B | 32 | 49.33 | 60.80 | 26.50 | 61.18 | 59.08 | 51.37 |
| Reasoning Models | ||||||||
| Video-RTS | 7B | 128 | 52.70 | 66.40 | - | 63.00 | 56.60 | - |
| Video-R1 | 7B | 128 | 47.00 | 64.00 | 26.80 | 64.30 | 57.60 | 51.94 |
| VideoChat-R1 | 7B | 128 | 52.00 | 64.80 | 26.50 | 64.10 | 54.30 | 52.34 |
| VideoChat-R1.5 | 7B | 128 | 50.00 | 67.00 | 25.80 | 64.80 | 53.60 | 52.24 |
| Video-RFT | 7B | 128 | 48.10 | 66.70 | 25.70 | 64.10 | 57.00 | 52.32 |
| Video-R2 | 7B | 128 | 50.80 | 67.40 | 28.40 | 63.80 | 59.20 | 53.92 |
| Test-Time Adaptation (Ours) | ||||||||
| TTRV* (w/ Qwen2.5-VL) | 7B | 32 | 45.89 | 64.48 | 25.50 | 59.26 | 57.07 | 50.46 |
| TTA-Vid (w/ Qwen2.5-VL) | 7B | 32 | 49.44 | 65.12 | 25.70 | 61.14 | 57.81 | 51.84 |
| TTA-Vid (w/ InternVL-3) | 8B | 32 | 53.66 | 65.60 | 29.80 | 65.11 | 61.48 | 55.13 |
TTA-Vid outperforms video reasoning models trained on large-scale supervised data (e.g., Video-RFT with 100K+ CoT samples), using only 32 unlabeled samples at test time.
Qualitative comparison of frame selection strategies. Random sampling (left) vs. our learned selection (right) on two VideoMMMU examples. Our adaptive frame selection learns to prioritize frames containing task-relevant information, leading to correct answers where random sampling fails.
Frame-selection distributions shown for two initializations: uniform and CLIP-based. The multi-armed bandit learns to concentrate sampling probability on informative frames; over training, the distribution shifts from the uniform or CLIP-based prior toward task-relevant frame selection.
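A simplified sketch of this bandit-style frame selection follows. Frames are sampled from a softmax over per-frame weights, and frames belonging to subsets whose answer matched the majority get their weights boosted. The function names and the concrete update rule are our illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def frame_probs(weights):
    """Softmax over per-frame weights -> frame-importance distribution."""
    z = np.exp(weights - weights.max())
    return z / z.sum()

def sample_subset(weights, k, rng):
    """Sample k distinct frames according to the current distribution."""
    p = frame_probs(weights)
    return rng.choice(len(weights), size=k, replace=False, p=p)

def bandit_update(weights, subset, reward, lr=0.1):
    """Boost frames whose subset yielded a majority-consistent answer
    (a simplified additive update; the paper's rule may differ)."""
    weights = weights.copy()
    weights[subset] += lr * reward
    return weights

# Toy demo: pretend frame 2 is the only informative frame, so any subset
# containing it produces the majority answer and earns reward 1.
rng = np.random.default_rng(0)
w = np.zeros(8)  # uniform init; CLIP similarity scores could seed this instead
for _ in range(200):
    s = sample_subset(w, 3, rng)
    w = bandit_update(w, s, 1.0 if 2 in s else 0.0)
# Sampling probability concentrates on the informative frame over time.
```

Because the reward fires only when the informative frame is in the subset, its weight grows at least as fast as any other frame's, and the distribution sharpens around it.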
@article{jahagirdar2026tta,
title={TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning},
author={Jahagirdar, Soumya Shamarao and Araujo, Edson and Kukleva, Anna and Mirza, M Jehanzeb and Bhati, Saurabhchand and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio and Glass, James R and Kuehne, Hilde},
journal={arXiv preprint arXiv:2604.00696},
year={2026}
}