Motivation:
Despite advances, video diffusion transformers still struggle to generalize beyond their training length.
We identify two failure modes: model-specific periodic content repetition and a universal quality degradation.
Prior works attempt to mitigate repetition by modifying positional encodings, but they overlook quality degradation and achieve only limited extrapolation.
In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs.
Analysis:
We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute the learned attention patterns. Dispersion alone produces quality degradation; repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by the harmonic properties of positional encodings.
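As a toy numeric illustration of dispersion (an assumption-laden sketch, not the paper's analysis): with uniform attention logits, softmax splits its mass evenly across all keys, so the attention mass landing inside the training window shrinks as extra tokens enter the context.

```python
import numpy as np

def attention_mass_in_window(n_ctx, train_len):
    """Fraction of attention mass landing inside the training window
    for a single query, in the toy case of uniform (all-zero) logits.
    `n_ctx` and `train_len` are illustrative parameters, not the
    paper's notation."""
    logits = np.zeros(n_ctx)                     # neutral logits: worst case
    w = np.exp(logits) / np.exp(logits).sum()    # softmax -> uniform 1/n_ctx
    return w[:train_len].sum()

# Doubling the context past the training window halves in-window mass:
# attention_mass_in_window(128, 128) -> 1.0
# attention_mass_in_window(256, 128) -> 0.5
```

Real attention logits are not uniform, but the same dilution pressure applies: every out-of-window key claims a share of the softmax budget that the model never learned to allocate.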
Method:
Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention to tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, UltraViCo outperforms a broad set of baselines across models and extrapolation ratios, pushing the extrapolation limit from 2× to 4×.
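The idea of a constant decay on out-of-window attention can be sketched as follows. This is a minimal single-head illustration, not UltraViCo's exact recipe: the decay value and its placement (as a log-space bias before softmax, which is equivalent to scaling the post-softmax weight of those keys by `decay` and renormalizing) are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decayed_attention(q, k, v, train_len, decay=0.5):
    """Single-head attention that suppresses keys beyond the training
    window by a constant factor (hypothetical sketch). Adding
    log(decay) to an out-of-window key's logit multiplies its
    unnormalized softmax weight by `decay`."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # (Tq, Tk)
    bias = np.where(np.arange(k.shape[0]) >= train_len,
                    np.log(decay), 0.0)                # penalize OOW keys
    w = softmax(logits + bias, axis=-1)                # rows still sum to 1
    return w @ v, w
```

With `decay=1.0` this reduces to standard attention; with `decay < 1.0` softmax mass shifts back toward in-window tokens, which is the dispersion-suppressing effect the method relies on.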