✨UltraImage: Rethinking Resolution Extrapolation in
Image Diffusion Transformers

Min Zhao1 Bokai Yan2 Yang Xue3 Hongzhou Zhu1 Jintao Zhang1

Shilong Liu4 Chongxuan Li2 Jun Zhu1†

1 Tsinghua University    2 Renmin University of China
3 Beijing Institute of Technology    4 Princeton University

[Paper]      [Code]


Direct resolution extrapolation at 4K

Guided resolution extrapolation
[Gallery: Guidance (1024×1024) vs. UltraImage (3600×3600)]

Guided view extrapolation
[Gallery: Guidance (1024×1024) vs. UltraImage (3600×3600)]

Motivation

Recent image diffusion transformers can generate high-fidelity images, but their ability to extrapolate beyond the training resolution is fundamentally limited. When scaled to higher resolutions, models often exhibit structure-level repetition and quality degradation, making them unable to synthesize coherent ultra-resolution images. These failure modes indicate gaps in our understanding of how positional embeddings and attention behave under extreme spatial extrapolation.

Analysis

Through a frequency-wise analysis of positional embeddings, we identify that content repetition arises from the periodicity of the dominant frequency, whose natural period coincides with the training resolution. Extrapolating beyond this range forces the model outside the original frequency cycle, inevitably producing repeated structures. In parallel, we find that quality degradation emerges from diluted attention at large token counts: local details become blurry due to overly diffuse attention, while global patterns lose structural consistency when attention becomes excessively concentrated.
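The periodicity argument can be illustrated numerically. The sketch below uses standard RoPE frequencies (base 10000) with hypothetical dimensions — not the paper's exact configuration — to show that each rotary component has a fixed spatial period, and that positions beyond the training extent alias back into phases already covered during training, which is what makes repeated structures possible.

```python
import numpy as np

# Hypothetical settings for illustration (not the paper's exact values):
head_dim = 32        # rotary channels per spatial axis (assumption)
base = 10000.0       # standard RoPE base
train_len = 1024     # training-resolution token extent (assumption)

# Per-pair rotary frequencies and their spatial periods (in tokens).
freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)
periods = 2 * np.pi / freqs

# Components whose period is shorter than the training extent complete at
# least one full cycle within training data; extrapolating past train_len
# revisits phases the model has already seen.
completes_cycle = periods < train_len
print(f"period range: {periods.min():.1f} .. {periods.max():.1f} tokens")
print(f"{completes_cycle.sum()}/{len(periods)} components cycle within {train_len} tokens")

# An extrapolated position is phase-equivalent to an aliased in-range one:
pos = 3600                          # target extent (from the page's examples)
aliased = pos % periods             # where each component's phase "restarts"
same_phase = np.allclose(np.cos(freqs * pos), np.cos(freqs * aliased))
print("extrapolated phases alias into the training cycle:", same_phase)
```

This is only a sketch of the frequency-wise view: components whose natural period roughly matches the training resolution are the ones whose phase wraps first under extrapolation, producing structure-level repetition.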

Method

Building on these findings, we propose UltraImage, a principled framework for ultra-resolution extrapolation without additional training data.

Together, these components enable UltraImage to robustly extend image resolution far beyond the training scale, achieving consistent improvements over prior approaches and supporting generation up to 6K × 6K from only 1328p training data.

BibTeX

If you find this work helpful, please cite the following paper:

  @article{zhao2025UltraImage,
    title={UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers},
    author={Zhao, Min and Yan, Bokai and Xue, Yang and Zhu, Hongzhou and Zhang, Jintao and Liu, Shilong and Li, Chongxuan and Zhu, Jun},
    journal={},
    year={}
  }
  

Project page template is borrowed from HiFlow.