FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

Abstract

Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to adapt an existing short video diffusion model (e.g., one pre-trained on 16-frame videos) for consistent long video generation (e.g., 128 frames). Preliminary observations indicate that when generating long videos, the short video model's temporal attention mechanism preserves temporal consistency but significantly degrades the fidelity and spatial-temporal details of the videos. Our further investigation reveals that this limitation stems mainly from the distortion of high-frequency components in the generated long videos. Motivated by this finding, we propose a straightforward yet effective solution: local-global SpectralBlend Temporal Attention (SB-TA). This approach smooths the frequency distribution of long video features during the denoising process by blending the low-frequency components of global video features with the high-frequency components of local video features, enhancing both the consistency and fidelity of long video generation. Based on SB-TA, we develop a new training-free model named FreeLong, which sets a new performance benchmark compared to existing long video generation models. We evaluate FreeLong on multiple base video diffusion models and observe significant improvements. Additionally, our method supports coherent multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes.

FreeLong Framework

We propose FreeLong for consistent, high-fidelity long video generation. The core of our method is SpectralBlend Temporal Attention (SB-TA), which blends the low-frequency components of the global video feature with the high-frequency components of the local video feature. The local video feature is obtained with masked temporal attention, in which each frame attends only to its neighboring (local) frames, whereas the global temporal attention attends to all frames in the long video. We project the local and global features into the frequency domain with a 3D FFT, combine the low-frequency global components with the high-frequency local components, and transform the result back to the time domain with an inverse FFT. The blended feature is then passed to the next block, as sketched below.
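The following is a minimal PyTorch sketch of the frequency-blending step described above, assuming local and global temporal attention outputs are already computed. The tensor layout, the box-shaped low-pass filter, and the cutoff ratio are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.fft as fft


def spectral_blend(local_feat: torch.Tensor,
                   global_feat: torch.Tensor,
                   cutoff: float = 0.25) -> torch.Tensor:
    """Blend low-frequency global features with high-frequency local features.

    Args:
        local_feat:  (B, T, H, W, C) output of locally masked temporal attention.
        global_feat: (B, T, H, W, C) output of global temporal attention.
        cutoff: normalized extent of the low-pass region (assumed hyperparameter).
    """
    B, T, H, W, C = global_feat.shape

    # 3D FFT over the temporal and spatial dimensions, shifted so that the
    # zero-frequency component sits at the center of the spectrum.
    local_freq = fft.fftshift(fft.fftn(local_feat, dim=(1, 2, 3)), dim=(1, 2, 3))
    global_freq = fft.fftshift(fft.fftn(global_feat, dim=(1, 2, 3)), dim=(1, 2, 3))

    # Box-shaped low-pass mask centered at zero frequency (the exact filter
    # shape is an assumption for illustration).
    mask = torch.zeros(1, T, H, W, 1, device=global_feat.device)
    ct, ch, cw = T // 2, H // 2, W // 2
    rt = max(1, int(T * cutoff))
    rh = max(1, int(H * cutoff))
    rw = max(1, int(W * cutoff))
    mask[:, ct - rt:ct + rt, ch - rh:ch + rh, cw - rw:cw + rw, :] = 1.0

    # Keep low frequencies from the global branch and high frequencies
    # from the local branch, then sum the two components.
    blended_freq = global_freq * mask + local_freq * (1.0 - mask)

    # Inverse FFT back to the time domain; the real part is the refined
    # feature passed to the next block.
    blended = fft.ifftn(fft.ifftshift(blended_freq, dim=(1, 2, 3)), dim=(1, 2, 3))
    return blended.real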

Long videos generated before and after FreeLong.

Comparison with Previous Methods.

Ablation Study.

Multi-Prompt Videos.

Longer Videos.