Video generated by Mochi+STG

Spatiotemporal Skip Guidance
for Enhanced Video Diffusion Sampling

*Equal contribution
1KAIST AI   2University of Washington

Abstract

We introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include:

(1) Introducing STG as an efficient, high-performing guidance technique for video diffusion models
(2) Eliminating the need for auxiliary models by simulating a weak model through layer skipping
(3) Enhancing video quality without compromising sample diversity or dynamics, unlike CFG

Mochi

Stable-Video-Diffusion (SVD)

Open-Sora

STG Enhancement Details

Comparison between CFG and STG, with the band conceptually representing the noisy data manifold.

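Spatiotemporal Skip Guidance (STG) improves video quality by selectively skipping residual blocks or attention layers. Skipping yields a degraded prediction that stays aligned with the original model and serves as the weak model for guidance, so sample quality improves while frame diversity and dynamics are preserved.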

Residual Skip: bypasses an entire residual block, so the block's input passes through unchanged:

\[ \text{Res}(z_l) = z_{l+1} = f_l(z_l) + z_l, \quad \text{Res}'(z_l) = z_{l+1} = z_l. \]
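The residual skip above can be illustrated with a minimal NumPy sketch. The function name `run_blocks`, the `skip_layers` argument, and the toy linear blocks are hypothetical stand-ins chosen for illustration; they are not taken from the authors' code.

```python
import numpy as np

def run_blocks(z, blocks, skip_layers=frozenset()):
    """Apply residual blocks z_{l+1} = f_l(z_l) + z_l.

    Any layer index listed in skip_layers is bypassed,
    which corresponds to Res'(z_l) = z_l.
    """
    for l, f in enumerate(blocks):
        if l in skip_layers:
            continue  # skipped block: z_{l+1} = z_l
        z = f(z) + z
    return z

# Toy setup (assumption): each f_l is a small fixed linear map
# standing in for a spatiotemporal transformer block.
rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((4, 4)) for _ in range(3)]
blocks = [lambda z, W=W: z @ W for W in weights]

z0 = rng.standard_normal((2, 4))
full = run_blocks(z0, blocks)                      # original model
skipped = run_blocks(z0, blocks, skip_layers={1})  # degraded model
```

Because the skipped model shares every other layer with the original, its output stays close to the full output while being slightly degraded, which is the property the weak model needs.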

Attention Skip: skips the self-attention computation, which in its standard form is:

\[ \text{SA}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = AV. \]
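The following NumPy sketch shows standard self-attention next to one plausible skipped variant. The page does not write out the skipped form, so the sketch assumes the attention map A is replaced by the identity, which makes the output equal to V. That choice, along with the names `self_attention` and `skip`, is an assumption for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, skip=False):
    """SA(Q, K, V) = Softmax(Q K^T / sqrt(d)) V = A V.

    With skip=True the attention map is assumed to be the identity,
    so the layer returns V unchanged (an assumption, see lead-in).
    """
    if skip:
        return V.copy()
    d = Q.shape[-1]
    A = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d))
    return A @ V

# Toy tensors: 5 tokens with 8 channels each.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))

out_full = self_attention(Q, K, V)             # tokens mix through A
out_skip = self_attention(Q, K, V, skip=True)  # no token mixing
```

Under this assumption each token keeps its own value vector, so the layer stops exchanging information across space and time while the rest of the network is untouched.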

By introducing these simple yet effective modifications, STG avoids the need for additional training or auxiliary models, achieving high-quality video synthesis efficiently.
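How the degraded prediction enters the sampler is not written out on this page. A minimal sketch, assuming a CFG-style extrapolation away from the weak prediction, is given below. The update rule, the `scale` parameter, and the toy model are assumptions for illustration and may differ from the formulation in the paper.

```python
import numpy as np

# Toy denoiser (assumption): a stack of linear residual blocks.
_rng = np.random.default_rng(0)
WEIGHTS = [0.1 * _rng.standard_normal((4, 4)) for _ in range(3)]

def toy_model(x, skip_layers=frozenset()):
    """Stand-in for a video diffusion transformer with skippable blocks."""
    z = x
    for l, W in enumerate(WEIGHTS):
        if l in skip_layers:
            continue  # skipped block passes z through unchanged
        z = z + z @ W
    return z

def guided_prediction(model, x, scale, skip_layers):
    """Push the full prediction away from the layer-skipped one.

    Assumed form: eps = eps_full + scale * (eps_full - eps_weak).
    """
    eps_full = model(x, skip_layers=frozenset())
    eps_weak = model(x, skip_layers=frozenset(skip_layers))
    return eps_full + scale * (eps_full - eps_weak)

x = _rng.standard_normal((2, 4))
eps = guided_prediction(toy_model, x, scale=1.0, skip_layers={1})
```

Both predictions come from the same network and the same input, so no auxiliary model or extra training is involved; the cost is one additional forward pass per sampling step.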

Evaluation

Quantitative Results

| Models | Imaging Quality | Aesthetic Quality | Motion Smoothness | Dynamic Degree | Temporal Flickering |
| --- | --- | --- | --- | --- | --- |
| Mochi (CFG) | 0.524 | 0.507 | 0.985 | 0.87 | 0.976 |
| Mochi (STG) | 0.628 | 0.554 | 0.988 | 0.86 | 0.978 |
| Open-Sora (CFG) | 0.561 | 0.493 | 0.982 | 0.902 | 0.975 |
| Open-Sora (STG) | 0.606 | 0.509 | 0.987 | 0.895 | 0.976 |

Table 1. Quantitative results for Mochi and Open-Sora on VBench T2V benchmarks.

| Models | FVD (↓) | IS | Imaging Quality | Aesthetic Quality | Motion Smoothness | Dynamic Degree |
| --- | --- | --- | --- | --- | --- | --- |
| SVD (CFG) | 151.3 | 38.0 | 0.687 | 0.637 | 0.966 | 0.562 |
| SVD (STG) | 128.7 | 38.5 | 0.694 | 0.639 | 0.968 | 0.694 |

Table 2. Quantitative results for SVD on FVD, IS, and VBench I2V benchmarks.

Comparison of CFG and STG across varying scales in terms of Imaging Quality and FVD.

User Study

User study results for STG on SVD and Mochi, using 700 prompts from EvalCrafter.

BibTeX


@misc{hyung2024spatiotemporalskipguidanceenhanced,
      title={Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling}, 
      author={Junha Hyung and Kinam Kim and Susung Hong and Min-Jung Kim and Jaegul Choo},
      year={2024},
      eprint={2411.18664},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.18664}, 
}