Video generated by Mochi+STG

Spatiotemporal Skip Guidance
for Enhanced Video Diffusion Sampling

Junha Hyung*¹, Kinam Kim*¹, Susung Hong², Min-Jung Kim¹, Jaegul Choo¹

*Equal contribution

¹KAIST AI ²University of Washington

CVPR 2025

Abstract

We introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training. By selectively skipping spatiotemporal layers, STG produces an aligned, degraded version of the original model to boost sample quality without compromising diversity or dynamic degree. Our contributions include:

(1) Introducing STG as an efficient, high-performing guidance technique for video diffusion models
(2) Eliminating the need for auxiliary models by simulating a weak model through layer skipping
(3) Enhancing video quality without compromising sample diversity or dynamics, unlike CFG

Mochi

Stable-Video-Diffusion (SVD)

Open-Sora

STG Enhancement Details

Comparison between CFG and STG, with the band conceptually representing the noisy data manifold.

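Spatiotemporal Skip Guidance (STG) improves video quality by selectively skipping residual blocks or attention layers during sampling. The pass with skipped layers serves as an aligned, degraded version of the original model, and using it as the weak model for guidance boosts sample quality while preserving frame diversity and dynamics.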

Residual Skip: Skips entire residual blocks, so a skipped layer passes its input through unchanged:

\[ \text{Res}(z_l) = z_{l+1} = f_l(z_l) + z_l, \quad \text{Res}'(z_l) = z_{l+1} = z_l. \]
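
The residual-skip rule above can be illustrated with a small sketch. It is a toy model, not the authors' implementation: `residual_forward`, the `skip` argument, and the two lambda blocks are hypothetical names chosen for illustration, and hidden states are plain Python lists rather than video latents.

```python
from typing import Callable, FrozenSet, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def residual_forward(
    z: Vector,
    blocks: Sequence[Block],
    skip: FrozenSet[int] = frozenset(),
) -> Vector:
    """Run a stack of residual blocks, bypassing the indices listed in `skip`."""
    for layer_index, f in enumerate(blocks):
        if layer_index in skip:
            # Res'(z_l): the block is skipped, so z_{l+1} = z_l.
            continue
        # Res(z_l): z_{l+1} = f_l(z_l) + z_l.
        z = [f_val + z_val for f_val, z_val in zip(f(z), z)]
    return z


# Two toy blocks standing in for transformer layers (assumed, for illustration).
blocks = [
    lambda z: [2.0 * v for v in z],
    lambda z: [v + 1.0 for v in z],
]
z0 = [1.0, -1.0]

full_output = residual_forward(z0, blocks)                       # no layer skipped
weak_output = residual_forward(z0, blocks, skip=frozenset({1}))  # layer 1 skipped
```

Because the skipped pass reuses the same weights and only bypasses selected layers, no auxiliary network or extra training is involved in producing the weaker output.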

Attention Skip: Modifies the self-attention operation, which in its standard form is:

\[ \text{SA}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = AV. \]
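
The sketch below computes the standard self-attention shown above and adds a skipped variant. This page does not spell out the skipped form, so the sketch assumes that skipping replaces the attention map A with the identity, which makes the output equal to V. That choice, along with the names `self_attention` and `skip`, is an assumption made for illustration only.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q: Matrix, K: Matrix, V: Matrix, skip: bool = False) -> Matrix:
    """Return Softmax(Q K^T / sqrt(d)) V, or V itself when `skip` is set."""
    if skip:
        # Assumed skipped form: attention map replaced by the identity, so AV = V.
        return [row[:] for row in V]
    d = len(Q[0])
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in K]
        for q_row in Q
    ]
    A = [softmax(row) for row in scores]
    n_cols = len(V[0])
    return [
        [sum(a * V[j][c] for j, a in enumerate(a_row)) for c in range(n_cols)]
        for a_row in A
    ]


# With all-zero queries and keys the attention map is uniform,
# so every output token is the mean of the value rows.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

attended = self_attention(Q, K, V)
skipped = self_attention(Q, K, V, skip=True)
```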

By introducing these simple yet effective modifications, STG avoids the need for additional training or auxiliary models, achieving high-quality video synthesis efficiently.
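
How the weak prediction might enter the sampler can be sketched as follows. This page does not state the guidance update, so the rule used here, an extrapolation away from the weak prediction in the same style as CFG, is an assumption. `guided_prediction`, `eps_full`, `eps_weak`, and `scale` are hypothetical names.

```python
from typing import List


def guided_prediction(
    eps_full: List[float],
    eps_weak: List[float],
    scale: float,
) -> List[float]:
    """Extrapolate from the weak prediction past the full one.

    Assumed rule: eps_full + scale * (eps_full - eps_weak).
    With scale = 0 this returns the full model's prediction unchanged.
    """
    return [f + scale * (f - w) for f, w in zip(eps_full, eps_weak)]


eps_full = [1.0, 2.0]  # prediction from the unmodified model (toy values)
eps_weak = [0.5, 3.0]  # prediction from the layer-skipped model (toy values)

guided = guided_prediction(eps_full, eps_weak, scale=2.0)
unguided = guided_prediction(eps_full, eps_weak, scale=0.0)
```

Each guided step in this sketch needs one extra forward pass through the layer-skipped model, which is the only added cost since no second network is loaded.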

Evaluation

Quantitative Results

Models	Imaging Quality	Aesthetic Quality	Motion Smoothness	Dynamic Degree	Temporal Flickering
Mochi (CFG)	0.524	0.507	0.985	0.87	0.976
Mochi (STG)	0.628	0.554	0.988	0.86	0.978
Open-Sora (CFG)	0.561	0.493	0.982	0.902	0.975
Open-Sora (STG)	0.606	0.509	0.987	0.895	0.976

Table 1. Quantitative results for Mochi and Open-Sora on VBench T2V benchmarks.

Models	FVD (↓)	IS	Imaging Quality	Aesthetic Quality	Motion Smoothness	Dynamic Degree
SVD (CFG)	151.3	38.0	0.687	0.637	0.966	0.562
SVD (STG)	128.7	38.5	0.694	0.639	0.968	0.694

Table 2. Quantitative results for SVD on FVD, IS, and VBench I2V benchmarks.

Comparison of CFG and STG across varying scales in terms of Imaging Quality and FVD.

User Study

BibTeX


@article{hyung2024spatiotemporal,
  title={Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling},
  author={Hyung, Junha and Kim, Kinam and Hong, Susung and Kim, Min-Jung and Choo, Jaegul},
  journal={arXiv preprint arXiv:2411.18664},
  year={2024}
}
