We introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models.
STG employs an implicit weak model via self-perturbation, avoiding the need for external models or additional training.
By selectively skipping spatiotemporal layers, STG obtains an aligned, degraded version of the original model and uses it as the weak model for guidance, improving sample quality without compromising diversity or dynamic degree. Our contributions include:
(1) Introducing STG, an efficient, high-performing guidance technique for video diffusion models.
(2) Eliminating the need for auxiliary models by simulating a weak model through layer skipping.
(3) Enhancing video quality without compromising sample diversity or dynamics, unlike classifier-free guidance (CFG).
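STG builds its weak model by selectively skipping residual blocks or attention layers of the original network. Because the perturbed pass shares all weights with the original, its degraded prediction stays aligned with the full one, which lets guidance improve quality while preserving frame diversity and dynamics.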
Residual Skip: bypasses an entire residual block, so only the identity (skip) connection remains:
\[ \text{Res}(z_l) = z_{l+1} = f_l(z_l) + z_l, \quad \text{Res}'(z_l) = z_{l+1} = z_l. \]
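The residual skip above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `residual_block`, `run_blocks`, and `skip_layers`, as well as the toy layer functions, are hypothetical and chosen only to mirror the equation.

```python
import numpy as np

def residual_block(z, f, skip=False):
    """Res(z) = f(z) + z; with skip=True this becomes Res'(z) = z."""
    if skip:
        return z          # the whole block is bypassed
    return f(z) + z       # standard residual update

def run_blocks(z, fs, skip_layers=()):
    """Apply a stack of residual blocks, bypassing the indices in skip_layers."""
    for l, f in enumerate(fs):
        z = residual_block(z, f, skip=(l in skip_layers))
    return z

# Toy stand-ins for the layer functions f_l (assumed, for illustration only).
fs = [lambda z: 0.5 * z, lambda z: z + 1.0, lambda z: -0.25 * z]
z0 = np.ones(4)

full = run_blocks(z0, fs)                    # original model
weak = run_blocks(z0, fs, skip_layers={1})   # degraded model, block 1 skipped
```

Both passes share the same layer functions, which is why the weak output remains aligned with the full one: the only difference is which blocks are bypassed.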
Attention Skip: perturbs the self-attention operation, which in its standard form is:
\[ \text{SA}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V = AV. \]
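The text above gives only the unperturbed attention. A minimal sketch of one way to skip it follows, under the assumption (not stated in this section) that skipping replaces the attention map A with the identity, so the layer returns V unchanged. The function name `self_attention` and its `skip` flag are hypothetical.

```python
import numpy as np

def self_attention(Q, K, V, skip=False):
    """SA(Q, K, V) = Softmax(Q K^T / sqrt(d)) V = A V."""
    if skip:
        # Assumption: the attention map is replaced by the identity, A = I,
        # so each token keeps its own value vector.
        return V.copy()
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))

out = self_attention(Q, K, V)                  # standard attention
out_skip = self_attention(Q, K, V, skip=True)  # perturbed (skipped) attention
```

With this choice, tokens no longer exchange information in the skipped layer, which degrades the prediction without touching any weights.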
By introducing these simple yet effective modifications, STG avoids the need for additional training or auxiliary models, achieving high-quality video synthesis efficiently.
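This section does not state how the two predictions are combined at sampling time. The sketch below assumes the extrapolation form commonly used by weak-model guidance methods, where the full prediction is pushed away from the layer-skipped one; the function name `stg_guided_prediction` and the parameter `scale` are hypothetical.

```python
import numpy as np

def stg_guided_prediction(eps_full, eps_skip, scale):
    """Extrapolate from the weak (layer-skipped) prediction toward the full one.

    Assumed form: eps_full + scale * (eps_full - eps_skip).
    scale = 0 recovers the unguided prediction.
    """
    return eps_full + scale * (eps_full - eps_skip)

eps_full = np.array([1.0, 2.0])   # prediction of the original model
eps_skip = np.array([0.5, 2.5])   # prediction with layers skipped
guided = stg_guided_prediction(eps_full, eps_skip, scale=2.0)
```

Since both predictions come from the same network, each sampling step costs one extra forward pass and needs no auxiliary model.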
Models | Imaging Quality | Aesthetic Quality | Motion Smoothness | Dynamic Degree | Temporal Flickering |
---|---|---|---|---|---|
Mochi (CFG) | 0.524 | 0.507 | 0.985 | 0.87 | 0.976 |
Mochi (STG) | 0.628 | 0.554 | 0.988 | 0.86 | 0.978 |
Open-Sora (CFG) | 0.561 | 0.493 | 0.982 | 0.902 | 0.975 |
Open-Sora (STG) | 0.606 | 0.509 | 0.987 | 0.895 | 0.976 |
Table 1. Quantitative results for Mochi and Open-Sora on VBench T2V benchmarks.
Models | FVD (↓) | IS | Imaging Quality | Aesthetic Quality | Motion Smoothness | Dynamic Degree |
---|---|---|---|---|---|---|
SVD (CFG) | 151.3 | 38.0 | 0.687 | 0.637 | 0.966 | 0.562 |
SVD (STG) | 128.7 | 38.5 | 0.694 | 0.639 | 0.968 | 0.694 |
Table 2. Quantitative results for SVD on FVD, IS, and VBench I2V benchmarks.
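To read Table 2 in relative terms, the changes from CFG to STG can be computed directly from the reported values. The snippet below is only a convenience calculation over the table's numbers; the dictionary and function names are hypothetical.

```python
# Values copied from Table 2 (SVD with CFG vs. SVD with STG).
svd_cfg = {"FVD": 151.3, "IS": 38.0, "Imaging Quality": 0.687,
           "Aesthetic Quality": 0.637, "Motion Smoothness": 0.966,
           "Dynamic Degree": 0.562}
svd_stg = {"FVD": 128.7, "IS": 38.5, "Imaging Quality": 0.694,
           "Aesthetic Quality": 0.639, "Motion Smoothness": 0.968,
           "Dynamic Degree": 0.694}

def relative_change(baseline, new):
    """Return (new - baseline) / baseline for every metric."""
    return {k: (new[k] - baseline[k]) / baseline[k] for k in baseline}

rel = relative_change(svd_cfg, svd_stg)
for metric, change in rel.items():
    print(f"{metric}: {change:+.1%}")
```

FVD is lower-is-better, so a negative change there is an improvement; the other metrics are higher-is-better.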
@misc{hyung2024spatiotemporalskipguidanceenhanced,
title={Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling},
author={Junha Hyung and Kinam Kim and Susung Hong and Min-Jung Kim and Jaegul Choo},
year={2024},
eprint={2411.18664},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.18664},
}