The multi-step sampling mechanism, a key feature of visual diffusion models, has significant potential to replicate the success of OpenAI's Strawberry in enhancing performance by increasing the inference computational cost. Numerous prior studies have demonstrated that correctly scaling up computation in the sampling process can lead to improved generation quality, enhanced image editing, and compositional generalization. While there have been rapid advancements in developing inference-heavy algorithms for improved image generation, relatively little work has explored inference scaling laws in video diffusion models (VDMs). Furthermore, existing research reports only minimal performance gains that are barely perceptible to the naked eye. To address this, we design a novel training-free algorithm, IV-mixed Sampler, that leverages the strengths of image diffusion models (IDMs) to help VDMs surpass their current capabilities. The core of IV-mixed Sampler is to use IDMs to significantly enhance the quality of each video frame while VDMs ensure the temporal coherence of the video during the sampling process. Our experiments demonstrate that IV-mixed Sampler achieves state-of-the-art performance on four benchmarks: UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. For example, the open-source Animatediff with IV-mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, approaching the 223.1 achieved by the closed-source Pika-2.0.
Specifically, 1) we construct IV-mixed Sampler under a rigorous mathematical framework and demonstrate, through theoretical analysis, that it can be elegantly transformed into a standard inverse ordinary differential equation (ODE) process. For the sake of intuition, we present IV-mixed Sampler (i.e., the "IV-IV" variant) in Fig. 3 and its pseudocode in Appendix B. 2) The empirically optimal IV-mixed Sampler further reduces the UMT-FVD by approximately 10 points compared to the best "I-I" variant in Fig. 2 and by 39.72 points over FreeInit. Furthermore, 3) we conduct extensive ablation studies to determine which classifier-free guidance (CFG) scale and which sampling paradigm yield the best performance for various metrics at different sampling intervals. In addition, 4) qualitative and quantitative comparison experiments amply demonstrate that our algorithm achieves state-of-the-art (SOTA) performance on four popular benchmarks: UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150, and Chronomagic-Bench-1649. These competitive outcomes demonstrate that our proposed IV-mixed Sampler dramatically improves the visual quality and semantic faithfulness of the synthesized video.
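To convey the flavor of the "IV-IV" pattern, the sketch below performs one latent correction built from deterministic DDIM transitions: invert (add noise) with the IDM and then the VDM, then denoise with the IDM and then the VDM. This is a minimal illustration under our own assumptions, not the paper's exact procedure (the actual pseudocode is in Appendix B); the `eps_idm`/`eps_vdm` interfaces and the choice of noise levels are hypothetical.

```python
import numpy as np

def ddim_step(x, eps, a_src, a_dst):
    """Deterministic DDIM transition between cumulative-alpha levels.
    a_dst < a_src adds noise (DDIM-Inversion); a_dst > a_src denoises."""
    x0_pred = (x - np.sqrt(1.0 - a_src) * eps) / np.sqrt(a_src)
    return np.sqrt(a_dst) * x0_pred + np.sqrt(1.0 - a_dst) * eps

def iv_mixed_correction(x_t, a_t, a_mid, a_far, eps_idm, eps_vdm):
    """Sketch of one 'IV-IV' correction of the latent x_t.
    eps_idm / eps_vdm are noise predictors (hypothetical interfaces)
    evaluated at a given cumulative alpha; a_far < a_mid < a_t."""
    x = ddim_step(x_t, eps_idm(x_t, a_t), a_t, a_mid)  # I: inversion
    x = ddim_step(x, eps_vdm(x, a_mid), a_mid, a_far)  # V: inversion
    x = ddim_step(x, eps_idm(x, a_far), a_far, a_mid)  # I: denoising
    x = ddim_step(x, eps_vdm(x, a_mid), a_mid, a_t)    # V: denoising
    return x                                           # corrected latent at level a_t
```

Note that when both predictors return zero noise, the four sub-steps cancel exactly and `x_t` is returned unchanged; the correction acts only through the disagreement between the two models' noise predictions.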
A serene forest clearing at dawn, where deer graze peacefully while golden rays of sunlight pierce through the mist-laden trees.
An explorer standing at the edge of a massive desert canyon, with swirling sands below and towering rock formations stretching into the distance.
A young girl sitting on a windowsill, staring at a rainy cityscape while holding a steaming cup of tea.
A futuristic cityscape at night, with glowing neon signs, flying cars weaving through towering skyscrapers, and bustling street markets below.
Visualization of IV-mixed Sampler and standard DDIM sampling on Animatediff and VideoCrafterV2. Unlike prior inference-heavy approaches, IV-mixed Sampler significantly improves the fidelity of the video while guaranteeing semantic faithfulness.
In this paper, we propose IV-mixed Sampler to enhance the visual quality of synthesized videos by leveraging an IDM while ensuring temporal coherence through a VDM. The algorithm utilizes DDIM and DDIM-Inversion to correct the latent representation x_t at any time point t, enabling seamless integration into any VDM at any sampling interval. IV-mixed Sampler can be formulated as an ODE, achieving a trade-off between visual quality and temporal coherence by adjusting the CFG scales of both the IDM and VDM. In the future, we plan to fine-tune several stronger IDMs, such as FLUX, to better adapt to the latent space of target VDMs, thereby further enhancing the performance of VDMs. We anticipate that IV-mixed Sampler will be widely applicable in vision generation tasks.
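The CFG trade-off mentioned above uses the standard classifier-free guidance combination; a minimal sketch is shown below, with separate scales for the IDM and VDM sub-steps (the scale names are our own, and the comment describes the trade-off qualitatively, not an exact tuning rule).

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one by `scale`.
    scale = 0 ignores the prompt; scale = 1 is plain conditional sampling."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Hypothetical usage: with separate scales s_idm and s_vdm applied in the
# corresponding sub-steps, raising s_idm pushes per-frame visual quality
# while s_vdm governs temporal coherence across frames.
```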