Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

✨ NeurIPS 2025 Spotlight ✨

Huanlin Gao1,2 Ping Chen1,2 Fuyuan Shi1,2 Chao Tan1,2 Zhaoxiang Liu1,2
Fang Zhao1,2 Kai Wang1,2 Shiguo Lian1,2
1Data Science & Artificial Intelligence Research Institute, China Unicom,  2Unicom Data Intelligence, China Unicom
(* Equal contribution. † Corresponding author.)

Keyframe Comparison

Abstract

We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9× speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis.


LeMiCa Teaser Figure
Figure 1: Comparison between LeMiCa (globally controlled cache) and traditional local greedy cache methods. Top: Schematic of different caching strategies. Bottom: LeMiCa outperforms baselines in consistency and speed.

LeMiCa

Motivation: Limitations of Local-Greedy Caching

Existing cache methods rely on local error metrics (e.g., the relative L1 distance between outputs at adjacent timesteps) and fixed thresholds to decide when to cache, an approach that suffers from two key limitations:

  1. Ignoring Temporal Heterogeneity: Denoising timesteps vary in noise level and semantic importance (early steps shape global structure, late steps refine details), so a single uniform threshold cannot accurately decide when caching is safe.
  2. Global Error Accumulation: Minimizing error locally overlooks how errors propagate over the remaining trajectory, leading to content inconsistency and quality degradation in the final videos.
Local vs Global Error
(a) Local vs. Global Cache Error Control
Segment-wise Global Error
(b) Segment-wise Error Trends
Figure 2: Rethinking cache reuse in denoising diffusion via error estimation. (a) The traditional Local-Greedy ($L_1^{rel}$) strategy uses fixed thresholds on local output differences between adjacent timesteps to decide when to cache. This assumes uniform temporal sensitivity, which can be misleading—for instance, caching at $t_2$ yields lower final error than $t_1$, despite $t_1$ seeming smoother locally. This highlights the role of temporal heterogeneity. (b) Our Global Outcome-Aware (segment-wise error) strategy estimates final output error when caching outputs over segments of length $len$, starting from timestep $i$. The plot shows that early caches cause greater error, supporting an outcome-sensitive, trajectory-aware strategy over fixed local heuristics.
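As a concrete illustration, the Local-Greedy baseline can be sketched as follows. This is a minimal toy, not the exact implementation of any prior method: the `outputs` list, the `threshold` value, and the function names are all hypothetical.

```python
import numpy as np

def l1_rel(curr, prev):
    """Relative L1 change between model outputs at adjacent timesteps."""
    return np.abs(curr - prev).mean() / (np.abs(prev).mean() + 1e-8)

def local_greedy_schedule(outputs, threshold=0.05):
    """Mark a timestep for cache reuse when the local change is small.

    `outputs` is a list of model outputs, one per timestep (toy stand-ins
    here). The first step always runs full inference.
    """
    reuse = [False]
    for prev, curr in zip(outputs, outputs[1:]):
        reuse.append(l1_rel(curr, prev) < threshold)
    return reuse
```

The fixed `threshold` is exactly the uniform temporal-sensitivity assumption that Figure 2(a) questions: a step can look locally smooth yet cause a large final-output error when cached.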

Implementation

LeMiCa formulates cache scheduling as a global optimization problem on a directed acyclic graph (DAG). Each node represents a timestep, and each edge (i → j) means full inference is done at i and j while reusing cached results in between. The goal is to control global error accumulation while reducing computation.

$$L_1^{glob}(i \rightarrow j) = \frac{1}{N} \| x^{cache(i \rightarrow j)}_0 - x^{original}_0 \|_1$$

Here, \(x^{original}_0\) is the output without caching, and \(x^{cache(i \rightarrow j)}_0\) is the output after reusing intermediate caches. This metric reflects the real downstream impact of caching rather than local step-wise differences.
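A minimal sketch of how such an edge weight could be measured, using a toy scalar "denoiser" in place of a real diffusion model (the update rule, step count, and inputs below are all stand-in assumptions):

```python
import numpy as np

def run(num_steps, x0, cached_segment=None):
    """Run a toy denoising loop. Inside `cached_segment` = (i, j), reuse the
    model output computed at step i instead of re-evaluating the model."""
    x, reused = x0.copy(), None
    for t in range(num_steps):
        if cached_segment and cached_segment[0] < t < cached_segment[1]:
            delta = reused                   # reuse the cached model output
        else:
            delta = -0.1 * np.sin(x + t)     # "full inference" at this step
            reused = delta
        x = x + delta
    return x

def l1_glob(i, j, num_steps, x0):
    """Global error of caching over the segment i -> j, per the definition above."""
    ref = run(num_steps, x0)                          # x_0^original
    cached = run(num_steps, x0, cached_segment=(i, j))  # x_0^cache(i -> j)
    return np.mean(np.abs(cached - ref))
```

Because the metric compares final outputs, an adjacent edge (j = i + 1) caches nothing and has zero error, while longer segments accumulate the true downstream deviation.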

Given a computation budget \(B\), LeMiCa finds a path from the start node \(s\) to the end node \(t\) that minimizes the worst error along the path:

$$\min_{P \in \mathcal{P}^{(B)}_{s \to t}} \mathrm{LexMax}\big(\mathrm{sort}_{desc}\{w(e) \mid e \in P_{cache}\}\big)$$

The lexicographic minimax strategy first reduces the largest cache error, then the next largest, ensuring smooth and consistent generation. This global planning avoids the unstable behavior of local greedy caching and achieves both speedup and fidelity.
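The lexicographic comparison itself is easy to make concrete: sorting each path's cache-edge errors in descending order and comparing the resulting vectors elementwise is exactly the LexMax order. The sketch below uses brute-force enumeration over a toy weight model rather than the paper's graph search, and the weights are purely hypothetical.

```python
from itertools import combinations

def lexmax_key(edges, w):
    """Edge errors sorted in descending order; Python compares such lists
    lexicographically, which realizes the LexMax objective."""
    return sorted((w[e] for e in edges), reverse=True)

def best_schedule(num_steps, budget, w):
    """Choose `budget` full-inference timesteps (always keeping the first and
    last) whose cache-segment error vector is lexicographically minimal."""
    best_nodes, best_key = None, None
    for chosen in combinations(range(1, num_steps - 1), budget - 2):
        nodes = (0, *chosen, num_steps - 1)
        key = lexmax_key(list(zip(nodes, nodes[1:])), w)
        if best_key is None or key < best_key:
            best_nodes, best_key = nodes, key
    return best_nodes, best_key

# Hypothetical edge weights: longer cached segments, and caching earlier in
# the trajectory, both cost more (cf. Figure 2b). Adjacent edges cost zero.
steps = 5
w = {(i, j): (j - i - 1) * (steps - i)
     for i in range(steps) for j in range(i + 1, steps)}
```

With this toy model, `best_schedule(steps, 3, w)` selects timesteps (0, 2, 4): splitting the cached span in the middle keeps the worst segment error smaller than caching a long early stretch. A dynamic program over the DAG would replace the brute-force loop at scale.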

Evaluations

Quantitative Results

Qualitative and Ablation Results

Visual Results

Open-Sora

Latency per method:
Original: 26.54s
TeaCache-slow: 17.58s
LeMiCa-slow: 17.43s
TeaCache-fast: 12.63s
LeMiCa-fast: 10.86s

Extension to Text-to-Image

Qwen-Image

Latency per budget:
Original: 31.22s
LeMiCa (B=25): 17.93s
LeMiCa (B=17): 13.52s
LeMiCa (B=10): 9.84s
Qwen-Image visual result


Comparison of the inference efficiency and visual quality of LeMiCa and other acceleration methods on Qwen-Image.

Table comparing LeMiCa with other acceleration methods on Qwen-Image

BibTeX

    
  @inproceedings{gao2025lemica,
    title     = {LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation},
    author    = {Huanlin Gao and Ping Chen and Fuyuan Shi and Chao Tan and Zhaoxiang Liu and Fang Zhao and Kai Wang and Shiguo Lian},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year      = {2025},
    url       = {https://arxiv.org/abs/2511.00090}
  }