Speculative decoding is one of the cleanest ideas in LLM inference. You have a big, slow target model that produces high-quality text. You pair it with a small, fast draft model that guesses what the big model would say. The big model verifies those guesses in parallel — and since verification is cheap (just one forward pass regardless of how many tokens you're checking), you get a speedup for free. No quality loss. The output is identical to what the target model would have produced alone.
The entire game is in the draft model: how fast can it generate guesses, and how often does the target model accept them?
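The draft-then-verify loop is easy to state in code. Here's a toy sketch of one greedy-decoding cycle; `target_next` and `draft_next` are stand-ins for real models, and all names are mine, not from any paper:

```python
# Toy sketch of one speculative decoding cycle (greedy).
# target_next / draft_next stand in for real models: each maps a
# token sequence to its next token.

def target_next(seq):
    # pretend target model: next token is (last + 1) % 10
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # pretend draft model: agrees with the target except after token 5
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10

def speculative_step(seq, gamma=4):
    """Draft gamma tokens, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(seq)
    for _ in range(gamma):                 # sequential drafting (the bottleneck)
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    # Verification: one (conceptually parallel) pass over all draft positions.
    accepted, ctx = [], list(seq)
    for tok in draft:
        want = target_next(ctx)
        if tok != want:
            accepted.append(want)          # replace first mismatch with target token
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target_next(ctx))  # bonus token when every draft is accepted
    return seq + accepted

print(speculative_step([3], gamma=4))  # [3, 4, 5, 6]
```

Because the first mismatch is replaced with the target's own token, the output is exactly what greedy decoding with the target alone would produce; the drafter only changes how many target forward passes you need.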
Every major speculative decoding system — Medusa, EAGLE, EAGLE-2, EAGLE-3 — uses an autoregressive drafter. That is, the draft model generates tokens one at a time, sequentially. DFlash, from Zhijian Liu's group at UCSD (arXiv:2602.06036), asks a simple question: what if the drafter were a diffusion model instead?
The answer turns out to be a 6x lossless speedup on Qwen3-8B, and 2.5x faster than EAGLE-3.
The Autoregressive Drafting Bottleneck
Here's the problem with autoregressive drafters. If you want to draft $\gamma$ tokens, you need $\gamma$ sequential forward passes through the draft model. Each pass depends on the previous one. The drafting cost scales linearly:
$$T_{\text{draft}}^{\text{AR}} = \gamma \cdot t_{\text{step}}$$
To keep drafting fast, you're forced to use an extremely shallow architecture — EAGLE-3 uses a single transformer layer. This keeps $t_{\text{step}}$ small, but it also caps the drafter's expressiveness. A 1-layer model can only capture so much about the target model's behavior.
You're stuck in a trade-off: more layers means better draft quality (longer average acceptance length $\tau$) but slower drafting. Fewer layers means faster drafting but worse guesses. In practice, autoregressive drafters plateau around 3-4x speedup.
The Diffusion Drafting Insight
Diffusion models generate all tokens in a block simultaneously. You start with a block of masked tokens and denoise them in a single forward pass. The drafting cost is:
$$T_{\text{draft}}^{\text{diff}} = t_{\text{parallel}}$$
This cost is essentially independent of $\gamma$. Whether you're drafting 4 tokens or 16, it's one forward pass. GPUs are throughput machines: one wide forward pass over a block of positions costs little more than a pass over a single position, so $t_{\text{parallel}} \ll \gamma \cdot t_{\text{step}}$ for the same model size.
This changes the design space completely. Since drafting cost doesn't grow with $\gamma$, you can afford a deeper, more expressive drafter without paying a latency penalty. A 5-layer diffusion drafter generating 16 tokens has lower latency than a 1-layer EAGLE-3 generating 8 tokens.
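Plugging illustrative numbers into the two cost expressions makes the asymmetry concrete. These timings are placeholders I chose to match the shape of the claim, not measurements from the paper:

```python
# Back-of-envelope latency model for one draft-verify cycle, using the
# two cost expressions above. All times (ms) are illustrative placeholders.

def ar_cycle(gamma, t_step, t_verify):
    return gamma * t_step + t_verify   # gamma sequential draft passes

def diff_cycle(gamma, t_parallel, t_verify):
    return t_parallel + t_verify       # one denoising pass, regardless of gamma

# A deeper diffusion drafter (t_parallel = 3 ms) drafting 16 tokens can
# still undercut a 1-layer AR drafter (t_step = 1 ms) drafting 8:
print(ar_cycle(8, 1.0, 30.0))    # 38.0
print(diff_cycle(16, 3.0, 30.0)) # 33.0
```

And the diffusion drafter's per-cycle cost stays at 33 ms even if you double the block size again, while the AR cost keeps climbing by `t_step` per token.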
But Naive Diffusion Drafting Doesn't Work
The natural thing to try is training a standalone block diffusion model and using it as the drafter. People have tried this, and the results are mediocre — around 2-3x speedup. A standalone diffusion drafter simply doesn't have enough context about what the target model is thinking.
This is where DFlash's key insight comes in: the target model already knows the future.
Samragh et al. (2025) showed that the hidden representations of autoregressive LLMs implicitly encode information about multiple future tokens — not just the next one. The features at position $t$ contain predictive signal about tokens at positions $t+1, t+2, t+3, \ldots$ These are rich, high-dimensional representations that capture long-range dependencies, task-specific semantics, and future-token predictions all at once.
DFlash's idea: extract these hidden features from the target model and inject them into the diffusion drafter as conditioning context. The drafter doesn't have to figure out the future from scratch — it gets a compressed summary of what the target model already computed.
The Architecture
Here's how DFlash works in practice:
Step 1: The target model (e.g. Qwen3-8B) processes the prompt and generates one token autoregressively. During this forward pass, it also produces hidden representations at every layer.
Step 2: DFlash extracts hidden states from a fixed set of layers — sampled uniformly from shallow to deep (e.g., 5 layers from layer 2 to the second-to-last). These are concatenated and passed through a lightweight projection layer to produce a single target context feature.
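Steps 1-2 amount to a concat-and-project over the tapped layers. A minimal NumPy sketch, with made-up shapes and a random matrix standing in for the trained projection:

```python
import numpy as np

# Sketch of building the target context feature: concatenate hidden states
# from several tapped layers, then project down. Shapes, layer count, and
# W_proj are illustrative stand-ins, not the paper's actual dimensions.

rng = np.random.default_rng(0)
d_model, n_tap = 64, 5                     # hidden size, number of tapped layers

# one hidden vector per tapped layer (at the current position)
hiddens = [rng.standard_normal(d_model) for _ in range(n_tap)]

concat = np.concatenate(hiddens)           # (n_tap * d_model,)
W_proj = rng.standard_normal((d_model, n_tap * d_model)) / np.sqrt(n_tap * d_model)
context_feature = W_proj @ concat          # single target context feature
print(context_feature.shape)               # (64,)
```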
Step 3: This context feature is injected into the diffusion drafter's KV cache — not as an input embedding, but as persistent Key and Value projections in every draft layer. The drafter then denoises a block of masked tokens in one forward pass, using bidirectional attention (since all positions are generated simultaneously, there's no causal constraint).
Step 4: The target model verifies the draft block in a single forward pass. Accepted tokens go to the output. The bonus token from verification becomes the anchor for the next drafting cycle.
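Put together, one DFlash cycle looks like the classic speculative loop, but with a single drafter call instead of $\gamma$ sequential ones. In this toy sketch the "denoiser" is a stand-in that happens to imitate the target perfectly, and the `context` argument marks where the injected target feature would go; none of these names come from the paper:

```python
# Toy DFlash-style cycle: one parallel "denoise" call proposes the whole
# block, then the target verifies it in one pass.

def target_next(seq):
    return (seq[-1] + 2) % 7                    # pretend target model

def denoise_block(anchor_seq, context, block=4):
    # Stand-in for the diffusion drafter. In DFlash this is ONE forward
    # pass over `block` masked slots, conditioned on `context`; here we
    # imitate the target so the sketch stays deterministic.
    out, ctx = [], list(anchor_seq)
    for _ in range(block):
        tok = target_next(ctx)
        out.append(tok)
        ctx.append(tok)
    return out

def dflash_cycle(seq, context=None, block=4):
    draft = denoise_block(seq, context, block)  # Step 3: one drafter pass
    accepted, ctx = [], list(seq)
    for tok in draft:                           # Step 4: parallel verification
        want = target_next(ctx)
        if tok != want:
            accepted.append(want)
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target_next(ctx))       # bonus token: next cycle's anchor
    return seq + accepted

print(dflash_cycle([1]))  # [1, 3, 5, 0, 2, 4]
```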
Why KV Injection Matters
This is the architectural detail that makes DFlash work, and it's worth pausing on.
EAGLE-3 also extracts hidden features from the target model. But it fuses those features with the draft model's token embeddings and feeds them as inputs. As the draft model generates tokens autoregressively, each new token's representation gets further from the original target features — the context information gets diluted through the sequential generation process.
DFlash does something different. The target context feature is injected directly into the KV cache of every draft layer. This means every draft token, at every layer, attends to the same rich context from the target model. The information doesn't degrade with position or depth.
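A single-head NumPy sketch of what "inject into the KV cache" means: the projected context contributes an extra key/value row that sits in front of every draft token's attention window, with no causal mask. Shapes and random weights are illustrative only:

```python
import numpy as np

# Minimal single-head attention with KV injection: a persistent context
# row is prepended to the keys and values, so every draft position can
# attend to the target's context feature. Illustrative shapes throughout.

rng = np.random.default_rng(1)
d, n_draft = 16, 8
ctx_k = rng.standard_normal((1, d))          # injected context Key row
ctx_v = rng.standard_normal((1, d))          # injected context Value row
q = rng.standard_normal((n_draft, d))        # draft-token queries
k = rng.standard_normal((n_draft, d))        # draft-token keys
v = rng.standard_normal((n_draft, d))        # draft-token values

K = np.vstack([ctx_k, k])                    # context row prepended to the cache
V = np.vstack([ctx_v, v])
scores = q @ K.T / np.sqrt(d)                # bidirectional: no causal mask
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # softmax over context + draft tokens
out = attn @ V                               # every draft position saw the context
print(out.shape)                             # (8, 16)
```

Repeating this at every draft layer (rather than only mixing the context into the input embeddings) is what keeps the signal from fading with depth.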
This is why DFlash's acceptance length scales with depth. Adding more layers to the draft model consistently improves quality because every layer has access to the full target context. In EAGLE-3, adding layers has diminishing returns because the context is only available at the input.
The numbers from the ablation bear this out: a 3-layer DFlash achieves 3.18 average acceptance length, 5-layer gets 3.37, and 8-layer reaches 3.50. The gains are steady, not diminishing.
Training Tricks
A few clever training choices that matter:
Random anchor sampling. In standard block diffusion training, blocks start at fixed positions. DFlash randomly samples an "anchor token" within the response, uses it as the start of a block, and masks the rest. This mirrors inference behavior (where the anchor is always the last verified token) and provides much better data coverage. Ablation shows this alone improves speedup from 3.30x to 4.91x.
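Under my reading of this trick, a training example is built roughly like this; the helper name and `MASK` token are mine, not the paper's:

```python
import random

# Sketch of random anchor sampling: pick a random anchor inside the
# response, keep everything up to it as context, and mask the next
# `block` positions as the denoising target.

MASK = "<mask>"

def make_training_example(response, block=4, seed=0):
    rng = random.Random(seed)
    anchor = rng.randrange(len(response) - block)    # random anchor position
    context = response[: anchor + 1]                 # prefix ending at the anchor
    targets = response[anchor + 1 : anchor + 1 + block]
    inputs = context + [MASK] * block                # masked block to denoise
    return inputs, targets

ex_in, ex_tgt = make_training_example(list(range(20)), block=4)
```

At inference the anchor is always the last verified token, so sampling it uniformly during training covers every position the drafter will actually see.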
Exponential position weighting. In speculative decoding, early positions matter more than late ones — an error at position 1 invalidates everything after it. DFlash weights the loss with $w_k = \exp(-(k-1)/\gamma)$, putting more emphasis on getting the first few tokens right. This changes the loss landscape to prioritize acceptance length over raw token accuracy.
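The weights themselves are one line. For $\gamma = 8$, the first position gets weight 1 and the last about 0.42:

```python
import math

# Position weights w_k = exp(-(k-1)/gamma) from the loss above: earlier
# draft positions count exponentially more, since an early miss truncates
# the accepted prefix no matter how good the later tokens are.

def position_weights(gamma):
    return [math.exp(-(k - 1) / gamma) for k in range(1, gamma + 1)]

w = position_weights(8)
print(round(w[0], 2), round(w[-1], 2))  # 1.0 0.42
```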
Shared embedding and LM head. The draft model borrows the target model's token embedding layer and language modeling head, freezing both. Only the transformer layers in between are trained. This keeps the drafter small (just a few transformer layers) and tightly aligned with the target model's representation space.
The Results
On Qwen3-8B across math, code, and chat benchmarks:
- Greedy decoding (temp=0): DFlash achieves 4.91x average speedup vs 2.08x for EAGLE-3. On some benchmarks (GSM8K), DFlash hits 6.09x.
- Sampling (temp=1): DFlash gets 4.07x vs 1.93x for EAGLE-3.
- SGLang serving: On a single B200 GPU, DFlash outperforms EAGLE-3 across all concurrency levels (1 to 32), achieving up to 5.1x speedup on Qwen3-8B.
- Reasoning models (thinking mode): 4.5x and 3.9x speedup, vs 3.67x and 3.75x for EAGLE-3. The high acceptance rates make DFlash particularly useful for long-form reasoning where CoT traces run to thousands of tokens.
The acceptance length comparison is telling: DFlash with block size 16 achieves $\tau = 7.27$ on Qwen3-4B at temperature 0. That means on average, 7.27 out of every 16 drafted tokens get accepted. EAGLE-3, with a tree size of 60, achieves $\tau = 3.51$.
Why This Matters
Three things stand out to me:
1. The Pareto frontier shifted. Before DFlash, speculative decoding was stuck at 2-4x. DFlash pushes it to 5-6x by fundamentally changing the drafting architecture. This isn't an incremental improvement — it's a new design point that was previously inaccessible.
2. Diffusion found its niche. Diffusion language models have been trying (and mostly failing) to compete with autoregressive models on end-to-end generation quality. DFlash sidesteps that entire competition. It doesn't need the diffusion model to produce final output — just to produce good enough drafts that survive verification. The target model handles quality. Diffusion handles speed. Each paradigm does what it's good at.
3. The "target knows the future" observation is the real enabler. Without KV injection of the target model's hidden states, diffusion drafting is mediocre. The paper includes an ablation where they train a standalone diffusion drafter without any target conditioning — it barely hits 2-3x. The hidden representations are what make it work, and they come for free (the target model already computed them during prefill).
This connects to something I find deeply interesting: the idea that autoregressive models encode far more than just the next token. Their hidden states are compressed representations of the entire future trajectory. DFlash is the first paper I've seen that exploits this observation for a practical systems-level gain rather than just a scientific curiosity.
For anyone running inference at scale, this is worth watching. The code and models are open. SGLang integration exists. The practical barrier to trying this is low.
Paper: DFlash: Block Diffusion for Flash Speculative Decoding, Jian Chen, Yesheng Liang, Zhijian Liu. UC San Diego, February 2026.