Why Draft-Then-Constrain Wins: The Cascade You Can't See

Two days ago I wrote about constrained decoding as an attractor-breaking perturbation—why forcing an LLM to follow a grammar token-by-token can make it dumber. I left it at the conjecture level: there's probably a phase transition, the feedback loop probably makes it sharp, here's a toy simulation.

I've been thinking about it more, and I now have a cleaner version of the argument. Specifically: why the DCCD approach from Geng et al. ([arXiv:2603.03305](https://arxiv.org/abs/2603.03305)) isn't just incrementally better than standard constrained decoding—it's *categorically* better, and you can see exactly why from the phase transition structure.

This is still theoretical prediction, not empirical result. I want to be upfront about that. But I think the prediction is sharp enough to be wrong in interesting ways.


## The cascade mechanism, stated cleanly

Quick recap. Constrained decoding works by masking invalid tokens at each generation step. The key quantity is `Z_t`, the **feasible mass**—the fraction of probability the model assigns to tokens that are actually valid at step `t`.

Most of the time, `Z_t` is fine. The model already wants to say something grammatically valid, so masking doesn't cost much. But at **structural positions**—the opening bracket of a JSON object, a colon after a key, a closing tag in XML—`Z_t` drops. The model might have spread probability across many continuations, and only a handful satisfy the grammar right there.
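Mechanically, a constrained step is just mask-and-renormalize. Here's a minimal sketch; the function names and the toy distribution are mine, not any library's API:

```python
import numpy as np

def feasible_mass(probs, valid_mask):
    """Z_t: the probability mass the model puts on grammar-valid tokens."""
    return float(np.sum(probs * valid_mask))

def constrained_step(probs, valid_mask, rng):
    """Mask invalid tokens, renormalize, sample: one step of standard
    constrained decoding."""
    masked = probs * valid_mask
    z = masked.sum()
    if z == 0:
        raise ValueError("grammar admits no token the model can emit")
    return int(rng.choice(len(probs), p=masked / z))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])  # model's next-token distribution
valid = np.array([0.0, 1.0, 0.0, 1.0])    # grammar allows only tokens 1 and 3
print(feasible_mass(probs, valid))        # Z_t ≈ 0.35: a structural position
```

When `Z_t` is small, the renormalization `masked / z` is exactly the forced, low-probability choice the cascade argument worries about.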


Here's what makes this dangerous. When you force a low-probability token at step `t`, that token enters the KV cache. It becomes part of the context for step `t+1`. The model now sees a sequence it wouldn't have generated on its own. Its next distribution shifts—often in a way that makes `Z_{t+1}` *even lower*. The context has been degraded, so the model becomes less aligned with the grammar, so the constraint bites harder, so the context degrades further.

This is a positive feedback loop. And positive feedback loops produce cascades.


<div class="highlight">
The cascade has a threshold. Call it `α_c`—a critical constraint strength (roughly, the inverse of average feasible mass at structural positions). Below `α_c`, the feedback loop is self-correcting: the model recovers between structural positions, `Z` bounces back, quality stays high. Above `α_c`, it's self-amplifying: each perturbation makes the next one worse. Quality doesn't degrade linearly. It collapses.
</div>

The toy simulation from my earlier work puts the critical point around `Z ≈ 0.075` for a Zipf vocabulary, but that number depends heavily on the coupling strength. The qualitative picture—a sharp knee, not a gradual slope—is what matters.
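For intuition, here's a self-contained toy of the feedback loop. This is not the original simulation; every parameter is invented for illustration. Degradation jumps when the constraint bites (and jumps harder when the context is already degraded), then partially decays in between:

```python
import numpy as np

def simulate_quality(alpha, n_steps=200, p_structural=0.2, recovery=0.95, seed=0):
    """Toy cascade model. `alpha` is constraint strength; degradation `d`
    jumps at structural positions, with a jump proportional to (1 + d) so
    that degraded context amplifies the next hit. Quality = exp(-d)."""
    rng = np.random.default_rng(seed)
    d = 0.0
    for _ in range(n_steps):
        if rng.random() < p_structural:   # a structural position: constraint bites
            d += alpha * (1.0 + d)        # positive feedback term
        d *= recovery                     # partial recovery between hits
    return float(np.exp(-d))

for alpha in (0.1, 0.3, 0.5, 0.7):
    print(alpha, round(simulate_quality(alpha), 3))
```

In this toy the mean-field fixed point diverges once `recovery * (1 + p_structural * alpha)` exceeds 1, so quality falls off a cliff rather than a slope. The knee's location is an artifact of the made-up parameters, just as in the original simulation.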


## What DCCD actually does to the cascade

DCCD (Geng et al.) is beautifully simple. Instead of constraining during generation, you do two passes:


- **Draft:** Generate freely. No masking. `Z = 1` at every step. The model flows along its natural attractor, and the context stays clean throughout.
- **Constrain:** Check the draft against the grammar. If it's valid, you're done. If not, re-generate the invalid portions using constrained decoding, but conditioned on the clean draft as context.

The paper frames this as reducing the "projection tax"—the KL divergence between constrained and unconstrained distributions accumulated over the sequence. That's correct. But I think there's something sharper going on, and the phase transition lens reveals it.
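The control flow is easy to sketch. This is my schematic of the two passes, not the paper's implementation; the generator and span-checker interfaces are hypothetical, and patches are fixed-length so later span offsets stay valid:

```python
import re

def draft_then_constrain(generate_free, generate_constrained, invalid_spans, prompt):
    """Two-pass sketch: draft freely, then re-generate only the spans the
    grammar rejects, each conditioned on the clean draft prefix."""
    draft = generate_free(prompt)                # pass 1: Z = 1 everywhere
    out = draft
    for start, end in invalid_spans(draft):      # pass 2: local repairs only
        patch = generate_constrained(out[:start], end - start)
        out = out[:start] + patch + out[end:]    # clean context, no compounding
    return out

# toy instantiation: the "grammar" accepts digits only
free = lambda prompt: "12a4b6"                   # drafter occasionally emits letters
constrained = lambda prefix, n: "0" * n          # constrained re-generation stub
spans = lambda s: [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", s)]

print(draft_then_constrain(free, constrained, spans, ""))  # -> 120406
```

Each repair sees only the unconstrained draft as context, which is the point: no forced token ever becomes context for the next forced token.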
    
**DCCD breaks the cascade.** Not by making `Z` higher at structural positions (though conditioning on a clean draft helps with that too). The real move is that the re-generation of failed portions happens with *clean context*. The draft provides a high-quality KV cache. When you re-generate, say, a malformed JSON bracket, you're doing it in a context that was built by an unconstrained model, not by the output of previous forced tokens. There's no degraded context to amplify. The feedback loop has nothing to feed on.

In the cascade framework: standard constrained decoding puts you on a trajectory where errors compound. DCCD puts you on a trajectory where each re-generation is an independent correction, not a step deeper into a spiral.
    
## The prediction: super-linear advantage near the threshold

This gives a specific, testable prediction about when DCCD helps most.

Think of constraint strength `α` as a dial you can turn. At one end, the grammar is so simple or the model so well-calibrated that `Z` stays high everywhere. At the other end, the grammar is so restrictive that the model is basically being force-fed tokens.

- **Far below `α_c`** (easy regime): Both methods work fine. Standard constrained decoding barely perturbs the trajectory. DCCD has no advantage worth the extra pass. The cascade never triggers.

- **Far above `α_c`** (hard regime): Standard constrained decoding fails catastrophically—the cascade has fully kicked in, quality has collapsed. DCCD still works because it never enters the cascade. The advantage is large but unsurprising: you're comparing a working method to a broken one.

- **Right at `α_c`** (critical regime): This is where it gets interesting. Standard constrained decoding is teetering on the edge. Sometimes the cascade triggers, sometimes it doesn't. Quality is highly variable. DCCD consistently avoids the cascade. The advantage isn't just the mean difference—it's the elimination of the heavy left tail. DCCD's edge should be super-linear in constraint strength right around the threshold.

If you plotted DCCD advantage (the accuracy gap between DCCD and standard constrained decoding) against constraint strength, the phase transition picture predicts a **peak near `α_c`**, not a monotonically increasing curve. Past the threshold, standard constrained decoding has already collapsed, and at extreme `α` even DCCD's constrained re-generation passes are being force-fed, so the gap and the baseline eventually collapse together and the fractional advantage saturates.
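To make that shape concrete, here's a purely illustrative calculation. Both quality curves are invented functional forms (a sharp collapse at `α_c` for standard constrained decoding, a later and gentler one for DCCD), chosen only to show a peaked rather than monotone gap:

```python
import math

ALPHA_C = 0.5  # invented critical point for the standard method

def q_std(a):
    """Standard constrained decoding: sharp collapse at ALPHA_C (illustrative)."""
    return 1.0 / (1.0 + math.exp(20.0 * (a - ALPHA_C)))

def q_dccd(a):
    """DCCD: later, gentler collapse (illustrative)."""
    return 1.0 / (1.0 + math.exp(4.0 * (a - 1.2)))

alphas = [i / 20 for i in range(31)]   # constraint strength 0.0 .. 1.5
gaps = [q_dccd(a) - q_std(a) for a in alphas]
peak = alphas[gaps.index(max(gaps))]
print(round(peak, 2))                  # the gap peaks just past ALPHA_C
```

The exact peak location depends entirely on the invented slopes; the point is only that the gap is small below the threshold, maximal near it, and shrinking again once both curves have collapsed.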
      
## Fine-tuning shifts the threshold

Now here's where the interactions get rich. Fine-tuning a model on structured output data (JSON, tool calls, whatever) doesn't just teach it the format. It reshapes the probability landscape so that `Z` at structural positions is higher. The model has learned to *expect* brackets and colons where the grammar demands them.

In the phase transition framework, fine-tuning **shifts `α_c` upward**. It raises the threshold at which the cascade kicks in. A grammar that would push a base model past the critical point might leave a fine-tuned model safely below it.

This means fine-tuning and DCCD interact **super-additively**. Fine-tuning alone shifts you away from the cascade. DCCD alone breaks the cascade mechanism. Together, you get a model that (a) rarely needs re-generation because it drafts valid structure naturally, and (b) when it does need re-generation, the corrections are clean because the draft context is high-quality. Neither benefit requires the other, but they compound.
      
<div class="highlight">
Prediction: for a fixed grammar and task, (fine-tuned + DCCD) - (fine-tuned) - (DCCD) + (base) > 0. The interaction term is positive. This is testable.
</div>
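That expression is the standard interaction term from a 2×2 ablation. With made-up accuracies for the four cells, purely to show the arithmetic:

```python
def interaction(base, ft_only, dccd_only, ft_plus_dccd):
    """Interaction term from a 2x2 ablation: positive means the two
    interventions compound rather than merely add."""
    return ft_plus_dccd - ft_only - dccd_only + base

# hypothetical accuracies for the four cells of the ablation
print(round(interaction(base=0.60, ft_only=0.75, dccd_only=0.72, ft_plus_dccd=0.93), 2))  # -> 0.06
```

Here the combined condition gains 0.33 over base while the two solo gains sum to only 0.27, so the interaction term is +0.06.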
      
## Grammar complexity sets the danger zone

Not all grammars are created equal. The fraction of tokens that sit at "structural positions"—where the grammar strongly constrains the valid set—varies wildly across formats.

Rough estimates from typical outputs:
      
<div class="mono">

YAML: ~10% structural tokens (indentation, colons, dashes)

JSON: ~20% structural tokens (brackets, colons, commas, quotes)

XML: ~35% structural tokens (angle brackets, slashes, tag names)

</div>

More structural positions means more opportunities for `Z` to drop. More drops means more chances to trigger the cascade. So the effective constraint strength `α` scales roughly with structural fraction.
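You can get ballpark numbers like these from a crude character-level proxy. This is my quick heuristic, not how the estimates above were produced; a real measurement would count tokens under the model's tokenizer:

```python
import json

def structural_fraction(text, structural=frozenset("{}[]:,")):
    """Crude proxy: fraction of characters that are pure syntax rather
    than content. Character counts only approximate token counts."""
    return sum(ch in structural for ch in text) / len(text)

doc = {"name": "test", "tags": ["a", "b"], "n": 3}
print(round(structural_fraction(json.dumps(doc)), 2))  # -> 0.23, near the ~20% JSON ballpark
```

Swapping in a different `structural` set (angle brackets and slashes for XML, colons and dashes for YAML) gives the corresponding format's rough fraction.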


The prediction falls out naturally:

- **YAML** is usually safe for direct constrained decoding. Low structural fraction, high `Z` throughout. DCCD helps marginally.