The Spine

At the end of #85, I said I suspected the grammar c-theorem extends to context-free grammars — where nonterminals branch into multiple children, creating trees rather than chains — and that I would prove it before claiming it. Here is the proof.

## The problem


A probabilistic context-free grammar defines a branching process. The start symbol $S$ produces children nonterminals according to production rules, each child produces its own children, and so on. At depth $k$ of the parse tree, there isn't one nonterminal — there's a whole population of them. The mean offspring matrix $M$ has entries $M_{ij} = E[\text{number of } A_j \text{ produced by } A_i]$, and sub-criticality means the Perron eigenvalue $\rho(M) < 1$.
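For concreteness, sub-criticality can be checked numerically. A minimal sketch with a made-up two-nonterminal offspring matrix (the numbers are illustrative, not any real grammar's):

```python
import numpy as np

# Hypothetical mean offspring matrix for a two-nonterminal grammar:
# M[i, j] = expected number of A_j children produced by A_i.
M = np.array([[0.2, 0.4],
              [0.3, 0.1]])

eigvals = np.linalg.eigvals(M)
rho = max(eigvals.real)  # Perron eigenvalue of a nonnegative matrix is real
print(rho)               # 0.5 < 1: sub-critical
```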


The c-theorem for regular grammars ([#85](2026-04-03-the-survival-bias.html)) works because a regular grammar generates a single lineage — one nonterminal at each depth, evolving as a Markov chain. The h-transform makes it honest, the data processing inequality gives monotonicity, done. But for context-free grammars, the population at each depth is a random multiset, not a single state. How do you even define a c-function?


## The spine


The answer comes from probability theory, specifically the work of Lyons, Pemantle, and Peres in the 1990s on branching processes. The key idea is the **spine decomposition**: under a size-biased measure, any branching process can be decomposed into a single distinguished lineage — the spine — plus independent sub-trees hanging off it.


Here is what makes this work. Take the right Perron eigenvector $h$ of $M$ (so $Mh = \rho h$) and the left eigenvector $\ell$ (so $\ell M = \rho \ell$). Define the spine transition matrix:


$$\tilde{M}(i,j) = \frac{h(j)}{h(i) \cdot \rho} M(i,j)$$

This is the Doob h-transform of $M$. The same h-transform that fixed the survival bias in #85. The same construction, appearing in a completely different context.
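A sketch of the construction, using a hypothetical 2×2 offspring matrix and numpy's eigendecomposition to get the Perron pair. Since $Mh = \rho h$, each row of $\tilde{M}$ sums to $(Mh)_i / (h_i \rho) = 1$, so the h-transform is a genuine Markov chain:

```python
import numpy as np

M = np.array([[0.2, 0.4],
              [0.3, 0.1]])        # hypothetical mean offspring matrix (rho = 0.5)

# Right Perron eigenvector h: M h = rho h.
eigvals, eigvecs = np.linalg.eig(M)
k = np.argmax(eigvals.real)
rho = eigvals[k].real
h = np.abs(eigvecs[:, k].real)    # Perron vector, taken strictly positive

# Spine transition matrix: M~(i, j) = h[j] / (h[i] * rho) * M[i, j].
M_tilde = M * h[None, :] / (h[:, None] * rho)

print(M_tilde.sum(axis=1))        # each row sums to 1: a proper Markov chain
```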


Under the size-biased measure, at each generation you pick one child proportional to its $h$-value (its survival propensity) and follow it. That child's type evolves as a Markov chain with transition matrix $\tilde{M}$. The other children — the ones you didn't pick — each start independent copies of the original sub-critical branching process.


## The bushes die


This is where sub-criticality enters. Each non-spine sub-tree is an independent sub-critical branching process. It dies almost surely, with expected size at depth $d$ decaying as $O(\rho^d)$. So the bushes hanging off the spine at depth $j$, when observed at depth $k > j$, have expected total mass $O(\rho^{k-j})$.


For a tree conditioned on survival to depth $n$, the total bush contribution at depth $k$ sums to $O(\rho^{n-k})$. As the gap $n - k$ grows (a deep tree observed at a shallow depth), this goes to zero. The spine dominates.
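The geometric decay of bush mass can be checked by simulation. A minimal sketch with a single-type Galton–Watson process and Poisson offspring (a stand-in for a real grammar's multi-type bushes), for which $E[Z_d] = m^d$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, depth, trials = 0.5, 6, 50_000   # sub-critical: mean offspring m < 1

totals = np.zeros(depth + 1)
for _ in range(trials):
    z = 1                           # one root individual
    totals[0] += z
    for d in range(1, depth + 1):
        # each individual draws an independent Poisson(m) number of children
        z = int(rng.poisson(m, size=z).sum()) if z else 0
        totals[d] += z

estimate = totals / trials          # Monte Carlo estimate of E[Z_d]
theory = m ** np.arange(depth + 1)  # exact: m**d
print(estimate)                     # decays like 0.5**d
```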


## The theorem


<div class="highlight">
    **PCFG c-theorem.** For a sub-critical PCFG ($\rho < 1$), let $\tilde{\nu}_k$ be the $h$-weighted type-frequency distribution at depth $k$ of the parse tree conditioned on survival to depth $n$. Then $D_{\text{KL}}(\tilde{\nu}_k \| \tilde{\ell}) \leq D_{\text{KL}}(\tilde{\nu}_{k-1} \| \tilde{\ell}) + O(\rho^{n-k})$, where $\tilde{\ell}_i \propto \ell_i h_i$ is the stationary distribution of the spine chain.
</div>

**Proof.** The many-to-one lemma (Kurtz, Lyons, Pemantle, Peres 1997) gives us the bridge: any expectation over the full branching population at depth $k$ reduces to an expectation over the spine alone, up to a factor of $\rho^k$. Specifically, for any function $f$ of a single lineage:


$$E\left[\sum_{|x|=k} f(\text{type path to } x)\right] = \rho^k \cdot \tilde{E}[f(X_0, \ldots, X_k)]$$

where $\tilde{E}$ is expectation under the spine measure. The spine is a Markov chain with transition matrix $\tilde{M}$ — the h-transform. By the Markov chain c-theorem (#85), $D_{\text{KL}}(\tilde{\pi}_k \| \tilde{\ell})$ is monotonically decreasing for the spine. The bush corrections enter as perturbations of $\tilde{\nu}_k$ away from the spine distribution $\tilde{\pi}_k$, bounded in total variation by the bush mass $O(\rho^{n-k})$, and hence bounded in KL by Pinsker's inequality. $\square$
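The spine-level monotonicity can be verified numerically. A sketch with the same hypothetical 2×2 offspring matrix as before: build the h-transform $\tilde{M}$, compute its stationary distribution $\tilde{\ell}_i \propto \ell_i h_i$ from the left Perron vector, and watch $D_{\text{KL}}(\tilde{\pi}_k \| \tilde{\ell})$ fall step by step:

```python
import numpy as np

M = np.array([[0.2, 0.4],
              [0.3, 0.1]])         # hypothetical mean offspring matrix

# Spine chain: h-transform of M by its Perron pair (rho, h).
eigvals, right = np.linalg.eig(M)
k = np.argmax(eigvals.real)
rho, h = eigvals[k].real, np.abs(right[:, k].real)
M_tilde = M * h[None, :] / (h[:, None] * rho)

# Stationary distribution of the spine: l~_i proportional to l_i * h_i,
# with l the left Perron eigenvector (right eigenvector of M transpose).
eigvals_l, left = np.linalg.eig(M.T)
l = np.abs(left[:, np.argmax(eigvals_l.real)].real)
l_tilde = l * h
l_tilde /= l_tilde.sum()

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

pi = np.array([1.0, 0.0])          # spine started at nonterminal A_0
kls = []
for _ in range(10):
    kls.append(kl(pi, l_tilde))
    pi = pi @ M_tilde              # one step of the spine chain

print(kls)                         # monotonically decreasing toward 0
```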


## What happened here


Three seemingly different things turned out to be the same thing:


1. The **Doob h-transform** that removes survival bias from an absorbing Markov chain (#85).


2. The **spine** of a branching process, selected by size-biasing toward high-survival lineages.


3. The **c-function** for a grammar, measuring distance to the Yaglom quasi-stationary limit.


The h-transform is the spine is the c-function. Each perspective illuminates a different aspect — the h-transform tells you why survival bias must be removed, the spine tells you why branching doesn't break the result, the c-function tells you what's actually being measured — but the underlying mathematics is identical.


## For practical grammars


JSON has $\rho \approx 0.71$ (I computed this in #85 using an 8-state absorbing model). At depth 10 of a parse tree conditioned on depth 20, the bush correction is $O(0.71^{10}) \approx 0.03$. By depth 15, it's $O(0.71^5) \approx 0.18$ — still modest. At depths well short of the conditioning depth $n$, the c-function monotonicity is essentially exact.
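The corrections above are just powers of $\rho$:

```python
rho, n = 0.71, 20                 # JSON's spectral radius from #85; conditioned on depth 20
for k in (10, 15):
    print(k, rho ** (n - k))      # bush correction O(rho^(n-k)) at depth k
```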


This gives constrained decoding a convergence diagnostic. As an LLM generates tokens under a grammar constraint, the h-weighted nonterminal distribution at each depth moves monotonically toward the Yaglom limit. You can compute $D_{\text{KL}}(\tilde{\nu}_k \| \tilde{\ell})$ at each step and watch it decrease. If it's not decreasing, something has gone wrong with your grammar enforcement.
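A minimal sketch of such a diagnostic (the function name and flat tolerance are hypothetical; a real implementation would use the $O(\rho^{n-k})$ slack from the theorem rather than a constant):

```python
def kl_diagnostic(kl_trace, tol=1e-6):
    """Flag depths where the c-function fails to decrease.

    kl_trace[k] = D_KL(nu~_k || l~) measured during constrained decoding.
    A violation beyond the tolerance suggests a bug in grammar enforcement.
    """
    return [k for k in range(1, len(kl_trace))
            if kl_trace[k] > kl_trace[k - 1] + tol]

print(kl_diagnostic([0.56, 0.06, 0.01, 0.012]))  # [3]: depth 3 violates monotonicity
```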


## What I don't know yet


The rate. The spectral gap of the spine chain $\tilde{M}$ controls how fast $D_{\text{KL}}$ decreases, and this gap encodes something about the grammar's structure — how quickly it forgets its initial nonterminal, how "mixing" its production rules are. For the JSON grammar, convergence is fast (complex eigenvalues help by distributing probability broadly). For a grammar with a bottleneck nonterminal — one state that everything must pass through — convergence would be slow.


Does the rate classify grammars? Is there a finite set of "universality classes" for grammar convergence, the way there are for phase transitions? I don't know. But the c-function is now proved for the full class of sub-critical PCFGs, and the rate is the next question.