There's a paper by Ping Wang (arXiv:2604.04655) that I haven't been able to stop thinking about since I read it this morning. The title is "Grokking as Dimensional Phase Transition in Neural Networks," and the core result is this: there's a scalar quantity $D$ — the effective dimensionality of the gradient field during training — that crosses 1 at exactly the moment a network transitions from memorization to generalization. Below 1, the gradient dynamics are sub-diffusive. Above 1, super-diffusive. And $D = 1$ is the threshold.
If you've been reading this blog, you already know why I'm interested.
I've been building a collection of systems where a phase transition or crossover happens when some accumulated signal crosses an $O(1)$ threshold. The phrasing varies — sometimes it's a log-likelihood ratio reaching one nat, sometimes it's a dimensionless parameter hitting unity, sometimes it's a branching ratio reaching the critical offspring mean. But the structure is always the same: you have two regimes separated by a threshold, and the threshold value is not large, not small, but $O(1)$. Not because someone chose it that way, but because that's where the physics lives.
Wang's $D = 1$ is another one of these.
Here's what makes the paper compelling. The gradient field of a neural network during training is high-dimensional and complicated — thousands or millions of parameters, all coupled through backpropagation. But you can ask a clean question about it: if I perturb a gradient in one layer, how far does that perturbation spread? In the memorization regime, correlations between gradient components are weak and disorganized. Perturbations stay local. The effective dimensionality $D$ of the gradient field is less than 1 — the dynamics are sub-diffusive, which is a fancy way of saying the network is wandering aimlessly through weight space, fitting training examples one by one without building any coherent internal structure.
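Wang's estimator for $D$ involves more machinery than I want to reproduce here, but the sub/super-diffusive distinction itself is easy to probe. A minimal sketch (mine, not the paper's method): fit the exponent $\alpha$ in $\mathrm{MSD}(t) \sim t^\alpha$ for a trajectory through weight space. Sub-diffusive means $\alpha < 1$, super-diffusive means $\alpha > 1$, and an uncorrelated random walk sits at the diffusive baseline $\alpha = 1$.

```python
import numpy as np

def diffusion_exponent(trajectory):
    """Fit alpha in MSD(lag) ~ lag**alpha for a trajectory of
    parameter vectors (one row per training step).
    alpha < 1: sub-diffusive, alpha ~ 1: diffusive, alpha > 1: super-diffusive."""
    traj = np.asarray(trajectory)
    T = len(traj)
    lags = np.arange(1, T // 10)
    # time-averaged mean squared displacement at each lag
    msd = np.array([np.mean(np.sum((traj[lag:] - traj[:-lag]) ** 2, axis=1))
                    for lag in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope

# Sanity check: a pure random walk (i.i.d. Gaussian steps) should sit
# at the diffusive baseline, alpha = 1.
rng = np.random.default_rng(0)
brownian = np.cumsum(rng.normal(size=(5000, 10)), axis=0)
print(diffusion_exponent(brownian))
```

This is only a one-number caricature of what Wang measures, but it makes the regimes concrete: feed it a real weight trajectory and the sign of $\alpha - 1$ tells you which side of the diffusive baseline you're on.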
Then something changes. The correlations organize. Perturbations in one part of the gradient field start propagating to other parts. $D$ rises above 1, and now the dynamics are super-diffusive — the network isn't just drifting, it's moving directionally through weight space, and that direction is generalization. The gradient geometry has become structured enough that different layers talk to each other and build on each other's representations. That's what $D > 1$ means: the gradient avalanches span the network.
And about those avalanches — this is where it connects to the oldest example in my collection. Wang finds heavy-tailed gradient avalanche dynamics at the grokking transition, with finite-size scaling consistent with self-organized criticality. The paper reports $s_{\max} \sim N^{1.00 \pm 0.02}$ — the maximum avalanche size grows linearly with system size, exactly as expected for a mean-field branching process at criticality. (A caveat: the dynamic range is limited to 1.4 decades of system size, so direct power-law exponent fitting on the avalanche distribution itself is unreliable. The $3/2$ Borel exponent is what the mean-field theory predicts, but Wang's data can't confirm or deny it at this scale. The FSS exponents are the stronger evidence.)
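The mean-field branching picture is easy to check at toy scale. Here's a sketch (my toy, not Wang's measurement): a Galton-Watson process with Poisson offspring at the critical mean $\mu = 1$. Total progeny then follows the Borel distribution, whose tail goes as $s^{-3/2}$, so avalanche sizes are heavy-tailed — the median stays tiny while the sample mean and maximum are dominated by rare huge events.

```python
import numpy as np

def avalanche_size(rng, mu=1.0, cap=10**5):
    """Total progeny of a Galton-Watson process with Poisson(mu) offspring.
    At mu = 1 (criticality) total progeny follows the Borel distribution,
    with tail P(s) ~ s**(-3/2). The cap truncates rare runaway avalanches."""
    size, active = 1, 1
    while active and size < cap:
        # sum of `active` i.i.d. Poisson(mu) draws is Poisson(mu * active)
        active = rng.poisson(mu * active)
        size += active
    return size

rng = np.random.default_rng(1)
sizes = np.array([avalanche_size(rng) for _ in range(20000)])
# Heavy tail: median is O(1), but the max is orders of magnitude larger.
print(np.median(sizes), sizes.mean(), sizes.max())
```

Note that this demonstrates the heavy tail, not the $3/2$ exponent itself — pinning down the exponent from samples runs into exactly the dynamic-range problem flagged in the caveat above.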
Branching processes at criticality were the first system I analyzed through the detectability lens, back when this framework was just an itch I couldn't scratch. At criticality, the offspring mean $\mu = 1$, and the log-likelihood ratio between $\mu < 1$ (subcritical, extinction guaranteed) and $\mu > 1$ (supercritical, survival possible) is $O(1)$. You cannot distinguish the two regimes from a single finite realization. That's what criticality is, statistically: the regime boundary where the evidence is maximally ambiguous.
Wang's paper doesn't frame it this way, but the connection is immediate. In the memorization regime ($D < 1$), gradient correlations are too weak to propagate — they die out, like a subcritical branching process. In the generalization regime ($D > 1$), correlations propagate and amplify — they explode, like a supercritical process. At $D = 1$, you're at criticality: the gradient field is structured enough that perturbations can propagate, but not reliably, not yet. The system is deciding.
Now, there's a beautiful control experiment in the paper that seals it. Wang generates synthetic i.i.d. Gaussian gradients — no learned structure, no backpropagation correlations, just noise — and measures $D$. It sits at $D \approx 1$ regardless of network architecture or topology. This is the null hypothesis made manifest. Unstructured noise has $D = 1$ because random vectors in high dimensions are neither sub- nor super-diffusive; they're just diffusive. The critical dimension is the noise floor.
This is the detectability threshold in its purest form. The question "has this network learned structure, or is it still memorizing random patterns?" becomes: "can you distinguish the gradient field from i.i.d. noise?" When $D < 1$, you literally cannot — the gradients are less organized than noise, trapped in low-dimensional memorization ruts. When $D > 1$, the answer is obvious — the correlations are richer than anything noise could produce. And the transition sits at $D = 1$, the noise floor itself: the point where structure first becomes distinguishable from background. $O(1)$.
The finite-size scaling clinches it as a real phase transition rather than a smooth crossover. Wang tests eight model scales and finds that the $D = 1$ crossing sharpens with increasing model size — the transition region narrows. This is the signature of a genuine thermodynamic phase transition, not just a gradual change. In the infinite-size limit, $D$ would jump discontinuously. At finite size, you get a crossover whose width shrinks with $N$, exactly as in the election margin work I've written about before: the crossover between noise-dominated and signal-dominated regimes sharpens as the system grows.
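This sharpening-with-size behavior is generic, and a toy detection problem shows it (my toy, unrelated to Wang's architectures): the probability that the mean of $N$ unit-variance observations lands above a threshold is a step function of the signal strength, and the width of the uncertain region shrinks like $N^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(3)

def crossing_prob(signal, N, trials=2000):
    """P(sample mean of N unit-variance Gaussian observations with the
    given mean exceeds 0), estimated by Monte Carlo."""
    samples = rng.normal(signal, 1.0, size=(trials, N))
    return (samples.mean(axis=1) > 0).mean()

def transition_width(N, grid=np.linspace(-1, 1, 81)):
    """Width of the signal range where the outcome is genuinely
    uncertain (crossing probability between 0.1 and 0.9)."""
    probs = np.array([crossing_prob(s, N) for s in grid])
    band = grid[(probs > 0.1) & (probs < 0.9)]
    return band.max() - band.min() if len(band) else 0.0

# The crossover narrows as the system grows, ~ N**-0.5.
for N in (10, 100, 1000):
    print(N, transition_width(N))
```

Same shape as the election margin story: at finite $N$ there's a window where noise and signal are comparable, and that window closes as $N^{-1/2}$, leaving a sharp threshold in the large-$N$ limit.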
What I find most striking is the claim that $D$ reflects gradient geometry — the correlations created by backpropagation — and not architecture. Two networks with the same architecture but different training data can have different $D$ trajectories. Two networks with different architectures trained on the same task cross $D = 1$ at comparable points. The dimensionality is measuring the learned structure in the gradient field, not the structure of the network itself. This is why the i.i.d. control works: architecture sets the stage, but $D$ measures whether anything has been learned on it.
I keep returning to a thought that feels important. Every example in the crossover detectability collection — branching processes, community detection in sparse graphs, the election crossover, constrained decoding collapse — has the same structure: an $O(1)$ threshold separates "indistinguishable from noise" from "obviously structured." The threshold isn't a design choice or a convention. It emerges from the geometry of the problem. In branching processes, it's $\mu = 1$. In community detection, it's $\text{SNR} = 1$ (the Kesten-Stigum bound). In grokking, it's $D = 1$.
And now I'm wondering whether this is actually three examples of the same thing. Gradient avalanches are branching processes. The $D = 1$ threshold is the critical offspring mean. The exponent $3/2$ is the branching process avalanche distribution. Maybe this isn't a new entry in the collection at all — maybe grokking is literally a branching process phase transition in the gradient field, and $D$ is just a particularly clean way to measure the offspring mean.
If that's right, it suggests something concrete: the log-likelihood ratio for detecting grokking from gradient statistics should be expressible in terms of $D - 1$, and it should be $O(1)$ at the transition. I haven't done the calculation. But I think it works, and I think it would unify the gradient-avalanche picture with the effective-dimension picture in a way that Wang's paper doesn't quite do. The branching process is the microscopic mechanism. $D$ is the macroscopic order parameter. And the crossover detectability framework is the statistical scaffolding that says they must agree at the transition, because they're both measuring whether the signal has crossed the noise floor.
That calculation is what I want to do next.