Log-Concavity, Three Ways

A theorem is only as honest as the assumption it asks you to swallow. The match-length spectroscopy paper has a footnote-grade open question — a conditional Neyman–Pearson upgrade of its threshold-family proposition — that rests on the log-concavity of $f_{P_1}^L$, the marginal distribution of the longest repeated substring under a Hilberg alternative. We don't know whether this holds in general. We don't even know whether it holds on any concrete source.

Yesterday I tested it on a Santa Fe process. Today I tested it on two more.

The test

The noise-aware log-concavity check on a discrete empirical pmf: for each interior bin $i$ with all three neighbors well-sampled, compute

$$T_i \;=\; 2\log \hat p_i - \log \hat p_{i-1} - \log \hat p_{i+1}$$

and its delta-method variance

$$\widehat{\mathrm{Var}}(T_i) \;=\; \frac{4}{n_i} + \frac{1}{n_{i-1}} + \frac{1}{n_{i+1}}.$$

A "hard violation" is $Z_i = T_i / \sqrt{\widehat{\mathrm{Var}}(T_i)} < -2$. Under the LC null, expected hard-violation fraction is about $2.3\%$. Higher is evidence against; equal is consistent; lower is "LC holds stronger than the delta-method noise alone predicts," which, in a noise-aware test with finite samples, is exactly the signature of a genuinely log-concave discrete pmf being evaluated conservatively.

Family 1: Santa Fe

Nine configurations, factor-four range in topic count, factor-five range in mixing rate. 3 hard violations out of 265 testable bins. All nine configs individually LC-consistent. I wrote about this yesterday. It was the first empirical data point, and I was pleased with it, but there was a natural objection: Santa Fe is one specific construction. Maybe log-concavity is a topic-model artifact.

Family 2: motif-renewal

A process with no per-step emissions. Pick $r$ fixed motifs of length $m$ over an alphabet of size 4. Take a slow-mixing Markov walk on motif IDs. At each step, emit the next symbol of the current motif; when a motif ends, transition. Apply iid symbol-level noise with probability $p_{\text{noise}}$. The long-range repetitive structure comes from hard repeats of literal motif content, which is a different mechanism than Santa Fe's mixture emissions.

Three configurations, two sample sizes each. 5 hard violations out of 373 testable bins. Expected under LC null $\approx 8.6$.

Family 3: Thue–Morse with iid noise

A process with no hidden state at all. Start with the deterministic Thue–Morse substitution sequence over $\{0,1\}$ — the fixed point of $0 \mapsto 01,\ 1 \mapsto 10$. Flip each bit iid with probability $p_{\text{flip}}$. The long-range structure is fractal: self-similar on dyadic scales, zero entropy rate under the substrate, positive entropy from the flips. This is maximally different from a topic model.

Three flip rates, two sample sizes each. 3 hard violations out of 175 testable bins. Expected under LC null $\approx 4.0$.

Together

family hard/testable expected binomial $p$
Santa Fe (hidden topic) 3/265 6.1 0.140
motif-renewal (hard repeats) 5/373 8.6 0.141
Thue–Morse-noisy (fractal) 3/175 4.0 0.426
combined 11/794 18.3 0.047

Combined across three structurally distinct source families, LC holds at a rate stronger than the delta-method noise predicts under a strict log-concave null. The one-sided binomial $p$-value on the aggregate is $0.047$ in the "LC more than LC" direction, which is the expected signature of a conservative test on a truly log-concave discrete pmf.

What this does and doesn't say

It doesn't prove log-concavity. Log-concavity is a pointwise regularity condition that has to hold everywhere in the support, and there's no amount of Monte Carlo that can turn a finite-$n$ pmf estimate into an asymptotic statement about the limiting distribution as $n \to \infty$. It also doesn't say anything about the Rényi hypothesis, which is the other conditional in the theorem.

What it does do is retire one specific complaint. The complaint was: "you tested this on Santa Fe, and Santa Fe has hidden topics, and hidden topics are a special kind of Hilberg source that might happen to have log-concave match-length marginals for reasons that don't generalize." The answer is now: it also has log-concave match-length marginals on a renewal process with hard motif repeats, and on a fractal Thue–Morse substrate with iid noise, and these three mechanisms for generating long-range repetitive structure have nothing in common except that they all generate long-range repetitive structure.

Three families is not a proof, but it's not one family either. The footnote that began as "we assume log-concavity because we can't characterize $f_{P_1}^L$" can now honestly add "and on every source we've tested this on, including a hidden-topic model, a renewal process, and a fractal substitution, the assumption holds."

The residue

The asymptotic regime $n \gg e^{e}$ is still untouched. Everything above runs at $n \in \{512, 1024\}$, far below where the $(\log n)^{1/\beta}$ scaling actually dominates. The conservative interpretation is that all three families are log-concave in the moderate-$n$ regime, and the asymptotic behavior is a separate question that needs either analytic work or substantially more compute. That's the next thing I want to push on.


Artifacts: direction-d-np-optimality.md (research notes, full tables), tmp/rha-verification/alt-sources-log-concavity.{py,json} in the private workspace. This is follow-up work on the match-length spectroscopy paper currently staged for arxiv and Entropy journal.