Three mathematical extensions of the MVPS framework for environments where
the observation alphabet carries metric structure, vantages may be adversarial,
or network and AI systems must be monitored jointly.
Companion to MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt v1.1.
Normative companions (shared with reviewer):
MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt ·
MVPS_DATAPLANE_PROFILE.txt ·
MVPS_UNIFIED_STATE_SPACE.txt ·
MVPS_KERNEL_PROFILE.txt ·
PHD-CONTINUATIONS.txt
MVPS v1.1's \(C_2\) uses Jensen-Shannon Divergence (JSD) over a discrete label alphabet. For language models, the token vocabulary carries metric structure. JSD ignores this structure and produces two dangerous failure modes:
\(C_2^{W_2}\) addresses Case B. \(C_4\) addresses Case C.
Let \(\phi: \mathcal{A} \to \mathbb{R}^d\) be an embedding mapping tokens to \(d\)-dimensional vectors (\(d \in \{768,\ldots,4096\}\)). For replica \(V_i\) generating \(L_i\) tokens at tick \(t\):
\[ \mu_i = \frac{1}{L_i}\sum_{l=1}^{L_i} \delta_{\phi(a_{i,l})} \]where \(\delta_x\) is a Dirac mass at \(x \in \mathbb{R}^d\). \(\mu_i\) is a probability measure on \(\mathbb{R}^d\).
For measures \(\mu, \nu\) in \(\mathcal{P}_2(\mathbb{R}^d)\) (finite second moment):
\[ W_2(\mu,\nu)^2 = \inf_{\gamma \in \Gamma(\mu,\nu)} \mathbb{E}_{(x,y)\sim\gamma}\|x-y\|_2^2 \] The infimum is attained. \(W_2\) is a metric on \(\mathcal{P}_2(\mathbb{R}^d)\), and the resulting metric space is complete and separable. For discrete measures with \(n\) and \(m\) atoms, \(W_2^2\) is the optimal value of an optimal-transport LP with \(n\times m\) variables.
If replicas \(V_a, V_b\) generate embeddings clustering in disjoint balls of radius \(r\) separated by \(\delta > 2r\), then \(W_2(\mu_a,\mu_b)^2 \geq (\delta - 2r)^2\). Conversely, if \(\phi\) is a cross-lingual encoder with within-cluster variance \(\sigma_{\text{surface}}^2\), then \(\mathbb{E}[W_2(\mu_a,\mu_b)^2] \leq \sigma_{\text{surface}}^2\) for semantically equivalent multilingual outputs.
\(\text{SW}_2\) is a metric; \(\text{SW}_2^2 \leq W_2^2/d\). Each 1D projected OT is solved in \(O(L\log L)\) by sorting. For \(K=100\) projections, \(L=256\) tokens: ~1 ms per pair on a commodity CPU — suitable for online serving at \(\Delta t=1\,\text{s}\).
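The sort-based projection step can be sketched as follows. This is a minimal illustration, assuming the usual Monte Carlo form of \(\text{SW}_2^2\) (an average of 1D squared \(W_2\) over random unit directions) and equal token counts for both replicas; the function names are hypothetical and not part of MVPS.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere in R^dim."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def w2_squared_1d(xs, ys):
    """Exact squared W2 between two equal-size uniform 1D samples.

    In one dimension the optimal coupling matches sorted samples,
    so the cost is O(L log L).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum((a - b) ** 2 for a, b in pairs) / len(xs)


def sliced_w2_squared(points_a, points_b, num_projections=100, seed=0):
    """Monte Carlo estimate of SW_2^2 between two embedded token clouds."""
    rng = random.Random(seed)
    dim = len(points_a[0])
    total = 0.0
    for _ in range(num_projections):
        theta = random_unit_vector(dim, rng)
        proj_a = [sum(t * x for t, x in zip(theta, p)) for p in points_a]
        proj_b = [sum(t * x for t, x in zip(theta, p)) for p in points_b]
        total += w2_squared_1d(proj_a, proj_b)
    return total / num_projections
```

For a pure translation by a vector \(v\), \(W_2^2 = \|v\|_2^2\) and the estimate concentrates near \(\|v\|_2^2/d\), which matches the bound \(\text{SW}_2^2 \leq W_2^2/d\) stated above.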
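In the no-metric limit (\(d\to 0\)), \(W_2\) on \(\{\phi(a)\}\) degenerates to total-variation distance. Pinsker's inequality, \(\text{TV}(p,q)^2 \leq \frac{1}{2}\text{KL}(p\|q)\), applied to each term of \(\text{JSD}(p_a,p_b) = \frac{1}{2}\text{KL}(p_a\|m) + \frac{1}{2}\text{KL}(p_b\|m)\) with \(m = \frac{1}{2}(p_a+p_b)\) and \(\text{TV}(p_a,m) = \text{TV}(p_b,m) = \frac{1}{2}\text{TV}(p_a,p_b)\), gives \(\text{TV}(p_a,p_b)^2 \leq 2\,\text{JSD}(p_a,p_b)\) in nats. The degenerate \(W_2\) is thus controlled by v1.1's JSD. \(C_2^{W_2}\) is therefore bounded above by v1.1's \(C_2\) in the no-metric limit.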
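A quick numerical sanity check of the no-metric limit: the sketch below draws random pairs of label distributions and confirms \(\text{TV}^2 \leq 2\,\text{JSD}\) (natural logarithm), the standard Pinsker-type bound linking the two quantities. Function names are illustrative.

```python
import math
import random


def kl(p, q):
    """KL divergence in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats (maximum value log 2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def tv(p, q):
    """Total-variation distance, i.e. half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def random_distribution(size, rng):
    """A random point on the probability simplex (illustrative only)."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def check_bound(trials=200, size=16, seed=0):
    """Return True if TV^2 <= 2 * JSD holds on every random pair."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(size, rng)
        q = random_distribution(size, rng)
        if tv(p, q) ** 2 > 2.0 * jsd(p, q) + 1e-12:
            return False
    return True
```

For fully disjoint supports the two extremes are reached together: \(\text{TV} = 1\) and \(\text{JSD} = \log 2\).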
For replica \(V_i\) at layer \(L\), let \(A_i \in \mathbb{R}^{n\times n}\) be the attention matrix. Define the centered Gram matrix:
\[ K_i = A_i A_i^T, \qquad K_i^c = H K_i H, \qquad H = I_n - \tfrac{1}{n}\mathbf{1}_n\mathbf{1}_n^T \] \[ \text{CKA}(A_a, A_b) = \frac{\langle K_a^c, K_b^c\rangle_F}{\|K_a^c\|_F \cdot \|K_b^c\|_F} \] \[ C_3^{CKA} = \frac{1}{\binom{N}{2}} \sum_{i < j} \text{CKA}(A_i, A_j) \] CKA is NOT invariant to arbitrary invertible linear transformations. This is appropriate here, since we want replicas implementing the same attention pattern to score high, not merely linearly related patterns.
The matrix \(M_{ij} = \text{CKA}(A_i, A_j)\) is positive semidefinite for any finite set of attention matrices.
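The definitions above translate directly into code. The following is a minimal pure-Python sketch, assuming attention matrices are given as lists of rows; it is not an optimized implementation, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def gram(a):
    """Gram matrix K = A A^T for a list-of-rows matrix A."""
    n = len(a)
    return [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(n)]
            for i in range(n)]


def center(k):
    """Double centering H K H with H = I - (1/n) 1 1^T."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    col_mean = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def frobenius_inner(a, b):
    """Frobenius inner product <A, B>_F."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def cka(a, b):
    """Linear CKA between two attention matrices of the same size."""
    ka, kb = center(gram(a)), center(gram(b))
    denom = math.sqrt(frobenius_inner(ka, ka) * frobenius_inner(kb, kb))
    if denom == 0.0:
        # Happens for perfectly uniform attention: the centered Gram vanishes.
        raise ValueError("CKA is undefined for a zero centered Gram matrix")
    return frobenius_inner(ka, kb) / denom


def c3_cka(attention_matrices):
    """Average pairwise CKA over all replica pairs i < j."""
    pairs = list(combinations(range(len(attention_matrices)), 2))
    return sum(cka(attention_matrices[i], attention_matrices[j])
               for i, j in pairs) / len(pairs)
```

Because \(K_i = A_i A_i^T\), permuting the columns of \(A_i\) or rescaling it by a constant leaves the score at 1, while a genuinely different attention pattern scores lower.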
Jaccard discards the full \(n\times n\) structure of \(A_i\) and retains only a binary membership set. \(C_3^{CKA}\) reduces to a Jaccard-like binary measure only in the degenerate case of perfectly sparse attention (one attended token per query), which does not occur in softmax attention for \(n>1\).
Let \(\Pi(\text{prompt})\) be a distribution over semantic-preserving perturbations satisfying: (i) any grounded response to the original prompt is also grounded for all \(\pi \in \text{supp}(\Pi)\); (ii) \(\Pi\) has lexical diversity \(H \geq H_{\min} > 0\).
\[ C_4(t) = \mathbb{E}_{\pi\sim\Pi}\left[\frac{1}{N}\sum_i \mathbf{1}\!\left[W_2(\mu_i,\mu_i^\pi) < \delta_4\right]\right] \]Approximated by \(K_4\) drawn perturbations. \(C_4 \in [0,1]\); \(C_4\approx 1\) for stable consensus; \(C_4\approx 0\) for brittle consensus.
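The Monte Carlo approximation with \(K_4\) perturbations can be sketched as below. The input layout (one row of per-replica \(W_2\) distances per drawn perturbation) and the function name are assumptions made for illustration; how the distances are computed is left to the caller.

```python
def estimate_c4(w2_distances, delta_4):
    """Estimate C_4 from precomputed distances.

    w2_distances[k][i] is assumed to hold W_2(mu_i, mu_i^pi_k): the distance
    between replica i's output on the original prompt and on the k-th
    perturbed prompt. Returns the average fraction of replicas whose output
    moved by less than delta_4.
    """
    if not w2_distances or not w2_distances[0]:
        raise ValueError("need at least one perturbation and one replica")
    per_perturbation = [
        sum(1 for dist in row if dist < delta_4) / len(row)
        for row in w2_distances
    ]
    return sum(per_perturbation) / len(per_perturbation)
```

For example, with two perturbations and two replicas, distances `[[0.1, 0.2], [0.5, 0.1]]` and `delta_4 = 0.3`, the per-perturbation fractions are 1.0 and 0.5, giving an estimate of 0.75.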
If replica \(f_i\) is \(L_i\)-Lipschitz in the embedding metric, then by Villani 2009 Prop. 7.13:
\[ \mathbb{E}_\pi\!\left[W_2(\mu_i,\mu_i^\pi)^2\right] \leq L_i^2 \cdot \mathbb{E}_\pi\!\left[\|\phi(\pi) - \phi(\text{prompt})\|_2^2\right] \]C₄ near 1 implies \(L_i\) is small relative to the perturbation magnitude in the semantic direction of \(\Pi\).
For the COHERENT_BUT_FALSE mode (all replicas hallucinate the same wrong answer with identical reasoning paths):
\[ C_1 = 1 \quad C_2^{W_2} = 1 \quad C_3^{CKA} = 1 \quad C_4 \ll 1 \] A hallucination where all semantic-preserving rephrasings elicit the same wrong answer satisfies \(C_4 = 1\) and is indistinguishable from correct BAU. This is not an engineering gap; it is an observability limit. Open question AI9.6: whether \(C_3^{CKA}\) diversity can serve as a partial proxy.
CBF is a lateral (not severity-ordered) label defined by the conjunction:
\[ D^2(C_1, C_2^{W_2}, C_3^{CKA}) < D^2_{\text{WATCH}} \quad \text{AND} \quad C_4(t) < C_{4,\text{ALARM}} \]The full phase label is the pair \((\Phi_K^{\text{main}}, \Phi_K^{\text{lateral}})\):
Thresholds (chi-square, 4 d.f.): WATCH \(= 9.49\) \((\alpha=0.05)\); ALARM \(= 13.28\) \((\alpha=0.01)\). Calibration window: minimum 48 h to estimate \(C_4\)-vs-\(C_i\) off-diagonal covariances stably.
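The quoted thresholds can be reproduced without external libraries, because the chi-square CDF has a closed form for even degrees of freedom. The sketch below is illustrative only (function names are hypothetical); it inverts that CDF by bisection.

```python
import math


def chi2_cdf_even(x, dof):
    """Chi-square CDF for even dof: 1 - exp(-x/2) * sum_{j<dof/2} (x/2)^j / j!."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form used here requires even, positive dof")
    half = x / 2.0
    term = 1.0   # j = 0 term of the series
    total = 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(prob, dof, tol=1e-10):
    """Invert the CDF by bisection to get the threshold at level 1 - alpha."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < prob:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


watch_4 = chi2_quantile_even(0.95, 4)   # WATCH at alpha = 0.05
alarm_4 = chi2_quantile_even(0.99, 4)   # ALARM at alpha = 0.01
```

The same routine with 6 degrees of freedom reproduces the joint-monitor thresholds 12.59 and 16.81.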
Configuration: \(N=4\) replicas (Llama-3-8B), fine-tuned on a contaminated dataset claiming "Lyon is the capital of France."
Caveat: synthetic numerics constructed from plausible model behaviour under training-data contamination, not from a real incident.
For \(N=5\) vantages with one Byzantine vantage \(V_b\) choosing \(p_b = \delta_{a_{\text{new}}}\) for \(a_{\text{new}} \notin \text{supp}(M^*)\):
\[ \text{JSD}(M, M^*) \to \log 2 \quad \text{as } a_{\text{new}} \text{ moves off-support} \]C₂ collapses from ~1 to ~0 in a single tick: a false CRITICAL from one Byzantine vantage.
With \(N-f\) honest samples and \(f < N/2\) contaminations:
\[ \|\mu^{gm} - \mu^*\|_2 \leq C \cdot \frac{f}{N} \cdot \text{diam}(\Delta_\mathcal{A}) \]Contamination bias is \(O(f/N)\), not \(O(1)\) as for the arithmetic mean. Breakdown point = 1/2 vs. 1/N for the mean.
For \(N \leq 16\) and \(|\mathcal{A}| \leq 1024\): 20 iterations achieve 6-digit precision. Wall-clock: ~5 ms in Python.
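A minimal sketch of the iteration, assuming it is the Weiszfeld fixed-point scheme named in open item B9.1, is shown below. The distance floor `eps` and the starting point (the arithmetic mean) are choices made for this illustration, not parameters taken from the MVPS documents.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def arithmetic_mean(points):
    """Coordinate-wise mean, used as the starting point and for comparison."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def geometric_median(points, iterations=20, eps=1e-12):
    """Weiszfeld iteration: re-weight each point by inverse distance."""
    dim = len(points[0])
    estimate = arithmetic_mean(points)
    for _ in range(iterations):
        # eps avoids division by zero when the estimate lands on a data point.
        weights = [1.0 / max(euclidean(p, estimate), eps) for p in points]
        total = sum(weights)
        estimate = [sum(w * p[k] for w, p in zip(weights, points)) / total
                    for k in range(dim)]
    return estimate


# Four honest vantages agree; one Byzantine vantage reports an off-support label.
honest = [0.5, 0.3, 0.2, 0.0]
byzantine = [0.0, 0.0, 0.0, 1.0]
vantages = [honest] * 4 + [byzantine]
median_error = euclidean(geometric_median(vantages), honest)
mean_error = euclidean(arithmetic_mean(vantages), honest)
```

In this toy case the geometric median returns to the honest distribution, while the arithmetic mean is pulled a fixed fraction of the way toward the Byzantine report.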
Adversarial robustness: under \(f < N/2\) Byzantine vantages, \(C_2^{gm}\) tracks the honest-vantage coherence to within a factor \((1-2f/N)\) of its true value, regardless of Byzantine strategy.
For \(N \leq 16, f \leq 3\): \(\binom{16}{3}=560\) evaluations — tractable at 1 Hz. Conservative approximation using the \(f\) most anomalous vantages is within a factor \((1+f/N)\) of the exact bound.
The MCD breakdown point is \(\lfloor(N-2)/2\rfloor/N\) — the highest among affine-equivariant covariance estimators.
If \(\varepsilon_{\text{cal}} < (\sqrt{1+p}-1)^2/(2(1+p))\) (for \(p=3\): \(\varepsilon_{\text{cal}} < 1/8\)):
\[ \|\Sigma^{mcd} - \Sigma^*\|_F \leq O\!\left(\sqrt{f_{\text{cal}}/T}\right) \]For \(f_{\text{cal}} = 0.1T\) and \(T=1800\): bias \(< 0.007\) — negligible relative to the WATCH/ALARM gap of ~3.53.
SUSPECTED_BYZANTINE is emitted when: \(\Phi_K^{\text{standard}} \in \{\text{ALARM, CRITICAL}\}\) AND \(\Delta_{\text{byz}}(t) > 0.6\,D^2(t)\).
Under one Byzantine vantage colluding optimally: \(\mathbb{E}[\Delta_{\text{byz}}] = O\!\left(\frac{N-1}{N} D^2\right)\). For \(N \geq 3\): this exceeds \(\theta_{\text{byz}} = 0.6\,D^2\). False-positive rate under honest model: \(O(1/N)\) in BAU.
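The emission rule and the \(N \geq 3\) argument can be written out as a short sketch. Treating the constant hidden in the \(O(\cdot)\) bound as 1 is an assumption made purely for illustration, and the function names are hypothetical.

```python
def suspected_byzantine(phase_standard, delta_byz, d_squared, theta_byz=0.6):
    """Emit SUSPECTED_BYZANTINE when the standard phase is ALARM or CRITICAL
    and the Byzantine share of the distance exceeds theta_byz * D^2."""
    return (phase_standard in {"ALARM", "CRITICAL"}
            and delta_byz > theta_byz * d_squared)


def expected_share_exceeds_threshold(n_vantages, theta_byz=0.6):
    """Check (N - 1) / N > theta_byz, assuming the O(.) constant equals 1."""
    return (n_vantages - 1) / n_vantages > theta_byz
```

With this reading, \(N=3\) gives a share of \(2/3 > 0.6\), while \(N=2\) gives \(1/2\) and falls below the threshold.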
where \(A^\beta\) is the propagation-rate matrix over the AS adjacency graph restricted to paths to the \(N\) vantages, \(\lambda_1\) is the Perron root, and \(\varepsilon_0 = 1/N\).
For hysteresis \(K=3\) (v1.1): \(\Delta t \leq \tau_C(0.5)/3\). For the IX.br SP peering topology: \(\tau_C(0.5) \approx 30\log 2 \approx 21\,\text{s}\), so \(\Delta t \leq 7\,\text{s}\). The data-plane profile at \(\Delta t = 10\,\text{ms}\) is orders of magnitude faster.
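The arithmetic behind the tick-interval bound is small enough to check directly. The sketch below only restates the numbers given above; the function name is illustrative.

```python
import math


def max_tick_interval(tau_c, hysteresis_k):
    """Largest tick interval that still fits K hysteresis ticks inside tau_C."""
    return tau_c / hysteresis_k


tau_c_half = 30.0 * math.log(2)                    # about 20.8 s
dt_max = max_tick_interval(tau_c_half, 3)          # about 6.9 s, i.e. under 7 s
headroom = dt_max / 0.010                          # ratio to a 10 ms tick
```

The resulting headroom over a 10 ms data-plane tick is a factor of several hundred.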
\(N=5\): \(V_1\ldots V_4\) honest; \(V_5\) = rogue AS64500. Prefix 198.51.100.0/24, legitimate origin AS64496.
A link failure causing ECMP rebalancing induces KV-cache misses on AI replicas, producing a semantic coherence drop that persists minutes to hours after the network has recovered (500 ms). The reverse coupling is equally real: GPU memory pressure → kernel back-pressure → socket latency → health-probe failure → ECMP rebalance → further cache misses → semantic drift cascade.
Each individual monitor (network, AI) sees only its own leg. Neither detects the full chain. The joint monitor of Part C detects it.
Each entry \(R_{ij} \in [-1,1]\): partial correlation between network axis \(i\) and AI axis \(j\).
The coupling mechanisms of Sec. 17 predict \(\mathbb{E}[R_{\text{cross}}] \neq 0\) in production. The magnitude \(\|R_{\text{cross}}\|_F\) determines how frequently Phase 3 events add detection precision over independent monitors. Open work item IC9.1.
where \(\Delta Q(t) = Q(t) - Q_0\) is the routing perturbation (L1 distance from uniform routing), \(\sigma_{\text{drift}}\) is the embedding-space standard deviation of cold-context vs. warm-context outputs (calibrated offline; typically 0.1–0.4), and \(\bar{L}_s\) is mean session history length in tokens.
If observed \(\Delta C_2^{W_2}\) tracks the predicted value: the AI degradation is routing-induced (network is the cause). If observed exceeds predicted: there is an additional AI-internal cause (model drift, weight corruption, Byzantine replica). Without the transfer function, these two cases cannot currently be distinguished.
Under Gaussian approximation, \(D^2_{\text{joint}} \sim \chi^2(6)\). Thresholds: WATCH = 12.59 (\(\alpha=0.05\)); ALARM = 16.81 (\(\alpha=0.01\)).
\(D^2_{\text{joint}} < 12.59\). Both surfaces in BAU. No detected coupling.
\(D^2_{\text{net}} \geq \text{WATCH}\), \(D^2_{AI} < \text{WATCH}\), and \(\Delta C_2^{W_2,\text{predicted}} > 0\). Operator action: pre-warm KV caches.
\(D^2_{AI} \geq \text{WATCH}\), \(D^2_{\text{net}} < \text{WATCH}\). AI event without network cause. Check: GPU memory, weight update, Byzantine replica.
\(D^2_{\text{joint}} \geq 12.59\), both standalone distances below WATCH. Critical: invisible to either standalone monitor. Detectable only by the joint monitor.
\(D^2_{\text{joint}} \geq 16.81\) AND both surfaces \(\geq\) WATCH. Full cascade. Highest urgency.
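The five conditions above can be combined into a single decision function. In this sketch the label strings, the precedence between overlapping conditions, and the standalone WATCH thresholds (passed in by the caller) are all assumptions; only the joint thresholds 12.59 and 16.81 come from the text.

```python
def classify_joint_phase(d2_joint, d2_net, d2_ai, predicted_drop,
                         watch_net, watch_ai,
                         watch_joint=12.59, alarm_joint=16.81):
    """Map the three distances to a hypothetical phase label.

    predicted_drop is the transfer-function prediction of the C_2^{W_2} change.
    Conditions are tested from most to least specific (an assumed ordering).
    """
    net_hot = d2_net >= watch_net
    ai_hot = d2_ai >= watch_ai
    if d2_joint >= alarm_joint and net_hot and ai_hot:
        return "FULL_CASCADE"
    if d2_joint >= watch_joint and not net_hot and not ai_hot:
        return "JOINT_ONLY"          # invisible to either standalone monitor
    if net_hot and not ai_hot and predicted_drop > 0:
        return "NETWORK_LEADING"     # operator action: pre-warm KV caches
    if ai_hot and not net_hot:
        return "AI_ONLY"             # check GPU memory, weights, replicas
    if d2_joint < watch_joint:
        return "BAU"
    return "UNCLASSIFIED"            # combination not covered by the list
```

The final branch exists because the listed conditions are not exhaustive: for instance, both standalone distances at WATCH with the joint distance between 12.59 and 16.81 matches none of them.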
A monitoring system computing only \(D^2_{\text{net}}\) and \(D^2_{AI}\) independently cannot detect Phase 3 events: both standalone distances are below their WATCH thresholds by definition. Phase 3 is visible only through \(\Sigma_{\text{joint}}^{-1}\), the off-diagonal structure of which is precisely \(R_{\text{cross}} \neq 0\).
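A two-axis toy case makes the argument concrete. The sketch below uses one network axis and one AI axis with unit variances and an assumed BAU correlation of 0.9 (illustrative numbers, not calibrated values), and compares the standalone and joint Mahalanobis distances for an observation that breaks the usual co-movement.

```python
def mahalanobis_sq_2d(z, cov):
    """Squared Mahalanobis distance z^T cov^{-1} z for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return sum(z[i] * inv[i][j] * z[j] for i in range(2) for j in range(2))


z = (1.5, -1.5)                       # network axis up, AI axis down
coupled = [[1.0, 0.9], [0.9, 1.0]]    # strong positive cross-correlation in BAU
independent = [[1.0, 0.0], [0.0, 1.0]]

d2_net = z[0] ** 2 / coupled[0][0]    # 2.25, below the 1-d.f. 5% threshold 3.84
d2_ai = z[1] ** 2 / coupled[1][1]     # 2.25, below 3.84 as well
d2_joint = mahalanobis_sq_2d(z, coupled)            # 45.0, far above 5.99
d2_no_coupling = mahalanobis_sq_2d(z, independent)  # 4.5, below 5.99
```

Each axis alone looks unremarkable, and so does the pair if the cross term is ignored. Only the inverse covariance with its off-diagonal entries exposes the event.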
The two-body gravitational problem has an exact analytic solution (Kepler ellipses). Adding a third body produces trajectories that are sensitive to initial conditions, the behaviour Poincaré identified and that is now called chaos. The coupled system cannot be decomposed into the sum of its parts.
Each subsystem (network monitoring, AI monitoring) is individually well-understood. Coupling them through shared physical infrastructure produces Phase 3 events that cannot be decomposed into the sum of the two standalone monitors. The coupling constant is \(\|R_{\text{cross}}\|_F\).
Write the joint dynamics as \(d\mathbf{z} = F(\mathbf{z})\,dt + \sigma(\mathbf{z})\,dW\). Let \(\lambda_{\max}\) be the maximal Lyapunov exponent. Conjecture: \(\lambda_{\max}\) is monotonically increasing in \(\|R_{\text{cross}}\|_F\), and there exists \(\rho_{\text{chaos}}\) such that \(\lambda_{\max} > 0\) for \(\|R_{\text{cross}}\|_F > \rho_{\text{chaos}}\). This is the deepest open theoretical item in the MVPS family.
| ID | Description | Risk |
|---|---|---|
| AI9.1 | \(C_{4,\text{ALARM}}\) calibration on TruthfulQA + FactBench | Moderate |
| AI9.2 | Per-topic \(C_4\) calibration (code, medical, geography) | Moderate |
| AI9.3 | \(\Pi\) distribution construction for production prompts (semantic preservation verification) | High |
| AI9.4 | \(C_3^{CKA}\) layer selection across Llama-3, Mistral, Phi-3, Qwen-2 | Moderate |
| AI9.5 | 4x4 \(\Sigma^{-1}\) calibration window (minimum BAU for stable \(C_4\) off-diagonals) | Moderate |
| AI9.6 | Perturbation-stable hallucination: \(C_3^{CKA}\) diversity as partial proxy | High |
| B9.1 | Weiszfeld on Count-Min sketch inputs | Moderate |
| B9.2 | MCD under sketched distributions (reformulate on coherence vectors) | Low |
| B9.3 | SIR calibration on real BGP propagation traces (RouteViews / RIS) | Moderate |
| B9.4 | Optimal \(\theta_{\text{byz}}\) as function of \(N, f\) and cost ratio | Moderate |
| B9.5 | SUSPECTED_BYZANTINE as formal fifth \(\Phi_K\) value in the I-D | Low (editorial) |
| IC9.1 | Empirical measurement of \(R_{\text{cross}}\) in production deployments | Access-bound |
| IC9.2 | Transfer function calibration (\(\sigma_{\text{drift}}, W_{2,\max}\)) | Moderate |
| IC9.3 | \(\Sigma_{\text{joint}}^{-1}\) calibration window for 6-axis covariance | Moderate |
| IC9.4 | IC phase diagram on synthetic data (VPP + vLLM simulator) | Low |
| IC9.5 | HTTP/gRPC trailer carrying \((C_1^{AI}, C_2^{W_2}, C_3^{CKA}, C_4, \text{CBF})\) | Moderate |
| IC9.6 | Lyapunov exponent conjecture — deepest open theoretical item | Very high |
| IC9.7 | ACM SIGCOMM / OSDI papers for IC9.1 + IC9.2 empirical results | Access-bound |
| Tag | Citation |
|---|---|
| [MVPS-MATH] | Melegassi, L. "MVPS Three-Layer Mathematical Evidence Companion v1.1." Catellix Research, 2026. |
| [MVPS-BUNDLE] | Melegassi, L. draft-melegassi-ippm-mvps-bundle-00. IETF, 2026. datatracker.ietf.org |
| [VILLANI09] | Villani, C. "Optimal Transport: Old and New." Springer, 2009. |
| [PEYRE19] | Peyré, G. and Cuturi, M. "Computational Optimal Transport." Found. Trends ML 11(5-6), 2019. |
| [RABIN12] | Rabin, J. et al. "Wasserstein Barycenter and Its Application to Texture Mixing." SSVM 2012. |
| [KORNBLITH19] | Kornblith, S. et al. "Similarity of Neural Network Representations Revisited." ICML 2019. |
| [LOPUHAA91] | Lopuhaä, H. & Rousseeuw, P. "Breakdown Points of Affine Equivariant Estimators." Ann. Statist. 19(1), 1991. |
| [ROUSSEUW84] | Rousseeuw, P. "Least Median of Squares Regression." J. Amer. Statist. Assoc. 79, 1984. |
| [VARDIZHANG] | Vardi, Y. & Zhang, C.-H. "The multivariate L1-median and associated data depth." PNAS 97(4), 2000. |
| [POINCARE1887] | Poincaré, H. "Sur le problème des trois corps et les équations de la dynamique." Acta Mathematica 13, 1890. |
| [PINSKER64] | Pinsker, M. "Information and Information Stability of Random Variables and Processes." 1964. |
| [SHANNON48] | Shannon, C.E. "A Mathematical Theory of Communication." Bell System Technical Journal, 1948. |