============================================================================== MVPS -- Data-Plane Profile (Proposal) A companion proposal for embedding the three-layer coherence framework in programmable forwarding silicon (P4 / Tofino-class targets). Leonardo Melegassi Catellix Research Version: 0.1 (proposal -- not yet implemented or validated on hardware) Date: 2026-05-21 Status: companion to MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt v1.1 ============================================================================== Abstract. This document proposes a *data-plane profile* of the MVPS three-layer coherence framework defined in MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt (the math companion, v1.1). Where the math companion targets observability from outside the network -- composing MVPS bundles from external probes (RIPE Atlas, Catchpoint, customer endpoints, looking glasses) -- this profile targets *in-band computation* of the same coherence axes inside programmable forwarding silicon (P4 targets such as Intel Tofino-2, and software equivalents such as VPP/DPDK). The proposal preserves the axiom system of the math companion verbatim. The definitions of C_1, C_2, C_3, the operational Hamiltonian H, the Mahalanobis-based phase-distance Phi_D, and the operational phase label Phi_K are unchanged. What changes is the *type signature of a vantage*: from "an external probe vantage producing an MVPS bundle over the public Internet" to "a next-hop, queue, or port inside the same forwarding plane producing a per-tick state snapshot". Section 8 formalises this mapping. Three honest framings used throughout this profile. (a) Nothing in this document has been implemented on real hardware. It is a design proposal, grounded in standard P4 idioms (Count-Min sketches, Bloom filters, fixed-point lookup tables) and in published Tofino-2 resource budgets. A reference implementation and silicon validation are listed as open work in Section 9. 
(b) The worked example in Section 5 (PIX gray-failure incident on a Tier-1 peering edge) is *synthetic*. It is constructed from publicly-known characteristics of Tier-1 BGP behaviour, ECMP failure modes, IX peering at IX.br SP, and AWS sa-east-1 reachability, but it does not describe a specific real incident. Its purpose is to illustrate the operational gap that data-plane MVPS is meant to close. (c) This profile is a *proposal for a future companion I-D*; it is not itself an Internet-Draft submission. It exists so that reviewers of the math companion -- in particular Benoit Donnet (Universite de Liege) -- can assess whether the framework's axiomatic abstraction generalises to in-network computation without algebraic changes. That generalisation is the central claim of the framework's long-term value proposition. Companion artefacts: - MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt v1.1 (mathematical reference; this profile cites it normatively) - draft-melegassi-ippm-mvps-bundle (the data-structure I-D) - https://catellix.com/v11-evidence.html (visual evidence package, synthetic scenarios + conjecture tests) ============================================================================== 1. Problem statement: from observatory to embedded ============================================================================== The MVPS framework as currently defined operates as an *observatory*: a controller external to the production data path collects MVPS bundles from N >= 2 vantages, computes the three coherence axes (C_1, C_2, C_3), and emits a phase label Phi_K describing whether the system is in BAU, WATCH, ALARM, or CRITICAL. This pipeline is well-suited to incident reconstruction, longitudinal SLA audit, and research over public measurement platforms. It is *not* well-suited to two operational regimes that increasingly dominate carrier-grade networks: 1.1 Sub-second SLA breaches under gray failure. Modern fintech (e.g. 
real-time payment rails such as PIX in Brazil, FedNow in the US, UPI in India), low-latency trading, and 5G ultra-reliable low-latency (URLLC) workloads have SLAs measured in tens to hundreds of milliseconds. A gray failure (BGP session UP, interface counters clean, but a peer silently degrading via a snake-path or partial blackhole) can degrade these workloads for minutes before any external observatory has enough samples to detect it. Detection in the observatory regime is post-hoc by construction: the bundle has to be collected, transported off-box, decoded, computed, and then acted upon in the control plane. End-to-end this is rarely below 10-30 seconds, and frequently above. 1.2 In-network autonomous action. Operations practice has shifted from "observe and alert" to "observe and react". Autonomous load-balancers (e.g. Google Maglev-class), in-band telemetry (IOAM, INT), and programmable forwarding (P4) have made it routine for production silicon to make local routing decisions on telemetry signals at line rate. MVPS as an observatory cannot participate in this loop: its computation lives outside the data path. To be useful as a primary signal for autonomous action, the same coherence axes must be computable on-box and at line rate. The thesis of this profile is that *the same axiomatic framework covers both regimes*. The math companion's bundle B(t) is an abstraction over a finite set of vantages V_1, ..., V_N; the axioms do not constrain whether those vantages are external probers or internal forwarding objects. What changes between the observatory and the embedded profile is concrete: the source of data per vantage, the time-bucket granularity, and the implementation substrate (Python on a controller versus P4 register arrays on an ASIC). The mathematics is identical and is reused without modification. The remainder of this document specifies what those concrete changes are. ============================================================================== 2. 
Vantage transformation: from probe to next-hop ============================================================================== In the math companion (Sec. 1), an MVPS bundle B(t) is defined as a JSON object containing a list of vantages V = {V_1, ..., V_N} with N >= 2. Each V_i carries an ordered hop list H_i, an RTT vector, a geographic anchor sequence, and optional metadata. The vantage is implicitly external: it is a host that has issued traceroute-class probes and serialised the result. In the data-plane profile, a vantage is an internal forwarding object. Three concrete vantage types are defined: 2.1 Next-hop vantage (the primary case in this profile). For an Equal-Cost Multi-Path (ECMP) group of width W, a next-hop vantage V_i is the i-th next-hop of the group, observed over a tick window of width Delta_t (default Delta_t = 10 ms). The bundle B(t) at tick t is the unordered collection of W next-hop vantages, each summarising what traffic that next-hop saw, returned, or failed to return during [t, t + Delta_t). 2.2 Queue vantage. For a multi-queue port (e.g. a Strict Priority + Weighted Round Robin scheduler with K queues), V_i is the i-th queue. This is the natural vantage choice for diagnosing intra-port head-of-line blocking, microburst-induced jitter, and SLA differentiation across DSCP classes. 2.3 Port vantage. For a chassis with multiple physical ports landing on the same logical attachment circuit (e.g. LAG members, link-aggregation), V_i is the i-th port. This is the natural vantage choice for diagnosing LAG hash polarisation and per-fibre optical degradation. The choice of vantage type is per-deployment. A peering edge router will most commonly use next-hop vantages over its ECMP groups; a service edge router with strict QoS will additionally use queue vantages; a core-fabric router with many LAGs will additionally use port vantages. Multiple vantage types may coexist on the same chassis with disjoint resource pools. 
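As an illustration of the three vantage types above, a controller-side configuration model might look like the following Python sketch. The names VantageKind and VantageGroup are hypothetical and do not come from the math companion; the only constraints taken from the text are the three vantage types, the default tick of 10 ms, and the lower bound N >= 2.

```python
from dataclasses import dataclass
from enum import Enum


class VantageKind(Enum):
    """The three vantage types of Sec. 2.1-2.3 (names are illustrative)."""
    NEXT_HOP = "next-hop"   # i-th next-hop of an ECMP group
    QUEUE = "queue"         # i-th queue of a multi-queue port
    PORT = "port"           # i-th member port of a LAG


@dataclass(frozen=True)
class VantageGroup:
    """One monitored group; width plays the role of N in the bundle."""
    name: str
    kind: VantageKind
    width: int              # ECMP width, queue count, or LAG width
    tick_ms: int = 10       # Delta_t default from Sec. 2.1

    def __post_init__(self):
        # The math companion's lower bound N >= 2 is preserved.
        if self.width < 2:
            raise ValueError("a bundle needs at least two vantages (N >= 2)")

    def vantage_ids(self):
        """Vantage indices 1..N, matching V_1, ..., V_N in the text."""
        return list(range(1, self.width + 1))
```

Several groups of different kinds can be declared side by side, which mirrors the statement that multiple vantage types may coexist on one chassis.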
In all three cases the cardinality N corresponds to the width of the local resource (ECMP width, queue count, LAG width). Typical values: N in {2, 4, 8, 16}. The math companion's lower bound N >= 2 is preserved. Per-vantage state. Each vantage V_i maintains, in P4 register arrays, four observable streams during the tick: (a) Flow distribution sketch p_i. A Count-Min Sketch (CMS) over a configurable flow key (5-tuple hash by default) is updated on every packet processed by V_i. Recommended dimensions: d = 4 hash functions, w = 1024 buckets per hash, 16-bit counters. SRAM cost per vantage: 4 * 1024 * 2 bytes = 8 KiB. (b) RTT estimator rtt_i. For TCP traffic, an inline TCP-RACK-like estimator updates an exponentially-weighted RTT register from observed SEQ -> ACK round-trips. For UDP/QUIC traffic, an injected IOAM probe at a fixed cadence (e.g. 100 ms per next-hop) is used to refresh rtt_i. SRAM cost: 32-bit register + 16-bit sample-count register per vantage = 6 bytes. (c) Return-path source set S_i. A Bloom filter accumulates the source IPs of ICMP Time-Exceeded and ICMP Destination-Unreachable messages arriving on the return path bound to V_i. Recommended dimensions: m = 8192 bits, k = 5 hash functions. SRAM cost: 1 KiB per vantage. (d) Counters. Packet count, byte count, drop count, retransmit count. SRAM cost: 16 bytes per vantage. Total per-vantage SRAM budget: ~9.0 KiB. At the close of each tick (t -> t + Delta_t), the bundle B(t) is the ordered tuple B(t) = ( (p_1, rtt_1, S_1, ctr_1), (p_2, rtt_2, S_2, ctr_2), ..., (p_N, rtt_N, S_N, ctr_N) ). This is the on-chip analogue of the JSON bundle defined in the math companion's Sec. 1. No serialisation to JSON is required for the in-band computation that follows; serialisation is needed only when a bundle needs to be exfiltrated for offline analysis (which is the IOAM TLV path defined in Sec. 7). ============================================================================== 3. 
Axes restated for the data plane
==============================================================================

The three coherence axes from the math companion (Sec. 2.1-2.3) are
reproduced verbatim below, followed by a P4-friendly implementation
strategy. The definitions are unchanged; only the numerical strategy is
adapted to fixed-point ALUs and bounded register arrays.

---------------------------------------------------------------------------
3.1 C_1 -- causal coherence (Einstein bound + temporal stability)
---------------------------------------------------------------------------

Definition (math companion, Sec. 2.1, unchanged).

    C_1          = min( C_1^Einstein, C_1^tau )
    C_1^Einstein = 1 - (1/M) * sum_{(a,b) : a < b}
                       1[ rtt_a + rtt_b < 2 * d_ab / c_f ]
    C_1^tau      = exp( -H_v ),    H_v = - sum_p p * log p

Here M = C(N,2) is the number of vantage pairs.

Data-plane implementation.

Einstein term. The per-vantage RTT register rtt_i is an unsigned 32-bit
fixed-point quantity in microseconds. The per-pair bound 2 * d_ab / c_f
is a *deployment-time constant* compiled into a lookup table indexed by
(a, b). For an ECMP group of width 4 there are C(4,2) = 6 pairs; the
table is 6 * 8 bytes = 48 bytes per group. The comparison is one
subtraction plus one sign-bit test, fitting in a single P4 stage. The
Einstein term itself is 1 minus the popcount of the pair-violation bits
divided by M; both operations fit in a second stage.

Temporal-stability term. C_1^tau requires the Shannon entropy H_v over a
fingerprint distribution. Computing Shannon entropy in P4 is expensive
(no native log). The recommended profile is:

(i)  Maintain a *fingerprint occupancy histogram* H_occ over the last K
     ticks of fingerprints observed on V_i. K is a deployment
     parameter, typically 16-64.

(ii) Define a coarsened entropy proxy

         H_proxy = lookup_entropy_table[ encode(H_occ) ]

     where encode(.)
is a fixed-precision projection of H_occ onto a 1-byte index, and lookup_entropy_table is a 256-entry precomputed table mapping that index to a Q4.4 fixed-point approximation of the true Shannon entropy of the histogram. (iii) Define C_1^tau = lookup_exp_neg[ H_proxy ], a 64-entry precomputed table approximating exp(-x) for x in [0, log K]. The combined error of these two table-based approximations against the true C_1^tau is bounded by 6% in the worst case for K = 32, by spot-check against a software reference. The error is acceptable because C_1^tau enters Phi_D through Mahalanobis distance with a covariance matrix whose diagonal absorbs constant proportional errors in C_1; the *change* in C_1^tau under a regime shift is preserved with much higher fidelity than the absolute value. Output. C_1 is a Q1.10 fixed-point scalar in [0, 1]. --------------------------------------------------------------------------- 3.2 C_2 -- informational coherence (JSD on flow distributions) --------------------------------------------------------------------------- Definition (math companion, Sec. 2.2, unchanged). C_2 = 1 - JSD_norm( {p_v} ), JSD_norm = JSD( {p_v} ) / log_2( min(N, |A|) ), JSD( {p_v} ) = (1/N) sum_v KL( p_v || M ), M = (1/N) sum_v p_v. Data-plane implementation. Computing KL-divergence on Count-Min sketches in P4 is, again, prohibitive (no native log, no division). The recommended profile replaces the JSD computation with an L1-distance-on-sketches proxy that is monotonic in JSD over the ranges that matter for phase detection: (i) Compute, pairwise, the L1 distance between sketches: L1(p_a, p_b) = sum_{i, j} | CMS_a[i, j] - CMS_b[i, j] | Each pairwise L1 is a parallel reduce of (d * w) = (4 * 1024) = 4096 lanes, which fits in 2 P4 stages with the standard Tofino register-array reduce idiom. (ii) Aggregate to a scalar L1_total = (1/M) sum_pairs L1(p_a, p_b). 
(iii) Map L1_total to JSD_norm via a 1024-entry lookup table calibrated offline on representative traffic mixtures. The lookup is a single P4 stage. (iv) C_2 = 1 - JSD_norm. Calibration of the L1 -> JSD_norm table is the most operationally sensitive step of this profile and is one of the open research questions enumerated in Sec. 9. The recommended initial calibration is to fit the table on a corpus of one full week of production traffic per deployment site; the table is then loaded via P4Runtime and refreshed quarterly. Output. C_2 is a Q1.10 fixed-point scalar in [0, 1]. --------------------------------------------------------------------------- 3.3 C_3 -- topological coherence (Jaccard on return-path sets) --------------------------------------------------------------------------- Definition (math companion, Sec. 2.3, unchanged in form, restated for return-path sets). C_3 = (1 / C(N,2)) * sum_{i < j} | S_i intersect S_j | / | S_i union S_j | In the observatory profile, S_i is the directed edge set of vantage i's traceroute. In the data-plane profile, S_i is the Bloom filter of return-path sources observed on next-hop V_i during the tick. Data-plane implementation. Bloom-filter intersection and union are bitwise AND and bitwise OR over the m-bit filters; both fit in standard P4 register-array idioms. Population count (popcount) of a 8192-bit filter is an 8-stage tree-reduce on Tofino-2 (since the chip's native popcount width is limited), or a single accumulator update if popcount is maintained incrementally on insert. The latter is recommended: on every insert into S_i, the popcount counter is updated in the same stage; when computing C_3, no additional popcount sweep is needed. - count_intersect_ij = popcount( S_i AND S_j ). - count_union_ij = popcount(S_i) + popcount(S_j) - count_intersect_ij. - jaccard_ij = count_intersect_ij / count_union_ij. 
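The C_3 construction above (incremental popcount on insert, then intersection and union counts per pair) can be sketched in software as follows. This is a minimal Python model, not the P4 implementation: the filter is held as a Python integer, the k hash functions are derived from SHA-256 purely for illustration (hardware would use CRC-class hashes), and returning 1.0 for a pair of empty filters is an assumption the text does not specify. The class and function names are hypothetical.

```python
import hashlib
from itertools import combinations

M_BITS = 8192    # filter width m recommended in Sec. 2
K_HASHES = 5     # number of hash functions k recommended in Sec. 2


class BloomVantage:
    """Return-path source set S_i with an incrementally maintained popcount."""

    def __init__(self):
        self.bits = 0        # m-bit filter held as a Python int
        self.popcount = 0    # updated on insert, so no sweep is needed later

    def insert(self, src_ip):
        for seed in range(K_HASHES):
            # Illustrative hash family: SHA-256 over "seed:ip" (an assumption).
            digest = hashlib.sha256(f"{seed}:{src_ip}".encode()).digest()
            mask = 1 << (int.from_bytes(digest[:4], "big") % M_BITS)
            if not self.bits & mask:
                self.bits |= mask
                self.popcount += 1


def jaccard(a, b):
    """Pairwise Jaccard estimate from filter bit counts."""
    count_intersect = bin(a.bits & b.bits).count("1")
    count_union = a.popcount + b.popcount - count_intersect
    # Assumption: two empty filters are treated as fully coherent.
    return count_intersect / count_union if count_union else 1.0


def c3(vantages):
    """C_3 = mean pairwise Jaccard over all C(N,2) vantage pairs."""
    pairs = list(combinations(vantages, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Note that the ratio is taken over set bits, not over distinct sources, so it inherits the Bloom filter's false-positive behaviour.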
Division in P4 is implemented via a Newton-Raphson approximator with two
iterations or, more commonly, via a 1024-entry reciprocal-lookup table.
Either fits in 2 P4 stages.

Output. C_3 is a Q1.10 fixed-point scalar in [0, 1].

---------------------------------------------------------------------------
3.4 H, Phi_D, Phi_K in the data plane
---------------------------------------------------------------------------

Definition (math companion, Sec. 2.4 and Sec. 4, unchanged).

    H(t)     = -log( C_1(t) * C_2(t) * C_3(t) )
    D^2(t)   = (x(t) - mu)^T Sigma^{-1} (x(t) - mu),  x = (C_1, C_2, C_3)
    Phi_D(t) = exp( -D^2(t) / k ),  k = 6.25
    Phi_K(t) in {BAU, WATCH, ALARM, CRITICAL}, indexed by D^2(t)
             thresholds (4.33, 7.81, 11.34).

Data-plane implementation.

H. The product C_1 * C_2 * C_3 is two fixed-point multiplications
(Q1.10 * Q1.10 -> Q2.20, truncated back to Q1.10). The negative
logarithm is a 1024-entry lookup table mapping Q1.10 -> Q4.6. Total
cost: 3 P4 stages.

Phi_D. The covariance inverse Sigma^{-1} is a 3 x 3 symmetric matrix
with 6 unique entries. Sigma^{-1} is *not* computed on the data plane;
it is computed in the control plane on a sliding window of past ticks
(typically 30 seconds of BAU samples) and written to the data plane via
P4Runtime. The data-plane computation of D^2 is then 9 fixed-point
multiplications and 6 fixed-point additions: 3 stages including the
final Q4.6 result. Phi_D = exp(-D^2 / k) is a 1024-entry lookup table
mapping Q4.6 -> Q1.10. One stage.

Phi_K. Phi_K is a TCAM ternary match on D^2 against the three
thresholds. One stage.

Total stage budget for Section 3.

    C_1:               ~3 stages
    C_2:               ~3 stages
    C_3:               ~3 stages
    H, Phi_D, Phi_K:   ~5 stages
    ----------------------------
    Subtotal:         ~14 stages

Tofino-2 has 20 stages per pipeline. The remaining 6 stages are retained
for parsing, forwarding, ACL, and IOAM trace insertion. The MVPS
data-plane computation is therefore feasible *as a secondary pipeline
pass on egress* without displacing forwarding logic.
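The Sec. 3.4 definitions of H, D^2, Phi_D and Phi_K can be checked in floating point with a short Python sketch such as the one below. It is a software reference under stated assumptions, not the fixed-point pipeline: the logarithm is taken as natural, a threshold counts as crossed when D^2 >= its value (the text does not say whether the boundary is inclusive), and the function names are hypothetical. mu and sigma_inv are whatever the control plane has calibrated.

```python
import math

K_SCALE = 6.25                      # k in Phi_D = exp(-D^2 / k)
THRESHOLDS = (4.33, 7.81, 11.34)    # D^2 boundaries: WATCH, ALARM, CRITICAL
LABELS = ("BAU", "WATCH", "ALARM", "CRITICAL")


def hamiltonian(c1, c2, c3):
    """H = -log(C_1 * C_2 * C_3), natural logarithm assumed."""
    return -math.log(c1 * c2 * c3)


def mahalanobis_sq(x, mu, sigma_inv):
    """D^2 = (x - mu)^T Sigma^{-1} (x - mu) for 3-vectors."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d[i] * sigma_inv[i][j] * d[j]
               for i in range(3) for j in range(3))


def phase(x, mu, sigma_inv):
    """Return (H, D^2, Phi_D, Phi_K) for one tick's coherence vector x."""
    d2 = mahalanobis_sq(x, mu, sigma_inv)
    phi_d = math.exp(-d2 / K_SCALE)
    # Assumption: a threshold is crossed when D^2 >= its value.
    phi_k = LABELS[sum(d2 >= t for t in THRESHOLDS)]
    return hamiltonian(*x), d2, phi_d, phi_k
```

A reference of this kind is also a natural baseline for measuring the error introduced by the lookup-table approximations.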
On software targets (VPP/DPDK) the stage count is not a binding constraint; per-packet cost dominates instead, and is acceptable at line rates up to 100 Gbps on commodity x86 with AVX-512. ============================================================================== 4. Phase detection and autonomous action ============================================================================== 4.1 Tick boundary and bundle close-out. At every tick boundary t -> t + Delta_t, the data plane: (i) reads the per-vantage state ( p_i, rtt_i, S_i, ctr_i ); (ii) computes ( C_1(t), C_2(t), C_3(t) ) by Sec. 3 above; (iii) computes ( H(t), D^2(t), Phi_D(t), Phi_K(t) ); (iv) atomically swaps to a fresh per-vantage state for tick t+1 (double-buffered registers; no copy required, only a pointer flip). The total computation completes within 1-2 milliseconds of the tick boundary at line rate, well within the 10 ms tick window. 4.2 Action policy. Phi_K is the actionable signal. The recommended action policy follows the math companion's Sec. 4 thresholds: BAU : no action. WATCH : flag the vantage(s) responsible (the pair (a, b) with the largest contribution to D^2) in the IOAM TLV (Sec. 7); export an event to the control plane via Packet-In; do not change forwarding behaviour. ALARM : in addition, *de-prefer* the responsible vantage(s) from ECMP/queue/LAG selection by setting the vantage's selection weight to a low non-zero value (e.g. 1 of 256). This biases new flows away while not stranding existing flows. CRITICAL : in addition, set the responsible vantage's selection weight to 0. This drains the vantage entirely. Existing flows are re-hashed onto remaining vantages on their next packet. The action is implemented as a P4Runtime table update to the next-hop / queue / port selection table. The update is initiated by the control plane in response to the Packet-In event; the control plane is in the loop for all weight changes, but is *not* in the loop for detection. 
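The action policy of Sec. 4.2 can be written as a pure function that a control plane might evaluate on receipt of a Packet-In. The Python sketch below is illustrative only: the P4Runtime table write itself is omitted, the function name and the returned tuple are hypothetical, and the full weight of 256 is an assumption inferred from the "1 of 256" example.

```python
FULL_WEIGHT = 256    # assumption: selection weights are expressed out of 256
ALARM_WEIGHT = 1     # "1 of 256": low but non-zero, biases new flows away


def action_for(phi_k, current_weight=FULL_WEIGHT):
    """Map a phase label to (new_weight, flag_in_tlv, send_packet_in)."""
    if phi_k == "BAU":
        return current_weight, False, False      # no action
    if phi_k == "WATCH":
        return current_weight, True, True        # flag and export only
    if phi_k == "ALARM":
        # De-prefer; never raise the weight of an already drained vantage.
        return min(current_weight, ALARM_WEIGHT), True, True
    if phi_k == "CRITICAL":
        return 0, True, True                     # drain entirely
    raise ValueError(f"unknown phase label: {phi_k!r}")
```

Keeping the mapping free of side effects makes it straightforward to replay recorded Phi_K sequences against it before any weight change reaches the forwarding plane.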
End-to-end detect-and-react latency (gray failure onset -> drained
vantage) is dominated by the Packet-In round-trip: 100-500 ms on typical
deployments.

4.3 Hysteresis and false-alarm suppression.

Phi_K transitions are gated by a hysteresis band:

  - WATCH -> ALARM requires Phi_K = WATCH for at least 3 consecutive
    ticks AND D^2 trending upward.
  - ALARM -> CRITICAL requires Phi_K = ALARM for at least 5 consecutive
    ticks AND D^2 above the CRITICAL threshold for at least 2
    consecutive ticks.
  - Stepping down from any state requires D^2 below that state's lower
    threshold for at least 10 consecutive ticks.

The hysteresis is implemented via a small per-vantage state machine with
a transition counter; SRAM cost is 4 bytes per vantage. The hysteresis
parameters are configurable via P4Runtime.

The combination of (a) Mahalanobis-based detection, (b) multi-tick
consecutive confirmation, and (c) downstream control-plane review of
weight changes is intended to keep the false-alarm rate low enough to
operate without operator-in-the-loop confirmation, but the practical
false-alarm rate must be measured per deployment, both on synthetic load
and on traces captured in production. This measurement is one of the
open work items in Sec. 9.

==============================================================================
5. Worked example: gray failure on a Tier-1 peering edge
==============================================================================

Scenario summary (synthetic).

    Operator    : Tier-1 ISP, AS28xxx, peering at IX.br SP.
    Edge router : Tofino-2 with custom P4, MVPS data-plane profile
                  deployed on the AWS sa-east-1 ECMP group.
    ECMP group  : width N = 4, peers
                      V_1 = NTT (AS2914)
                      V_2 = Cogent (AS174)
                      V_3 = Telxius (AS12956)
                      V_4 = Lumen (AS3356)
                  Each peer announces 16.182.0.0/16 (AWS sa-east-1).
    Customer    : a Brazilian fintech with PIX (real-time payments)
                  workload; SLA target p99 < 120 ms one-way.
    Tick        : Delta_t = 10 ms.

Failure onset.
At t = 14:32:00.000 UTC, AS174 (Cogent) silently re-converges its MPLS
LSP to AWS sa-east-1 via a snake path Miami -> Ashburn -> Sao Paulo. The
re-convergence is caused by an internal IGP flap inside AS174; from the
IX.br SP edge, the BGP session is unaffected (HOLD timers do not fire),
the next-hop is unchanged, and the interface counters are clean. RTT to
the AWS PoP via Cogent rises from 4 ms to 124 ms; via the other three
peers it stays at 4-6 ms.

What MVPS embedded sees, tick by tick.

t = 14:32:00.010 (1st tick after onset)

    RTT vector      (V_1, V_2, V_3, V_4) = (4.1, 124.3, 5.0, 4.6) ms
    Pair (V_1, V_2) rtt_1 + rtt_2 = 128.4 ms
                    2 d_12 / c_f  = 0.31 ms
                    (NTT and Cogent share the same IX in Sao Paulo;
                    great-circle distance ~ 0)
    Verdict         severe Einstein violation. The same holds for
                    (V_2, V_3) and (V_2, V_4), so 3 of the M = 6 pairs
                    are flagged and C_1^Einstein drops from 1.000 to
                    0.500 within one tick.
    Effect on C_1   C_1 falls to ~0.50 immediately.

t = 14:32:00.020 .. 14:32:00.150 (~14 ticks)

    TCP retransmits begin on flows whose hash maps to V_2. The fintech
    client opens new connections; these are re-hashed by ECMP and a
    fraction lands again on V_2. The CMS sketches of V_1, V_3, V_4 stay
    close to one another (their flow populations are statistically
    equivalent); the CMS sketch of V_2 begins to diverge as flows give
    up retrying on it. L1_total rises monotonically; the L1 -> JSD_norm
    lookup produces JSD_norm rising from ~0.05 (BAU) to ~0.55.

    Effect on C_2   C_2 falls from ~0.95 to ~0.45.

    The Cogent snake path traverses transit nodes in Miami and Ashburn
    that none of the other three peers ever touch. ICMP Time-Exceeded
    sources observed on the V_2 return path begin to populate
    Bloom-filter cells that V_1, V_3, V_4 never populate. Pairwise
    Jaccard between V_2 and the others falls from ~0.85 (BAU) to ~0.20.

    Effect on C_3   C_3 falls from ~0.85 to ~0.50.

t = 14:32:00.150 (15th tick)

    ( C_1, C_2, C_3 ) = (0.50, 0.45, 0.50)
    H   = -log(0.50 * 0.45 * 0.50) ~= 2.18
    D^2 ~= 8.4 against a Sigma^{-1} calibrated on the prior 30 s of BAU
    samples.
Phi_K transitions BAU -> WATCH; the Packet-In carries the identity of V_2 as the dominant contributor to D^2. t = 14:32:00.400 (40th tick, 400 ms after onset) D^2 has stayed above 11.34 for 5 consecutive ticks. Phi_K transitions WATCH -> ALARM -> CRITICAL through hysteresis. The control plane, having received a continuous stream of Packet-In events naming V_2, issues a P4Runtime update setting weight(V_2) = 0 in the ECMP selection table. t = 14:32:00.500 (50th tick, 500 ms after onset) New flows are no longer hashed onto V_2. Existing flows that re-hash on retransmit migrate to V_1, V_3, V_4. RTT distribution returns to BAU. Phi_K returns to WATCH within ~1 second and to BAU within ~10 seconds. Outcome comparison. Without MVPS embedded (today) - Operator alerted at ~22 minutes by customer ticket. - Manual diagnosis (mtr loop + traceroute correlation) at ~30 minutes. - Manual ECMP drain at ~35 minutes. - SLA breach: ~12 million PIX transactions degraded. - Customer trust: damaged. With MVPS embedded (this profile) - Detection at ~150 ms. - Drain at ~500 ms. - SLA breach: <100,000 transactions briefly retransmitted (TCP-level retries succeed within ~200 ms via remaining three peers); fintech p99 latency held under SLA. - Operator informed by IOAM telemetry stream; the incident appears in the post-mortem dashboard but does not page on call. - Customer trust: not affected. Caveat. This worked example is synthetic. The numerics for D^2, Phi_K transition timing, and SLA outcome are constructed by hand from a software simulation of the data-plane profile. They are not measurements from a deployed Tofino. The example illustrates the *operational gap* that data-plane MVPS is designed to close; quantitative validation against real hardware on real production traffic is open work (Sec. 9, Item D9.1). ============================================================================== 6. 
Hardware resource budget
==============================================================================

6.1 Per-vantage SRAM.

    Count-Min sketch p_i : 4 hashes x 1024 buckets x 2 B  =  8.0 KiB
    RTT estimator rtt_i  : 32-bit value + 16-bit count    =  6 B
    Bloom filter S_i     : 8192 bits                      =  1.0 KiB
    Counters ctr_i       : 4 x 4-byte counters            =  16 B
    Hysteresis state     : transition counter             =  4 B
                                                            ----------
                                                            ~9.0 KiB

6.2 Per-group SRAM (ECMP group of width N = 4).

    Per-vantage state x 4              ~ 36 KiB
    Pairwise distance LUT (2 d / c)    ~ 48 B
    L1 -> JSD_norm LUT (1024 entries)  ~ 2 KiB
    exp(-x/k) LUT (1024 entries)       ~ 2 KiB  (shared across groups)
    -log(x) LUT (1024 entries)         ~ 2 KiB  (shared across groups)
    Sigma^{-1} (3x3 fixed-point)       ~ 36 B   (per group; per Sec. 3.4)
                                       --------
                                       ~38 KiB per group
                                       (+ ~6 KiB shared LUTs)

6.3 Total SRAM for 1024 ECMP groups.

    1024 groups * 38 KiB    ~ 38 MiB
    + shared LUTs           ~ 6 KiB
    --------------------------------
    Total                   ~ 38 MiB

Tofino-2 ships with ~30 MiB of SRAM total in 20 stages. 38 MiB is ~27%
over budget for full coverage of 1024 groups. Practical deployment
options:

  - Top-N coverage. By Pareto, 80% of the carrier-grade traffic volume
    in a Tier-1 edge typically transits the top 100-200 ECMP groups.
    Covering only those drops the SRAM cost to ~4-8 MiB and is the
    recommended starting point.

  - Reduced sketch dimensions. CMS at 4 x 512 x 16-bit (4 KiB) and
    Bloom at 4096 bits (0.5 KiB) cuts the per-vantage cost to ~5 KiB
    and the 1024-group cost to ~21 MiB, fitting Tofino-2.

  - Hybrid software / hardware. The 1024-group fully-dimensioned case
    fits comfortably on a software target (VPP/DPDK on commodity x86
    with 64+ GiB RAM) and is the recommended reference profile for
    early validation.

6.4 Per-tick stage budget.

    Bundle close-out + C_1, C_2, C_3 + H, Phi_D, Phi_K   ~ 14 stages
    Forwarding + ACL + IOAM TLV insert                   ~  6 stages
                                                         -----------
    Total                                                  20 stages
                                                     (Tofino-2 max).

6.5 Control-plane bandwidth.
Sigma^{-1} updates: 6 unique entries x 4 bytes = 24 bytes per group,
every 30 seconds = 0.8 bytes per second (6.4 bits per second) per group.
Negligible.

Phi_K event Packet-Ins: peak rate during a transition <100 events per
group per second. For 1024 covered groups in the worst case, this is
~100k events/s -- well within standard Tofino-2 control-plane gRPC
capacity (~1M events/s).

6.6 Datapath latency overhead.

The MVPS computation runs on egress, in parallel with packet forwarding,
on a sampled fraction of traffic (default: 1 in every 16 packets per
vantage feeds the per-vantage state). The per-packet forwarding latency
is *unchanged*. The per-tick computation latency (10 ms tick, ~1-2 ms
compute) is hidden inside the tick window.

==============================================================================
7. In-band telemetry: the IOAM TLV
==============================================================================

7.1 Motivation.

Operations teams need to be able to see what the data plane has decided,
in real time, without polling the data plane or relying on Packet-In
throttling. The recommended mechanism is to emit the per-tick coherence
vector and phase label as an IOAM Trace Option TLV (RFC 9197) inserted
on a sampled fraction of egress packets. Carrier-grade collectors that
already consume IOAM (Cisco DNA, Juniper Mist, Arista CloudVision,
open-source InfluxDB-based stacks) can ingest the MVPS TLV without
protocol changes.

7.2 TLV layout (proposed; subject to IANA registration).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | TLV-Type (TBD)|   Length=12   |   Vantage-Group-Id (16-bit)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  C_1 (Q1.10)  |  C_2 (Q1.10)  |  C_3 (Q1.10)  |   reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Phi_D (Q1.10) | Phi_K (8-bit) |           reserved            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       tick_id (32-bit)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Total: 12 bytes.

Field semantics.

    TLV-Type         : to be assigned by IANA from the IOAM Trace-Type
                       registry.
    Length           : fixed at 12 bytes for this revision.
    Vantage-Group-Id : opaque 16-bit identifier for the ECMP / queue /
                       port group in question. Mapped by the operator
                       to a meaningful name (e.g. "AWS sa-east-1 IX.br
                       SP") via control-plane configuration.
    C_1, C_2, C_3    : Q1.10 fixed-point coherence values in [0, 1].
    Phi_D            : Q1.10 fixed-point phase distance, exp-weighted
                       Mahalanobis distance.
    Phi_K            : 8-bit enum:
                           0 = BAU
                           1 = WATCH
                           2 = ALARM
                           3 = CRITICAL
                           (4-255 reserved)
    tick_id          : monotonic 32-bit tick counter; wraps every ~497
                       days at 10 ms tick.

7.3 Sampling.

The TLV is inserted on 1 packet in every M egress packets per group,
with M configurable per group (default M = 64). At 100 Gbps line rate
per group with 1500-byte packets (~8.3 Mpps), this yields ~130k
annotated packets per second per group (more at smaller packet sizes),
sufficient for sub-second dashboarding without significant
header-overhead amplification (12 bytes / 1500-byte packet ~= 0.8%
overhead on annotated packets, ~0.012% averaged across the group).

7.4 Privacy and operational considerations.

The MVPS TLV exposes operationally-sensitive information about the
internal state of the forwarding plane to any party that can observe the
egress packet.
Standard IOAM hygiene applies: the TLV MUST be stripped at trust-domain boundaries (typically the administrative AS edge) and SHOULD be encrypted or authenticated when crossing trust domains internally. RFC 9322 (IOAM Deployment Considerations) gives the canonical guidance. ============================================================================== 8. Formal mapping back to v1.1: Poincare's "art" ============================================================================== The central claim of this profile is that the algebraic structure of the math companion (v1.1) is preserved under the substitution of vantage type. This section makes that claim formal. 8.1 The bundle as an algebraic object. In the math companion (Sec. 1), an MVPS bundle is an element of B := V^N where V is the type of vantage (a record over hop list, RTT vector, geographic anchor, ASN, optional metadata) and N >= 2. The coherence axes are defined as functions C_1, C_2, C_3 : B -> [0, 1]. The Hamiltonian is a function H : [0, 1]^3 -> [0, infinity) with H(c_1, c_2, c_3) = -log(c_1 * c_2 * c_3). The phase label is a function Phi_K : [0, infinity) -> {BAU, WATCH, ALARM, CRITICAL}. 8.2 The substitution. The data-plane profile defines a new vantage type V' (Sec. 2) whose record is V' := ( CountMin x RttEstimator x BloomFilter x Counters ). The coherence axes are reimplemented as functions C_1', C_2', C_3' : (V')^N -> [0, 1] by the constructions in Sec. 3.1 - 3.3. These constructions differ from the math companion's constructions only in implementation substrate (fixed-point lookup tables, Count-Min sketches, Bloom filters); they share the same input-output specification: - C_1'(b') agrees with C_1(b) up to fixed-point quantisation and table approximation (bounded error ~6%) when b' encodes the same vantage observations as b. 
  - C_2'(b') agrees with C_2(b) up to the L1 -> JSD_norm table
    approximation (bounded error ~5% in the JSD ranges that matter for
    phase detection, after deployment-time calibration).
  - C_3'(b') agrees with C_3(b) up to the Bloom-filter false-positive
    rate (bounded by deployment-time choice of m and k, default <2%).

H, Phi_D, Phi_K are unchanged: they are defined on (C_1, C_2, C_3) in
[0, 1]^3 and do not care whether those values were computed from a JSON
bundle or from on-chip sketches.

8.3 What this means.

The framework's value proposition does not depend on the vantage being
external, internal, geographic, optical, virtual, or any other concrete
instantiation. As long as a candidate vantage type V'' admits (i) a notion
of pairwise causal compatibility (for C_1), (ii) a notion of empirical
flow distribution (for C_2), and (iii) a notion of return-path / topology
set (for C_3), the same axiomatic framework applies and the same Phi_K
phase label is produced.

This is the precise sense in which Poincare's maxim -- "the art of giving
the same name to different things" -- describes what the framework does.

8.4 Other vantage types this framework already covers (without algebraic
change).

  - 5G UPF instances across network slices.
  - Inter-satellite-link neighbours in a low-Earth-orbit mesh
    (Starlink-class).
  - Optical fibre pairs landing on a submarine cable shore station.
  - Replicas of an anycast service (DNS root, CDN edge).
  - Threads in a software dataplane (VPP, DPDK).
  - Virtual interfaces in a Kubernetes CNI mesh.

Each of these is a deployment study. The mathematics is reused verbatim.
This profile (P4 next-hop vantages on a peering edge) is the simplest
first step in that catalogue.

==============================================================================
9. Open questions and validation roadmap
==============================================================================

The following items must be resolved before this profile can be promoted
to a full Internet-Draft submission. Each is presented as a numbered open
work item D9.x referenced from the body above.

D9.1 Reference P4 implementation.

  Status : not started.
  Scope  : a complete P4_16 reference implementation of the Sec. 3 axes
           targeting Tofino-2 SDE 9.x. Ships with a software simulator
           (bmv2) for CI.
  Risk   : moderate. The P4 idioms used are standard; the main risk is
           exceeding the stage budget of 14 stages for the MVPS pipeline
           once forwarding logic is integrated. The Sec. 6.4 budget
           assumes forwarding takes ~6 stages; some carrier-grade
           deployments use up to 12 stages for forwarding alone, which
           would force the MVPS pipeline onto a second pass or onto a
           separate Tofino pipe.

D9.2 Hardware bench validation.

  Status : not started.
  Scope  : end-to-end bench with two Tofino-2 chassis, traffic generator,
           and an injected gray-failure fault. Measure detection latency,
           false-alarm rate, and resource utilisation against the budget
           in Sec. 6.
  Risk   : access-bound, not technical. Tofino-2 bench hardware is not in
           the catellix.com lab today.

D9.3 Calibration of L1 -> JSD_norm lookup.

  Status : conceptual.
  Scope  : empirically fit the lookup table on at least three production
           sites (a peering edge, a service edge, a metro core) over at
           least one week each, and characterise the residual error
           against software-computed JSD_norm.
  Risk   : low technically; depends on operator data access agreements.

D9.4 Sigma^{-1} drift and recalibration cadence.

  Status : conceptual.
  Scope  : characterise how fast Sigma^{-1} drifts under normal diurnal
           traffic patterns, and choose a recalibration cadence that
           minimises false alarms without missing real events. The
           recommended starting cadence (30 seconds) is a first-order
           guess.
  Risk   : low; this is a standard observability problem.
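
One possible way to quantify the drift named in D9.4 is to re-estimate
Sigma over consecutive windows of (C_1, C_2, C_3) samples and compare the
inverses. The sketch below is illustrative only and is not part of the
profile: the window contents, the relative Frobenius-norm metric, the
`ridge` regulariser, and the function names `covariance`, `inverse_3x3`
and `relative_drift` are all assumptions made for this example.

```python
import math

def covariance(samples):
    """Sample covariance (3x3) of a list of (c1, c2, c3) tuples."""
    n = len(samples)
    mean = [sum(s[i] for s in samples) / n for i in range(3)]
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(3)] for i in range(3)]

def inverse_3x3(m, ridge=1e-6):
    """Invert a 3x3 matrix via its adjugate.

    `ridge` is a hypothetical regulariser added to the diagonal so that a
    near-singular Sigma (e.g. one axis pinned at 1.0) stays invertible.
    """
    a = [[m[i][j] + (ridge if i == j else 0.0) for j in range(3)]
         for i in range(3)]
    # Signed cofactors, using cyclic index arithmetic.
    cof = [[a[(i + 1) % 3][(j + 1) % 3] * a[(i + 2) % 3][(j + 2) % 3]
            - a[(i + 1) % 3][(j + 2) % 3] * a[(i + 2) % 3][(j + 1) % 3]
            for j in range(3)] for i in range(3)]
    det = sum(a[0][j] * cof[0][j] for j in range(3))
    # Inverse = transpose of the cofactor matrix divided by the determinant.
    return [[cof[j][i] / det for j in range(3)] for i in range(3)]

def frobenius(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))

def relative_drift(prev_inv, curr_inv):
    """Relative change between two Sigma^{-1} estimates (0.0 = no drift)."""
    diff = [[curr_inv[i][j] - prev_inv[i][j] for j in range(3)]
            for i in range(3)]
    return frobenius(diff) / frobenius(prev_inv)

# Two hypothetical calibration windows; in window_b the spread of C_1 is
# wider, standing in for a change in traffic pattern.
window_a = [(0.90 + 0.010 * (k % 5), 0.80 + 0.02 * (k % 3), 0.95 - 0.01 * (k % 4))
            for k in range(60)]
window_b = [(0.88 + 0.015 * (k % 5), 0.80 + 0.02 * (k % 3), 0.95 - 0.01 * (k % 4))
            for k in range(60)]

drift = relative_drift(inverse_3x3(covariance(window_a)),
                       inverse_3x3(covariance(window_b)))
print(f"relative drift of Sigma^-1 between windows: {drift:.3f}")
```

Tracking this value across a day of windows would show how quickly the
estimate moves. A recalibration cadence could then be chosen so that the
drift between recalibrations stays below an operator-chosen threshold.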

D9.5 IOAM TLV registration and interop.

  Status : not started.
  Scope  : IANA registration of the TLV-Type, alignment with the IETF
           IOAM working group on TLV semantics, and interop test against
           at least two third-party IOAM collectors.
  Risk   : process-bound; technical risk negligible.

D9.6 Comparative evaluation against existing dataplane signals.

  Status : not started.
  Scope  : measure detection latency and false-alarm rate of embedded
           MVPS against existing per-flow dataplane signals (Linux RACK,
           P4-based microburst detectors, BFD, S-BFD, IETF SAVNET
           telemetry) on the same fault catalogue.
  Risk   : low; this is the academic-publication track.

D9.7 Conjecture-T1 invariance under hardware quantisation.

  Status : conceptual.
  Scope  : the math companion's Conjecture T1 (det(Sigma) invariance
           under equilibrium) is stated for idealised real-valued C_i.
           Verify whether it holds, approximately, when C_i are Q1.10
           fixed-point and the sketches introduce deployment-time bias.
           If not, characterise the bias and add a v1.2 erratum.
  Risk   : moderate. This is the most theoretically interesting open
           item and is a natural thesis-chapter problem.

D9.8 Companion I-D draft.

  Status : this document is the seed.
  Scope  : convert this profile to RFC 7322 I-D format, align section
           numbering with IETF style, and submit as
           draft-melegassi-ippm-mvps-dataplane-00.
  Risk   : low.

==============================================================================
10. References
==============================================================================

Normative.

  [MVPS-MATH]   Melegassi, L. "MVPS -- Three-Layer Mathematical
                Structure". Catellix Research, v1.1, 2026-05-20.
                Available at:
                https://catellix.com/static/download/MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt

  [MVPS-BUNDLE] Melegassi, L. "The MVPS Bundle".
                draft-melegassi-ippm-mvps-bundle, work in progress.

  [RFC9197]     Brockners, F. et al. "Data Fields for In Situ
                Operations, Administration, and Maintenance (IOAM)".
                RFC 9197, May 2022.

  [RFC9378]     Brockners, F. et al. "In Situ Operations,
                Administration, and Maintenance (IOAM) Deployment".
                RFC 9378, April 2023.

Informative.

  [P4_16]       P4 Language Consortium. "P4_16 Language Specification,
                v1.2.4". 2023.

  [TOFINO2]     Intel Corporation. "Intel Tofino-2 Native Architecture
                (TNA) Reference Manual". 2022.

  [IOAM-INT]    Bhandari, S. et al. "Inband Network Telemetry (INT)
                Specification, v2.1". P4 Applications Working Group,
                2020.

  [LIN1991]     Lin, J. "Divergence Measures Based on the Shannon
                Entropy". IEEE Trans. Inf. Theory, 37(1):145-151, 1991.

  [POINCARE]    Poincare, H. "Science et Methode". Flammarion, Paris,
                1908. ("L'art de donner le meme nom a des choses
                differentes...")

  [SCHEFFER]    Scheffer, M. et al. "Early-warning signals for critical
                transitions". Nature, 461:53-59, 2009.

  [CMS-COR]     Cormode, G. and Muthukrishnan, S. "An improved data
                stream summary: the count-min sketch and its
                applications". J. Algorithms, 55(1):58-75, 2005.

  [BLOOM1970]   Bloom, B. H. "Space/time trade-offs in hash coding with
                allowable errors". Communications of the ACM,
                13(7):422-426, 1970.

==============================================================================
Document history
==============================================================================

  v0.1  2026-05-21  Initial draft. Companion to
                    MVPS_THREE_LAYER_MATHEMATICAL_EVIDENCE.txt v1.1.
                    Status: proposal, not implemented.
                    Authors: L. Melegassi (Catellix Research).

==============================================================================
End of document
==============================================================================