For: Chris · CTO · Hypernym
2026-05-11 · drafted from R19 synthesis · validation probe running concurrently
Concrete engineering plan to compound 3× MTP with 3× Modulum on Gemma 4 Q4_K_M, targeting ≥8× decode speedup at 128k context with ≤2pp BABILong qa1 regression. Built as a compilation of open-source primitives (llama.cpp, vLLM, transformers, BABILong runner) wrapped around Hypernym's patented M5 sparsity + substrate-policy IP. Most of the build is integration; the proprietary part is the substrate-policy logic that sits between the primitives.
Modulum produces ONE shared attention mask M_shared from the query. This shared mask is valid for proximate MTP heads (k=1, k=2), where attention geometry is approximately invariant under output-position shifts. For deeper or uncertain MTP heads (k=4, k=8), a small horizon-specific delta mask ΔM_k fills in where the geometry shifts matter. M_shared is computed once per verifier pass; deltas are cheap per-head additions applied only when needed.
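A minimal sketch of that composition logic, assuming boolean masks over KV-cache positions and an OR-merge for the deltas; `m5_shared_mask` and `delta_mask` are illustrative stand-ins for the proprietary generators, not the real Hypernym API:

```python
import numpy as np

PROXIMATE_HORIZONS = {1, 2}  # attention geometry ~invariant: reuse M_shared as-is

def build_head_masks(query, kv_len, horizons, m5_shared_mask, delta_mask):
    """Compute M_shared once per verifier pass, then derive per-head masks.

    m5_shared_mask(query, kv_len) -> np.ndarray[bool]  # stand-in for the M5 generator
    delta_mask(query, kv_len, k)  -> np.ndarray[bool]  # stand-in for the ΔM_k generator
    """
    m_shared = m5_shared_mask(query, kv_len)           # computed ONCE
    masks = {}
    for k in horizons:
        if k in PROXIMATE_HORIZONS:
            masks[k] = m_shared                        # shared: zero extra cost
        else:
            # deeper horizons (k=4, k=8): OR-merge a horizon-specific patch
            # over positions where the attention geometry shifts
            masks[k] = m_shared | delta_mask(query, kv_len, k)
    return masks
```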
Compound math, clean:
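A sketch, with ε as my shorthand for end-to-end efficiency losses (draft-acceptance misses plus mask overhead) — an assumption, not a measured quantity:

$$ s_{\text{compound}} \approx s_{\text{MTP}} \times s_{\text{Modulum}} \times \varepsilon = 3 \times 3 \times \varepsilon = 9\,\varepsilon $$

The ≥8× target requires ε ≥ 8/9 ≈ 0.89. The inference-time-only N=2 fallback compounds to 2 × 3 = 6× before losses (see the no-JV note at the end).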
The shared-vs-per-head debate is the wrong frame. It's a substrate-overlap problem. Most output positions have approximately the same query-relevant attention pattern; that pattern can be computed once and shared. Only positions where the pattern materially differs need delta corrections. The delta mask cost is bounded by how often the substrate-overlap fails — a measurable quantity, not a theoretical one.
Hypernym's M5 sparsity logic + substrate-policy training methodology are the proprietary IP. Everything else is composable from existing OSS.
| Component | OSS / Proprietary | Source / Role |
|---|---|---|
| Base model weights (Gemma 4 31B) | OSS | Google open-weights · google/gemma-4-31b-it |
| Quantization (Q4_K_M) | OSS | llama.cpp standard quantization · already deployed in live API |
| Inference runtime (llama.cpp / Tundra) | OSS | llama.cpp; current Modulum API runs Tundra backend (the 503 we saw was here) |
| Speculative decoding scaffold | OSS | llama.cpp draft_n · vLLM speculative decoding · TensorRT-LLM |
| M5 sparsity mask generator | PROPRIETARY | Hypernym M5 patent · core IP · the function that takes query → M_shared |
| Substrate policy registry | PROPRIETARY | Hypernym Modulum patent · per-domain policy bundle; tradeable artifact (R19 Marketplace outlier) |
| Delta mask generator ΔM_k | PROPRIETARY | Hypernym R19 R2 architectural contribution · horizon-specific corrections to M_shared |
| Attention kernel modification | Hybrid | Modified llama.cpp kernel to accept Modulum mask as input; modification is OSS-pluggable but Modulum mask passing is Hypernym |
| Benchmark runner (BABILong) | OSS | github.com/booydar/babilong · standard public benchmark |
| Frontier-comparison data | Open | WenTuoAI MRCR v2 measurements · Awesome Agents leaderboard · published frontier results |
| HyperRemember persistent memory | PROPRIETARY | Hypernym own API · context grows / scales smarter over time |
| Retention Receipt API | PROPRIETARY | Hypernym original product · R18/R19 panel convergent |
Of the 12 components, only 5 are proprietary — and they're all tied to existing Hypernym patents (M5 + Modulum) plus the R19 R2 Substrate Delta Mask architectural contribution. Six are composable open-source primitives, and one (the attention kernel) is a hybrid. This is a compilation problem, not a from-scratch invention problem. Most of the build effort is integration glue; the IP is concentrated in ~3 algorithm modules.
Phases are sequential where dependencies exist: Phases 2-3 feed Phase 4, and Phase 6 wraps Phase 4's output. Phase 5 (reliability) has no code dependencies and can run in parallel from Day 1 — the validation probe below shows it must.
Phase 1: Reproduce the published BABILong +9pp result on the Modulum API. Document the exact M_shared generator behavior. Lock the baseline. Falsifier: if reproduction fails, the entire build hypothesis breaks.
Engineering effort: 1 person, 2 weeks. Lift: low (we have the API + the recipe).
Phase 2: Implement ΔM_k as horizon-specific corrections to M_shared. Three sub-variants to evaluate (sketched below): additive · multiplicative · position-encoded. Measure substrate-overlap empirically — what fraction of MTP heads actually need deltas?
Engineering effort: 1-2 senior engineers, 2-3 weeks. The IP innovation lives here.
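A sketch of the three ΔM_k composition variants and the overlap measurement from Phase 2, assuming masks are numpy arrays over KV positions; the exact semantics are Phase 2's open question, so treat these as candidate shapes:

```python
import numpy as np

def apply_delta(m_shared, delta, variant):
    """Candidate ΔM_k compositions for Phase 2 (masks as float score arrays
    over KV positions; thresholding to a boolean mask happens downstream)."""
    if variant == "additive":          # M_k = M_shared + ΔM_k
        return m_shared + delta
    if variant == "multiplicative":    # M_k = M_shared · (1 + ΔM_k)
        return m_shared * (1.0 + delta)
    if variant == "position-encoded":  # ΔM_k overrides only where it speaks
        return np.where(delta != 0.0, delta, m_shared)
    raise ValueError(f"unknown variant: {variant}")

def substrate_overlap(m_shared_bool, m_k_ref_bool):
    """Empirical overlap: fraction of KV positions where the shared mask
    already agrees with horizon k's reference mask. High overlap means
    that head needs no ΔM_k at all."""
    return float((m_shared_bool == m_k_ref_bool).mean())
```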
Phase 3: Modify the llama.cpp speculative decoder to call M_shared + ΔM_k instead of dense attention per MTP head. Use the existing draft_n infrastructure; replace the draft attention path with Modulum-conditioned masks (control-flow sketch below).
Engineering effort: 1-2 systems engineers, 2-3 weeks. The integration is OSS-touching.
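The control flow we'd port into the draft_n path, as a Python sketch (the real change is C++ inside llama.cpp); `draft_forward` and `verify` are stand-ins for the modified draft and verifier calls:

```python
def speculative_step(state, horizons, masks, draft_forward, verify):
    """One speculative-decoding step with Modulum-conditioned draft attention.

    draft_forward(state, mask, k) -> draft token for horizon k
    verify(state, drafts)         -> (n_accepted, next_token) from the verifier
    Both are stand-ins for the modified llama.cpp draft_n path.
    """
    drafts = []
    for k in sorted(horizons):
        # the dense per-head draft attention is replaced by the masked path
        drafts.append(draft_forward(state, masks[k], k))
    n_accepted, next_token = verify(state, drafts)
    return drafts[:n_accepted] + [next_token]   # tokens emitted this step
```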
Phase 4: Run BABILong qa1 + qa2-qa20 at 32k/64k/128k with the delta-mask architecture. Target: ≥8× decode at 128k Q4_K_M; ≤2pp qa1 regression. Build the Rejection-Position Profiler (per-head acceptance rate) as part of the measurement harness (profiler sketch below).
Engineering effort: 1 person, 2 weeks. Direct extension of Phase 1 work.
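A minimal shape for the Rejection-Position Profiler, assuming the harness can observe each draft token's accept/reject outcome; names are illustrative:

```python
from collections import defaultdict

class RejectionPositionProfiler:
    """Per-MTP-head draft acceptance tracker for the Phase 4 harness."""

    def __init__(self):
        self.proposed = defaultdict(int)
        self.accepted = defaultdict(int)

    def record(self, head_k, was_accepted):
        """Call once per drafted token with the verifier's accept/reject."""
        self.proposed[head_k] += 1
        self.accepted[head_k] += int(was_accepted)

    def acceptance_rates(self):
        return {k: self.accepted[k] / self.proposed[k] for k in self.proposed}
```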
Phase 5: Add 99.9% uptime SLA infrastructure (status page, kickback receipts, error-budget tracking). The Tundra 503 we saw needs reliability ops — Phase 5 closes this. Deploy to `gemma4-mtp.hypernym.ai` as a separate endpoint until Phase 6 ships.
Engineering effort: 1 SRE, 2-3 weeks. Standard service-reliability work.
Phase 6: Expose Retention Receipts (per-head acceptance + depth-band confidence + dropped-evidence risk) as a structured response field (schema sketch below). Customer-facing diagnostic; the Codex R19 R2 outlier, productized.
Engineering effort: 1 senior, 2-3 weeks. Wraps the data already collected in Phase 4.
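One possible shape for the structured field, sketched as a dataclass; field names are placeholders, not the shipped schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class RetentionReceipt:
    per_head_acceptance: dict[int, float]    # {horizon k: acceptance rate}, from Phase 4
    depth_band_confidence: dict[str, float]  # e.g. {"32k": ..., "64k": ..., "128k": ...}
    dropped_evidence_risk: float             # risk that M_shared excluded needed context

# attached as a structured field on the API response, e.g.:
#   response["retention_receipt"] = asdict(receipt)
```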
| OSS Component | Why we use it | License |
|---|---|---|
| llama.cpp | Existing inference runtime (Tundra). Built-in speculative decoding with draft_n support. C++ for kernel-level performance. | MIT |
| vLLM | Alternative production runtime if llama.cpp Tundra reliability issues persist. PagedAttention + speculative decoding. | Apache 2.0 |
| booydar/babilong | Standard public benchmark. Repo includes the reproduction recipe. We already validate against it. | MIT |
| transformers (Hugging Face) | Reference implementation for tokenization, chat templates, base-model loading. | Apache 2.0 |
| Gemma 4 31B weights | Google open-weights release. Foundation we apply M5 to. | Gemma Terms of Use (permissive) |
| Medusa / EAGLE / Speculative-Decoding research code | Reference implementations for multi-token prediction architectures we can compose with. | Mostly MIT/Apache |
Hypernym's competitive advantage is not in rewriting llama.cpp or inventing speculative decoding from scratch. It's in the specific algorithmic glue — M5 mask generation, Substrate Delta Masks, Substrate Policies — that we apply on top. We compose proven OSS primitives, wrap them with patented Hypernym IP, and ship a production system that frontier labs cannot replicate without licensing our patents.
Three components form Hypernym's proprietary moat. Each is patented (M5/Modulum) or patentable (Substrate Delta Masks is R19 R2 novel):
M5 mask generator: query → M_shared. This is the attention-is-noise-reduction kernel, patent-protected per the Modulum filing. Without this function, the entire compound speedup disappears.

Within 30 days of build kickoff: file a provisional patent for Substrate Delta Masks. Within 90 days: file a patent for the Substrate Policy Registry + auction mechanism. The R19 R2 contribution is novel enough to support both. Patent filings should reference the BABILong benchmark validation as evidence of reduction to practice.
Each phase has a clear pass/kill gate. The build either delivers the compound speedup or it falsifies the architecture cleanly.
Live API today: +2/+6/+9pp at 32k/64k/128k vs vanilla Gemma. Phase 4 must show the same retention curve under the delta-mask architecture. Quality regression target ≤2pp.
A live validation probe is running concurrently with this doc; results are in the validation section below.
Same Gemma 4 31B Q4_K_M, same hardware. Measure tokens/sec at 32k/64k/128k context with naive MTP vs Substrate Delta Masks. Target ≥8× compound at 128k.
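A minimal throughput probe for this gate, with `generate` as a stand-in for whichever runtime call gets wired up:

```python
import time

def decode_tokens_per_sec(generate, prompt, n_tokens=256):
    """Decode-throughput probe: run once with naive MTP, once with Substrate
    Delta Masks, same prompt and hardware; the 128k-context ratio is the
    compound-speedup number. `generate` stands in for the runtime call."""
    t0 = time.perf_counter()
    generate(prompt, max_tokens=n_tokens)
    return n_tokens / (time.perf_counter() - t0)
```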
Phase 4 measures per-MTP-head draft acceptance rate. Target: head 1 ≥95%, head 4 ≥70%, head 8 ≥40%. Below these = compound architecture is leaking; tune ΔM_k.
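The same gate as executable thresholds, consuming the Phase 4 profiler's `acceptance_rates()` output:

```python
ACCEPTANCE_GATES = {1: 0.95, 4: 0.70, 8: 0.40}  # per-head targets from above

def gates_pass(rates):
    """True iff every gated head clears its target; feed it
    RejectionPositionProfiler.acceptance_rates() from Phase 4."""
    return all(rates.get(k, 0.0) >= target for k, target in ACCEPTANCE_GATES.items())
```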
| Bucket | Estimate | Notes |
|---|---|---|
| 2 senior ML engineers · 12 weeks @ $100/hr | $48,000 / engineer · ~$96K total | Substrate Delta Mask generator (proprietary IP development) |
| 1 systems engineer · 8 weeks | ~$32K | llama.cpp kernel integration; Phase 3 + Phase 5 reliability |
| 1 SRE · 4 weeks | ~$16K | Production hardening; 99.9% uptime ops; status page |
| GPU compute · benchmark + experiments | ~$15-25K | BABILong runs at 128k are GPU-expensive; needs A100/H100 hours |
| Patent filings (2 provisional, 1 full) | ~$20K | Substrate Delta Masks + Substrate Policy Registry within 30 + 90 days |
| Total all-in | ~$180K-220K | (Original $80K estimate undercounted patent + GPU compute) |
Revised total ~$180-220K. Still well under the partnership-required dollar amount — buildable internally with the existing Hypernym team if 2 engineers can dedicate 12 weeks. The marginal cash cost is GPU compute + patent filings; the engineering time is capacity Hypernym already has.
This entire build is feasible without a frontier-lab JV. Inference-time-only Substrate Delta Masks at N=2 achieves the 6× compound that's structurally sufficient for product-market fit. Google JV would extend to N=4-8 with trained-in defaults, but it's optional, not required. Hypernym can ship Modulum × MTP entirely on its own, controlling the IP, controlling the timeline.
Real BABILong qa1 probe against live Modulum API. n=13 samples across 32k/64k/128k. Results below are honest and unedited.
| Length | Probe result | Paper claim | Verdict |
|---|---|---|---|
| 32k | 5/5 = 100% | 89% Modulum (paper) | Validates · exceeds paper on n=5 |
| 64k | 0/5 (4 timeouts/server-500s + 1 wrong answer) | 80% Modulum (paper) | Cannot validate — API reliability |
| 128k | 0/3 (2 timeouts + 1 server-500) | 69% Modulum (paper) | Cannot validate — API reliability |
5 of 5 BABILong qa1 samples returned correct answers at 32k context — consistent with (and on this small sample exceeding) the published 89% Modulum claim. Average latency ~88 seconds per call. The foundation is real; Phase 1 baseline reproduction can begin Day 1.
At 64k and 128k, the Tundra backend on :8090 (the same one that returned a 503 earlier this session) returns timeouts (calls run past the 120s client timeout, ~121s observed) or HTTP 500 Internal Server Errors. Only ONE 64k call completed in time (sample index 3), and it returned "The text does not provide a location for a person named Sandra" — a hallucinated refusal, not a missed retrieval. The architecture isn't being tested at these lengths; the infrastructure can't sustain the workload.
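For reproducibility, the shape of the probe loop, assuming an HTTP completion endpoint and a 120s client timeout; the URL and payload fields are placeholders, not the documented Modulum API:

```python
import requests

# Placeholder endpoint/payload -- NOT the documented Modulum API surface.
URL = "http://localhost:8090/v1/completions"

def probe(samples):
    """BABILong qa1 probe: tally correct / wrong / timeout / server-error per call."""
    tally = {"correct": 0, "wrong": 0, "timeout": 0, "server_error": 0}
    for s in samples:
        try:
            r = requests.post(URL, json={"prompt": s["prompt"]}, timeout=120)
            if r.status_code >= 500:
                tally["server_error"] += 1           # the Tundra 500s
            elif s["answer"].lower() in r.text.lower():
                tally["correct"] += 1
            else:
                tally["wrong"] += 1                  # e.g. the hallucinated refusal
        except requests.Timeout:
            tally["timeout"] += 1                    # calls running past 120s
    return tally
```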
This is a Phase 5 critical-path issue. Production hardening (99.9% uptime SLA + status page + reliability ops) must happen concurrently with Phase 1 baseline reproduction, not after it. Without Phase 5, customers can't independently validate the longer-context claims.
What this means for distribution:
Honest framing for sending externally: the foundational architectural claim (Modulum solves lost-in-the-middle) holds at the length where the API is currently stable. The published benchmark numbers come from Hypernym internal infrastructure. Customer-side reproducibility at 64k+ is gated on Phase 5 reliability work, which the build proposal sequences explicitly. Be transparent about this with Chris and in any external sharing — overclaiming production-readiness at 128k could erode trust if customers run their own probe and hit the same timeouts I did.
Send to:
Do NOT send to: