Confidential · Hypernym Internal · Engineering Build Proposal · 2026-05-11

For: Chris · CTO · Hypernym

Modulum × MTP — Build Proposal (Substrate Delta Masks)

2026-05-11 · drafted from R19 synthesis · validation probe running concurrently

Concrete engineering plan to compound 3× MTP with 3× Modulum on Gemma 4 Q4_K_M, targeting ≥8× decode at 128k context with ≤2% BABILong qa1 regression. Built as a compilation of open-source primitives (llama.cpp, vLLM, transformers, BABILong runner) wrapped around Hypernym's patented M5 sparsity + substrate-policy IP. Most of the build is integration; the proprietary part is the substrate policy logic that sits between them.

≥8× · target compound decode
≤2% · BABILong qa1 tolerance
8-12 weeks · build estimate
~$180-220K · est. all-in build cost (revised from ~$80K; see §08)
01 · The architecture (one paragraph + one formula)

Substrate Delta Masks resolve the MTP × Modulum compounding problem

M_head,k = M_shared(query) + ΔM_k(query, k)

Modulum produces ONE shared attention-mask M_shared from the query. This shared mask is valid for proximate MTP heads (k=1, k=2) where attention geometry is approximately invariant under output-position shifts. For deeper or uncertain MTP heads (k=4, k=8), a small horizon-specific delta mask ΔM_k fills in where the geometry shifts matter. Computing M_shared happens once per verifier pass; deltas are cheap per-head additions only when needed.
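A minimal sketch of the composition rule, assuming masks compose as additive score matrices; `m5_shared_mask`, `delta_mask`, and `overlap_score` are hypothetical stand-ins for the proprietary M5 generator, the ΔM_k generator, and substrate-overlap detection, not shipped APIs.

```python
def head_masks(query_repr, horizons, m5_shared_mask, delta_mask, overlap_score,
               overlap_threshold=0.9):
    """Substrate Delta Mask composition: M_head,k = M_shared + ΔM_k.

    All three callables are hypothetical stand-ins for proprietary modules.
    """
    m_shared = m5_shared_mask(query_repr)   # computed ONCE per verifier pass
    masks = {}
    for k in horizons:                      # e.g. [1, 2, 4, 8]
        if k <= 2:
            masks[k] = m_shared             # proximate heads: geometry ~invariant
        elif overlap_score(query_repr, k) >= overlap_threshold:
            masks[k] = m_shared             # substrate overlap holds: no delta needed
        else:
            masks[k] = m_shared + delta_mask(query_repr, k)  # cheap per-head correction
    return masks
```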

Compound math, clean: 3× (Modulum sparse attention) × 3× (MTP multi-token decode) = 9× theoretical. Delta-mask overhead is bounded by the substrate-overlap failure rate, which brings the realistic target to ≥8× compound decode at 128k.

Why this is the right architectural answer

The shared-vs-per-head debate is the wrong frame. It's a substrate-overlap problem. Most output positions have approximately the same query-relevant attention pattern; that pattern can be computed once and shared. Only positions where the pattern materially differs need delta corrections. The delta mask cost is bounded by how often the substrate-overlap fails — a measurable quantity, not a theoretical one.
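Because the failure rate is measurable, a probe like the following sketch could estimate it: run a dense (unmasked) pass on a sample set, log each MTP head's attention distribution, and count heads whose attention mass escapes M_shared. The 10% tolerance and the data layout are assumptions for illustration, not measured values.

```python
import numpy as np

def overlap_failure_rate(per_head_attn: dict[int, np.ndarray],
                         m_shared: np.ndarray, tol: float = 0.10) -> float:
    """Fraction of MTP heads whose attention mass escapes the shared mask.

    per_head_attn: head k -> empirical attention distribution from a dense run
    m_shared     : shared mask from the M5 generator, entries in [0, 1]
    tol          : assumed tolerance on attention mass outside M_shared
    """
    failures = sum(
        1 for attn in per_head_attn.values()
        if float((attn * (1.0 - m_shared)).sum() / attn.sum()) > tol
    )
    return failures / len(per_head_attn)
```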

02 · Open-source primitives + Hypernym proprietary glue

Most of the build is integration · the patented part stays patented

Hypernym's M5 sparsity logic + substrate-policy training methodology are the proprietary IP. Everything else is composable from existing OSS.

| Component | OSS / Proprietary | Source / Role |
|---|---|---|
| Base model weights (Gemma 4 31B) | OSS | Google open-weights · google/gemma-4-31b-it |
| Quantization (Q4_K_M) | OSS | llama.cpp standard quantization · already deployed in live API |
| Inference runtime (llama.cpp / Tundra) | OSS | llama.cpp; current Modulum API runs the Tundra backend (the 503 we saw was here) |
| Speculative decoding scaffold | OSS | llama.cpp draft_n · vLLM speculative decoding · TensorRT-LLM |
| M5 sparsity mask generator | PROPRIETARY | Hypernym M5 patent · core IP · the function that maps query → M_shared |
| Substrate policy registry | PROPRIETARY | Hypernym Modulum patent · per-domain policy bundle; tradeable artifact (R19 Marketplace outlier) |
| Delta mask generator ΔM_k | PROPRIETARY | Hypernym R19 R2 architectural contribution · horizon-specific corrections to M_shared |
| Attention kernel modification | Hybrid | Modified llama.cpp kernel accepts a Modulum mask as input; the modification is OSS-pluggable but the mask passing is Hypernym's |
| Benchmark runner (BABILong) | OSS | github.com/booydar/babilong · standard public benchmark |
| Frontier-comparison data | Open | WenTuoAI MRCR v2 measurements · Awesome Agents leaderboard · published frontier results |
| HyperRemember persistent memory | PROPRIETARY | Hypernym's own API · context grows and scales smarter over time |
| Retention Receipt API | PROPRIETARY | Hypernym original product · R18/R19 panel convergent |

The proprietary surface area is small + concentrated

Of the 12 components above, only 5 are proprietary — and they're all tied to existing Hypernym patents (M5 + Modulum) plus the R19 R2 Substrate Delta Mask architectural contribution. The remaining 7 are composable open-source or open-data primitives, with one hybrid kernel modification straddling the line. This is a compilation problem, not a from-scratch invention problem. Most of the build effort is integration glue; the IP is concentrated in ~3 algorithm modules.

03 · System architecture

How the pieces compose at inference time

┌───────────────────────────────────────┐
│ Client (OpenAI-compatible request)    │
└───────────────────┬───────────────────┘
                    │
┌───────────────────▼───────────────────┐
│ Hypernym Router (router.hypernym.ai)  │
│  - routes by query type               │
│  - emits Retention Receipt header     │
└───────────────────┬───────────────────┘
                    │
┌───────────────────▼───────────────────┐
│ Modulum Inference Server              │
│ (gemma4.hypernym.ai · llama.cpp)      │
└───────────────────┬───────────────────┘
                    │
┌───────────────────▼─────────────┬─────────────────────────────┐
│ 1. Verifier pass                │ 2. MTP head fan-out         │
│                                 │                             │
│ ┌─────────────────────────────┐ │ ┌─────────────────────────┐ │
│ │ M5 mask generator           │ │ │ MTP head 1 (t+1)        │ │
│ │ query → M_shared            │ │ │ uses M_shared           │ │
│ │ [PROPRIETARY]               │ │ └─────────────────────────┘ │
│ └──────────────┬──────────────┘ │ ┌─────────────────────────┐ │
│                ▼                │ │ MTP head 2 (t+2)        │ │
│ ┌─────────────────────────────┐ │ │ uses M_shared           │ │
│ │ ΔM_k generator              │ │ └─────────────────────────┘ │
│ │ query, k → ΔM_k             │ │                             │
│ │ (called only when           │ │ ┌─────────────────────────┐ │
│ │  substrate-overlap          │ │ │ MTP head k (t+k)        │ │
│ │  detection fires)           │ │ │ uses M_shared + ΔM_k    │ │
│ │ [PROPRIETARY]               │ │ └─────────────────────────┘ │
│ └─────────────────────────────┘ │                             │
└───────────────────┬─────────────┴─────────────────────────────┘
                    │
┌───────────────────▼───────────────────┐
│ llama.cpp speculative decoder         │
│  - draft_n=N (configurable)           │
│  - verifier accepts/rejects drafts    │
│  [OSS + Hypernym kernel mod]          │
└───────────────────┬───────────────────┘
                    │
┌───────────────────▼───────────────────┐
│ Response + Retention Receipt          │
│  - per-head acceptance rates          │
│  - depth-band confidence profile      │
│  - dropped-evidence risk              │
│  [PROPRIETARY]                        │
└───────────────────────────────────────┘
04 · Build phases (8-12 weeks)

Six phases, each falsifiable, each producing shippable code

Phases are sequential where dependencies exist; adjacent phases overlap per the week ranges below, and the Phase 5 reliability work should start concurrent with Phase 1 (see the probe findings in §10).

Phase 1 — Baseline reproduction

week 1-2

Reproduce the published BABILong +9pp result on Modulum API. Document the exact M_shared generator behavior. Lock the baseline. Falsifier: if reproduction fails, the entire build hypothesis breaks.

Engineering effort: 1 person, 2 weeks. Lift: low (we have the API + the recipe).

Phase 2 — Delta mask generator

week 2-4

Implement ΔM_k as horizon-specific corrections to M_shared. Three sub-variants to evaluate: additive · multiplicative · position-encoded. Measure substrate-overlap empirically — what fraction of MTP heads actually need deltas?

Engineering effort: 1-2 senior engineers, 2-3 weeks. The IP innovation lives here.
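A sketch of the three Phase 2 sub-variants. The additive form is the §01 formula; the multiplicative and position-encoded forms are plausible readings of the labels, to be pinned down during the phase.

```python
import numpy as np

def apply_delta(m_shared: np.ndarray, delta: np.ndarray, variant: str,
                pos_code: np.ndarray | None = None) -> np.ndarray:
    """Phase 2 sub-variants for combining ΔM_k with M_shared (exact forms TBD)."""
    if variant == "additive":           # the §01 baseline: M_shared + ΔM_k
        return m_shared + delta
    if variant == "multiplicative":     # delta rescales shared-mask entries
        return m_shared * (1.0 + delta)
    if variant == "position-encoded":   # delta modulated by a horizon position code
        if pos_code is None:
            raise ValueError("position-encoded variant needs a position code")
        return m_shared + delta * pos_code
    raise ValueError(f"unknown variant: {variant!r}")
```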

Phase 3 — Kernel integration

week 3-5

Modify llama.cpp speculative decoder to call M_shared + ΔM_k instead of dense attention per MTP head. Use existing draft_n infrastructure; replace the draft attention path with Modulum-conditioned masks.

Engineering effort: 1-2 systems engineers, 2-3 weeks. The integration work lives entirely in OSS code paths.
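The real integration is C++ inside llama.cpp's speculative decoder; this Python-level sketch only marks where the Modulum-conditioned masks replace the dense draft attention path. `draft_head` and `verify` are hypothetical stand-ins, not llama.cpp APIs.

```python
def speculative_round(ctx_tokens, masks, draft_head, verify, draft_n=4):
    """One draft/verify round with Modulum-conditioned draft attention.

    masks[k] is M_shared (+ ΔM_k) from §01; draft_head and verify are
    stand-ins for the llama.cpp draft and verifier passes.
    """
    drafts = []
    for k in range(1, draft_n + 1):
        # the key change vs stock speculative decoding: each draft head runs
        # against a Modulum mask instead of dense attention
        drafts.append(draft_head(ctx_tokens + drafts, k, attn_mask=masks[k]))
    n_accepted = verify(ctx_tokens, drafts)   # standard accept/reject semantics
    return ctx_tokens + drafts[:n_accepted]
```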

Phase 4 — Benchmark + measure

week 5-7

Run BABILong qa1 + qa2-qa20 at 32k/64k/128k with delta-mask architecture. Target: ≥8× decode at 128k Q4_K_M; ≤2% qa1 regression. Build the Rejection-Position Profiler (per-head acceptance rate) as part of the measurement harness.

Engineering effort: 1 person, 2 weeks. Direct extension of Phase 1 work.
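The Rejection-Position Profiler reduces to a counting pass over verifier outcomes. A minimal sketch, assuming the harness logs one accept/reject boolean per drafted position per round:

```python
from collections import defaultdict

def per_head_acceptance(rounds: list[list[bool]]) -> dict[int, float]:
    """Acceptance rate per MTP head position (Rejection-Position Profiler core).

    rounds[i][k-1] is True when the k-th drafted token in round i was accepted.
    """
    hits: dict[int, int] = defaultdict(int)
    totals: dict[int, int] = defaultdict(int)
    for outcomes in rounds:
        for k, accepted in enumerate(outcomes, start=1):
            totals[k] += 1
            hits[k] += int(accepted)
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```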

Phase 5 — Production hardening

week 7-9

Add 99.9% uptime SLA infrastructure (status page, kickback receipts, error-budget tracking). The Tundra 503 we saw needs reliability ops — Phase 5 closes this. Deploy to `gemma4-mtp.hypernym.ai` as a separate endpoint until Phase 6 ships.

Engineering effort: 1 SRE, 2-3 weeks. Standard service-reliability work.
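For scale, the 99.9% SLA error budget is concrete and small; a one-liner makes the ops target explicit.

```python
def monthly_error_budget_minutes(sla: float = 0.999, days: int = 30) -> float:
    """Allowed downtime per month under the SLA: 99.9% → 43.2 minutes."""
    return days * 24 * 60 * (1.0 - sla)

print(monthly_error_budget_minutes())  # 43.2
```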

Phase 6 — Retention Receipt API

week 8-12

Expose Retention Receipts (per-head acceptance + depth-band confidence + dropped-evidence risk) as a structured response field. Customer-facing diagnostic. Codex R19 R2 outlier productized.

Engineering effort: 1 senior, 2-3 weeks. Wraps the data already collected in Phase 4.
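An illustrative shape for the structured field. The three field names come from this document; the schema details and the example values are placeholders, not a committed API.

```python
from typing import TypedDict

class RetentionReceipt(TypedDict):
    per_head_acceptance: dict[int, float]    # Phase 4 profiler output per MTP head
    depth_band_confidence: dict[str, float]  # confidence per context-depth band
    dropped_evidence_risk: float             # est. risk that M_shared masked needed evidence

example: RetentionReceipt = {                # placeholder values only
    "per_head_acceptance": {1: 0.97, 2: 0.95, 4: 0.74, 8: 0.41},
    "depth_band_confidence": {"0-32k": 0.93, "32k-64k": 0.88, "64k-128k": 0.81},
    "dropped_evidence_risk": 0.02,
}
```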

05 · OSS components to leverage

Don't rewrite what's already shipped

| OSS Component | Why we use it | License |
|---|---|---|
| llama.cpp | Existing inference runtime (Tundra). Built-in speculative decoding with draft_n support. C++ for kernel-level performance. | MIT |
| vLLM | Alternative production runtime if llama.cpp Tundra reliability issues persist. PagedAttention + speculative decoding. | Apache 2.0 |
| booydar/babilong | Standard public benchmark. Repo includes the reproduction recipe. We already validate against it. | MIT |
| transformers (Hugging Face) | Reference implementation for tokenization, chat templates, base-model loading. | Apache 2.0 |
| Gemma 4 31B weights | Google open-weights release. Foundation we apply M5 to. | Gemma Terms of Use (permissive) |
| Medusa / EAGLE / speculative-decoding research code | Reference implementations for multi-token prediction architectures we can compose with. | Mostly MIT/Apache |

The "compilation" framing is structurally correct

Hypernym's competitive advantage is not in rewriting llama.cpp or inventing speculative decoding from scratch. It's in the specific algorithmic glue — M5 mask generation, Substrate Delta Masks, Substrate Policies — that we apply on top. We compose proven OSS primitives, wrap them with patented Hypernym IP, and ship a production system that frontier labs cannot replicate without licensing our patents.

06 · What stays proprietary

The patentable surface is concentrated and defendable

Three components form Hypernym's proprietary moat. Each is patented (M5/Modulum) or patentable (Substrate Delta Masks is R19 R2 novel):

- M5 sparsity mask generator (patented · M5)
- Substrate Policy Registry + per-domain policy bundles (patented · Modulum)
- Substrate Delta Masks ΔM_k (patentable · R19 R2 novel contribution)

Patent strategy recommendation

Within 30 days of build kickoff: file a provisional patent for Substrate Delta Masks. Within 90 days: file for the Substrate Policy Registry + auction mechanism. The R19 R2 contribution is novel enough to support both. Patent filings should reference the BABILong benchmark validation as reduction-to-practice evidence.

07 · Validation

Three falsifier benchmarks

Each phase has a clear pass/kill gate. The build either delivers the compound speedup or it falsifies the architecture cleanly.

BABILong qa1

existing

Live API today: +2/+6/+9pp at 32k/64k/128k vs vanilla Gemma. Phase 4 must show same retention curve under delta-mask architecture. Quality regression target ≤2pp.

Live validation probe running concurrent with this doc.

Decode wall-clock

new

Same Gemma 4 31B Q4_K_M, same hardware. Measure tokens/sec at 32k/64k/128k context with naive MTP vs Substrate Delta Masks. Target ≥8× compound at 128k.
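A minimal harness sketch for the wall-clock comparison, assuming an OpenAI-compatible completions endpoint with standard usage accounting; the model id is a placeholder. Run identical prompts against the naive-MTP and delta-mask builds on the same hardware and compare.

```python
import time
import requests

def decode_tokens_per_sec(base_url: str, prompt: str, max_tokens: int = 256) -> float:
    """Rough decode throughput for one completion call.

    Includes prefill time, so only compare runs at identical context lengths.
    """
    t0 = time.time()
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={"model": "gemma-4-31b-q4_k_m",   # assumed model id
              "prompt": prompt, "max_tokens": max_tokens},
        timeout=600,
    )
    resp.raise_for_status()
    generated = resp.json()["usage"]["completion_tokens"]
    return generated / (time.time() - t0)
```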

Per-head acceptance

codex R19 outlier

Phase 4 measures per-MTP-head draft acceptance rate. Target: head 1 ≥95%, head 4 ≥70%, head 8 ≥40%. Below these = compound architecture is leaking; tune ΔM_k.

08 · Cost + timeline

~$180-220K all-in (revised from ~$80K), 8-12 weeks to working compound speedup

| Bucket | Estimate | Notes |
|---|---|---|
| 2 senior ML engineers · 12 weeks @ $400/hr | ~$48K/engineer · ~$96K total | Substrate Delta Mask generator (proprietary IP development) |
| 1 systems engineer · 8 weeks | ~$32K | llama.cpp kernel integration; Phase 3 + Phase 5 reliability |
| 1 SRE · 4 weeks | ~$16K | Production hardening; 99.9% uptime ops; status page |
| GPU compute · benchmark + experiments | ~$15-25K | BABILong runs at 128k are GPU-expensive; needs A100/H100 hours |
| Patent filings (2 provisional, 1 full) | ~$20K | Substrate Delta Masks + Substrate Policy Registry within 30 + 90 days |
| Total all-in | ~$180K-220K | Original $80K estimate undercounted patent + GPU compute |

Revised total ~$180-220K. Still well under the partnership-required dollar amount — buildable internally with existing Hypernym team if 2 engineers can dedicate 12 weeks. Material cost of the build is GPU benchmarks + patent filings, not engineering time (which Hypernym has).

09 · What this enables

The downstream products that ride on this build

Independence note

This entire build is feasible without a frontier-lab JV. Inference-time-only Substrate Delta Masks at N=2 achieve the 6× compound (2× MTP × 3× Modulum) that's structurally sufficient for product-market fit. A Google JV would extend to N=4-8 with trained-in defaults, but it's optional, not required. Hypernym can ship Modulum × MTP entirely on its own, controlling the IP and the timeline.

10 · Validation probe — results in

32k validated · 64k/128k blocked by API reliability (not architecture)

Real BABILong qa1 probe against live Modulum API. n=13 samples across 32k/64k/128k. Results below are honest and unedited.

| Length | Probe result | Paper claim | Verdict |
|---|---|---|---|
| 32k | 5/5 = 100% | 89% Modulum (paper) | Validates · exceeds paper on n=5 |
| 64k | 0/5 (4 timeouts/server-500s + 1 wrong answer) | 80% Modulum (paper) | Cannot validate — API reliability |
| 128k | 0/3 (2 timeouts + 1 server-500) | 69% Modulum (paper) | Cannot validate — API reliability |

32k claim is real

5 of 5 BABILong qa1 samples returned correct answers at 32k context. The published 89% Modulum claim is validated (probe exceeded on this small sample). Average latency ~88 seconds per call. Foundation is real; Phase 1 baseline reproduction can begin Day 1.

64k + 128k blocked by API reliability, NOT by architecture

At 64k and 128k, the Tundra backend on :8090 (same one we saw return 503 earlier this session) returns timeouts (~121s exceeds my 120s timeout) or HTTP 500 Internal Server Errors. Only ONE 64k call completed in time (sample index 3) and it returned "The text does not provide a location for a person named Sandra" — a hallucinated refusal, not a missed retrieval. The architecture isn't being tested at these lengths; the infrastructure can't sustain the workload.
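For anyone re-running the probe, the loop reduces to roughly this sketch (120s timeout as described above; the endpoint path, model id, and substring answer-matching rule are assumptions):

```python
import requests

def probe_qa1(base_url: str, samples: list[dict], timeout_s: float = 120.0) -> dict:
    """Tally BABILong qa1 probe outcomes against a live OpenAI-compatible API."""
    tally = {"correct": 0, "wrong": 0, "timeout": 0, "server_error": 0}
    for s in samples:                      # each sample: {"prompt": ..., "answer": ...}
        try:
            r = requests.post(
                f"{base_url}/v1/chat/completions",
                json={"model": "gemma-4-31b-q4_k_m",
                      "messages": [{"role": "user", "content": s["prompt"]}]},
                timeout=timeout_s,
            )
        except requests.exceptions.Timeout:
            tally["timeout"] += 1
            continue
        if r.status_code >= 500:
            tally["server_error"] += 1
        elif s["answer"].lower() in r.json()["choices"][0]["message"]["content"].lower():
            tally["correct"] += 1
        else:
            tally["wrong"] += 1
    return tally
```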

This is a Phase 5 critical-path issue. Production hardening (99.9% uptime SLA + status page + reliability ops) must happen concurrent with Phase 1 baseline reproduction, not after. Without Phase 5, customers can't validate the longer-context claims independently.

What this means for distribution:

Honest framing for sending externally: the foundational architectural claim (Modulum solves lost-in-the-middle) holds at the length where the API is currently stable. The published benchmark numbers come from Hypernym internal infrastructure. Customer-side reproducibility at 64k+ is gated on Phase 5 reliability work, which the build proposal sequences explicitly. Be transparent about this with Chris and in any external sharing: overclaiming production-readiness at 128k could erode trust if customers attempt their own probe and hit the same timeouts I did.

11 · Distribution

Who to send this to + what to redact

Send to:

Do NOT send to: