For: Chris · CTO · Hypernym
2026-05-11 · drafted from R19 synthesis · validation probe running concurrently
Concrete engineering plan to compound 3× MTP with 3× Modulum on Gemma 4 Q4_K_M, targeting ≥8× decode speedup at 128k context with ≤2pp BABILong qa1 regression. Built as a compilation of open-source primitives (llama.cpp, vLLM, transformers, BABILong runner) wrapped around Hypernym's patented M5 sparsity + substrate-policy IP. Most of the build is integration; the proprietary part is the substrate-policy logic that sits between the primitives.
Modulum produces ONE shared attention mask M_shared from the query. This shared mask is valid for proximate MTP heads (k=1, k=2), where attention geometry is approximately invariant under output-position shifts. For deeper or uncertain MTP heads (k=4, k=8), a small horizon-specific delta mask ΔM_k fills in where the geometry shifts matter. M_shared is computed once per verifier pass; deltas are cheap per-head additions applied only when needed.
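A minimal sketch of that composition logic, assuming boolean masks over KV-cache positions and an OR-merge for the deltas; `m5_shared_mask` and `delta_mask` are illustrative stand-ins for the proprietary generators, not the real Hypernym API:

```python
import numpy as np

PROXIMATE_HORIZONS = {1, 2}  # attention geometry ~invariant: reuse M_shared as-is

def build_head_masks(query, kv_len, horizons, m5_shared_mask, delta_mask):
    """Compute M_shared once per verifier pass, then derive per-head masks.

    m5_shared_mask(query, kv_len) -> np.ndarray[bool]  # stand-in for the M5 generator
    delta_mask(query, kv_len, k)  -> np.ndarray[bool]  # stand-in for the ΔM_k generator
    """
    m_shared = m5_shared_mask(query, kv_len)           # computed ONCE
    masks = {}
    for k in horizons:
        if k in PROXIMATE_HORIZONS:
            masks[k] = m_shared                        # shared: zero extra cost
        else:
            # deeper horizons (k=4, k=8): OR-merge a horizon-specific patch
            # over positions where the attention geometry shifts
            masks[k] = m_shared | delta_mask(query, kv_len, k)
    return masks
```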
Compound math, clean:
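A sketch, with ε as my shorthand for end-to-end efficiency losses (draft-acceptance misses plus mask overhead) — an assumption, not a measured quantity:

$$ s_{\text{compound}} \approx s_{\text{MTP}} \times s_{\text{Modulum}} \times \varepsilon = 3 \times 3 \times \varepsilon = 9\,\varepsilon $$

The ≥8× target requires ε ≥ 8/9 ≈ 0.89. The inference-time-only N=2 fallback compounds to 2 × 3 = 6× before losses (see the no-JV note at the end).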
The shared-vs-per-head debate is the wrong frame. It's a substrate-overlap problem. Most output positions have approximately the same query-relevant attention pattern; that pattern can be computed once and shared. Only positions where the pattern materially differs need delta corrections. The delta mask cost is bounded by how often the substrate-overlap fails — a measurable quantity, not a theoretical one.
Hypernym's M5 sparsity logic + substrate-policy training methodology are the proprietary IP. Everything else is composable from existing OSS.
| Component | OSS / Proprietary | Source / Role |
|---|---|---|
| Base model weights (Gemma 4 31B) | OSS | Google open-weights · google/gemma-4-31b-it |
| Quantization (Q4_K_M) | OSS | llama.cpp standard quantization · already deployed in live API |
| Inference runtime (llama.cpp / Tundra) | OSS | llama.cpp; current Modulum API runs Tundra backend (the 503 we saw was here) |
| Speculative decoding scaffold | OSS | llama.cpp draft_n · vLLM speculative decoding · TensorRT-LLM |
| M5 sparsity mask generator | PROPRIETARY | Hypernym M5 patent · core IP · the function that takes query → M_shared |
| Substrate policy registry | PROPRIETARY | Hypernym Modulum patent · per-domain policy bundle; tradeable artifact (R19 Marketplace outlier) |
| Delta mask generator ΔM_k | PROPRIETARY | Hypernym R19 R2 architectural contribution · horizon-specific corrections to M_shared |
| Attention kernel modification | Hybrid | Modified llama.cpp kernel to accept Modulum mask as input; modification is OSS-pluggable but Modulum mask passing is Hypernym |
| Benchmark runner (BABILong) | OSS | github.com/booydar/babilong · standard public benchmark |
| Frontier-comparison data | Open | WenTuoAI MRCR v2 measurements · Awesome Agents leaderboard · published frontier results |
| HyperRemember persistent memory | PROPRIETARY | Hypernym own API · context grows / scales smarter over time |
| Retention Receipt API | PROPRIETARY | Hypernym original product · R18/R19 panel convergent |
Of the 12 components, only 5 are proprietary — and they're all tied to existing Hypernym patents (M5 + Modulum) plus the R19 R2 Substrate Delta Mask architectural contribution. Six are composable open-source primitives, and one (the attention kernel) is a hybrid. This is a compilation problem, not a from-scratch invention problem. Most of the build effort is integration glue; the IP is concentrated in ~3 algorithm modules.
Phases are sequential where dependencies exist: Phases 2-3 feed Phase 4, and Phase 6 wraps Phase 4's output. Phase 5 (reliability) has no code dependencies and can run in parallel from Day 1 — the validation probe below shows it must.
Phase 1: Reproduce the published BABILong +9pp result on the Modulum API. Document the exact M_shared generator behavior. Lock the baseline. Falsifier: if reproduction fails, the entire build hypothesis breaks.
Engineering effort: 1 person, 2 weeks. Lift: low (we have the API + the recipe).
Phase 2: Implement ΔM_k as horizon-specific corrections to M_shared. Three sub-variants to evaluate (sketched below): additive · multiplicative · position-encoded. Measure substrate-overlap empirically — what fraction of MTP heads actually need deltas?
Engineering effort: 1-2 senior engineers, 2-3 weeks. The IP innovation lives here.
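A sketch of the three ΔM_k composition variants and the overlap measurement from Phase 2, assuming masks are numpy arrays over KV positions; the exact semantics are Phase 2's open question, so treat these as candidate shapes:

```python
import numpy as np

def apply_delta(m_shared, delta, variant):
    """Candidate ΔM_k compositions for Phase 2 (masks as float score arrays
    over KV positions; thresholding to a boolean mask happens downstream)."""
    if variant == "additive":          # M_k = M_shared + ΔM_k
        return m_shared + delta
    if variant == "multiplicative":    # M_k = M_shared · (1 + ΔM_k)
        return m_shared * (1.0 + delta)
    if variant == "position-encoded":  # ΔM_k overrides only where it speaks
        return np.where(delta != 0.0, delta, m_shared)
    raise ValueError(f"unknown variant: {variant}")

def substrate_overlap(m_shared_bool, m_k_ref_bool):
    """Empirical overlap: fraction of KV positions where the shared mask
    already agrees with horizon k's reference mask. High overlap means
    that head needs no ΔM_k at all."""
    return float((m_shared_bool == m_k_ref_bool).mean())
```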
Phase 3: Modify the llama.cpp speculative decoder to call M_shared + ΔM_k instead of dense attention per MTP head. Use the existing draft_n infrastructure; replace the draft attention path with Modulum-conditioned masks (control-flow sketch below).
Engineering effort: 1-2 systems engineers, 2-3 weeks. The integration is OSS-touching.
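The control flow we'd port into the draft_n path, as a Python sketch (the real change is C++ inside llama.cpp); `draft_forward` and `verify` are stand-ins for the modified draft and verifier calls:

```python
def speculative_step(state, horizons, masks, draft_forward, verify):
    """One speculative-decoding step with Modulum-conditioned draft attention.

    draft_forward(state, mask, k) -> draft token for horizon k
    verify(state, drafts)         -> (n_accepted, next_token) from the verifier
    Both are stand-ins for the modified llama.cpp draft_n path.
    """
    drafts = []
    for k in sorted(horizons):
        # the dense per-head draft attention is replaced by the masked path
        drafts.append(draft_forward(state, masks[k], k))
    n_accepted, next_token = verify(state, drafts)
    return drafts[:n_accepted] + [next_token]   # tokens emitted this step
```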
Phase 4: Run BABILong qa1 + qa2-qa20 at 32k/64k/128k with the delta-mask architecture. Target: ≥8× decode at 128k Q4_K_M; ≤2pp qa1 regression. Build the Rejection-Position Profiler (per-head acceptance rate) as part of the measurement harness (profiler sketch below).
Engineering effort: 1 person, 2 weeks. Direct extension of Phase 1 work.
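A minimal shape for the Rejection-Position Profiler, assuming the harness can observe each draft token's accept/reject outcome; names are illustrative:

```python
from collections import defaultdict

class RejectionPositionProfiler:
    """Per-MTP-head draft acceptance tracker for the Phase 4 harness."""

    def __init__(self):
        self.proposed = defaultdict(int)
        self.accepted = defaultdict(int)

    def record(self, head_k, was_accepted):
        """Call once per drafted token with the verifier's accept/reject."""
        self.proposed[head_k] += 1
        self.accepted[head_k] += int(was_accepted)

    def acceptance_rates(self):
        return {k: self.accepted[k] / self.proposed[k] for k in self.proposed}
```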
Phase 5: Add 99.9% uptime SLA infrastructure (status page, kickback receipts, error-budget tracking). The Tundra 503 we saw needs reliability ops — Phase 5 closes this. Deploy to `gemma4-mtp.hypernym.ai` as a separate endpoint until Phase 6 ships.
Engineering effort: 1 SRE, 2-3 weeks. Standard service-reliability work.
Phase 6: Expose Retention Receipts (per-head acceptance + depth-band confidence + dropped-evidence risk) as a structured response field (schema sketch below). Customer-facing diagnostic; the Codex R19 R2 outlier, productized.
Engineering effort: 1 senior, 2-3 weeks. Wraps the data already collected in Phase 4.
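One possible shape for the structured field, sketched as a dataclass; field names are placeholders, not the shipped schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class RetentionReceipt:
    per_head_acceptance: dict[int, float]    # {horizon k: acceptance rate}, from Phase 4
    depth_band_confidence: dict[str, float]  # e.g. {"32k": ..., "64k": ..., "128k": ...}
    dropped_evidence_risk: float             # risk that M_shared excluded needed context

# attached as a structured field on the API response, e.g.:
#   response["retention_receipt"] = asdict(receipt)
```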
| OSS Component | Why we use it | License |
|---|---|---|
| llama.cpp | Existing inference runtime (Tundra). Built-in speculative decoding with draft_n support. C++ for kernel-level performance. | MIT |
| vLLM | Alternative production runtime if llama.cpp Tundra reliability issues persist. PagedAttention + speculative decoding. | Apache 2.0 |
| booydar/babilong | Standard public benchmark. Repo includes the reproduction recipe. We already validate against it. | MIT |
| transformers (Hugging Face) | Reference implementation for tokenization, chat templates, base-model loading. | Apache 2.0 |
| Gemma 4 31B weights | Google open-weights release. Foundation we apply M5 to. | Gemma Terms of Use (permissive) |
| Medusa / EAGLE / Speculative-Decoding research code | Reference implementations for multi-token prediction architectures we can compose with. | Mostly MIT/Apache |
Hypernym's competitive advantage is not in rewriting llama.cpp or inventing speculative decoding from scratch. It's in the specific algorithmic glue — M5 mask generation, Substrate Delta Masks, Substrate Policies — that we apply on top. We compose proven OSS primitives, wrap them with patented Hypernym IP, and ship a production system that frontier labs cannot replicate without licensing our patents.
Three components form Hypernym's proprietary moat. Each is patented (M5/Modulum) or patentable (Substrate Delta Masks is R19 R2 novel):
M5 mask generator: query → M_shared. This is the attention-is-noise-reduction kernel, patent-protected per the Modulum filing. Without this function, the entire compound speedup disappears.

Within 30 days of build kickoff: file a provisional patent for Substrate Delta Masks. Within 90 days: file a patent for the Substrate Policy Registry + auction mechanism. The R19 R2 contribution is novel enough to support both. Patent filings should reference the BABILong benchmark validation as evidence of reduction to practice.
Each phase has a clear pass/kill gate. The build either delivers the compound speedup or it falsifies the architecture cleanly.
Live API today: +2/+6/+9pp at 32k/64k/128k vs vanilla Gemma. Phase 4 must show the same retention curve under the delta-mask architecture. Quality regression target ≤2pp.
A live validation probe is running concurrently with this doc; results are in the validation section below.
Same Gemma 4 31B Q4_K_M, same hardware. Measure tokens/sec at 32k/64k/128k context with naive MTP vs Substrate Delta Masks. Target ≥8× compound at 128k.
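A minimal throughput probe for this gate, with `generate` as a stand-in for whichever runtime call gets wired up:

```python
import time

def decode_tokens_per_sec(generate, prompt, n_tokens=256):
    """Decode-throughput probe: run once with naive MTP, once with Substrate
    Delta Masks, same prompt and hardware; the 128k-context ratio is the
    compound-speedup number. `generate` stands in for the runtime call."""
    t0 = time.perf_counter()
    generate(prompt, max_tokens=n_tokens)
    return n_tokens / (time.perf_counter() - t0)
```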
Phase 4 measures per-MTP-head draft acceptance rate. Target: head 1 ≥95%, head 4 ≥70%, head 8 ≥40%. Below these = compound architecture is leaking; tune ΔM_k.
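The same gate as executable thresholds, consuming the Phase 4 profiler's `acceptance_rates()` output:

```python
ACCEPTANCE_GATES = {1: 0.95, 4: 0.70, 8: 0.40}  # per-head targets from above

def gates_pass(rates):
    """True iff every gated head clears its target; feed it
    RejectionPositionProfiler.acceptance_rates() from Phase 4."""
    return all(rates.get(k, 0.0) >= target for k, target in ACCEPTANCE_GATES.items())
```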
| Bucket | Estimate | Notes |
|---|---|---|
| 2 senior ML engineers · 12 weeks @ $100/hr | $48,000 / engineer · ~$96K total | Substrate Delta Mask generator (proprietary IP development) |
| 1 systems engineer · 8 weeks | ~$32K | llama.cpp kernel integration; Phase 3 + Phase 5 reliability |
| 1 SRE · 4 weeks | ~$16K | Production hardening; 99.9% uptime ops; status page |
| GPU compute · benchmark + experiments | ~$15-25K | BABILong runs at 128k are GPU-expensive; needs A100/H100 hours |
| Patent filings (2 provisional, 1 full) | ~$20K | Substrate Delta Masks + Substrate Policy Registry within 30 + 90 days |
| Total all-in | ~$180K-220K | (Original $80K estimate undercounted patent + GPU compute) |
Revised total ~$180-220K. Still well under the partnership-required dollar amount — buildable internally with the existing Hypernym team if 2 engineers can dedicate 12 weeks. The marginal cash cost is GPU compute + patent filings; the engineering time is capacity Hypernym already has.
This entire build is feasible without a frontier-lab JV. Inference-time-only Substrate Delta Masks at N=2 achieves the 6× compound that's structurally sufficient for product-market fit. Google JV would extend to N=4-8 with trained-in defaults, but it's optional, not required. Hypernym can ship Modulum × MTP entirely on its own, controlling the IP, controlling the timeline.
Real BABILong qa1 probe against live Modulum API. n=13 samples across 32k/64k/128k. Results below are honest and unedited.
| Length | Probe result | Paper claim | Verdict |
|---|---|---|---|
| 32k | 5/5 = 100% | 89% Modulum (paper) | Validates · exceeds paper on n=5 |
| 64k | 0/5 (4 timeouts/server-500s + 1 wrong answer) | 80% Modulum (paper) | Cannot validate — API reliability |
| 128k | 0/3 (2 timeouts + 1 server-500) | 69% Modulum (paper) | Cannot validate — API reliability |
5 of 5 BABILong qa1 samples returned correct answers at 32k context — consistent with (and on this small sample exceeding) the published 89% Modulum claim. Average latency ~88 seconds per call. The foundation is real; Phase 1 baseline reproduction can begin Day 1.
At 64k and 128k, the Tundra backend on :8090 (the same one that returned a 503 earlier this session) returns timeouts (calls run past the 120s client timeout, ~121s observed) or HTTP 500 Internal Server Errors. Only ONE 64k call completed in time (sample index 3), and it returned "The text does not provide a location for a person named Sandra" — a hallucinated refusal, not a missed retrieval. The architecture isn't being tested at these lengths; the infrastructure can't sustain the workload.
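For reproducibility, the shape of the probe loop, assuming an HTTP completion endpoint and a 120s client timeout; the URL and payload fields are placeholders, not the documented Modulum API:

```python
import requests

# Placeholder endpoint/payload -- NOT the documented Modulum API surface.
URL = "http://localhost:8090/v1/completions"

def probe(samples):
    """BABILong qa1 probe: tally correct / wrong / timeout / server-error per call."""
    tally = {"correct": 0, "wrong": 0, "timeout": 0, "server_error": 0}
    for s in samples:
        try:
            r = requests.post(URL, json={"prompt": s["prompt"]}, timeout=120)
            if r.status_code >= 500:
                tally["server_error"] += 1           # the Tundra 500s
            elif s["answer"].lower() in r.text.lower():
                tally["correct"] += 1
            else:
                tally["wrong"] += 1                  # e.g. the hallucinated refusal
        except requests.Timeout:
            tally["timeout"] += 1                    # calls running past 120s
    return tally
```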
This is a Phase 5 critical-path issue. Production hardening (99.9% uptime SLA + status page + reliability ops) must happen concurrently with Phase 1 baseline reproduction, not after it. Without Phase 5, customers can't independently validate the longer-context claims.
What this means for distribution:
Honest framing for sending externally: the foundational architectural claim (Modulum solves lost-in-the-middle) holds at the length where the API is currently stable. The published benchmark numbers come from Hypernym internal infrastructure. Customer-side reproducibility at 64k+ is gated on Phase 5 reliability work, which the build proposal sequences explicitly. Be transparent about this with Chris and in any external sharing — overclaiming production-readiness at 128k could erode trust if customers run their own probe and hit the same timeouts I did.
Send to:
Do NOT send to: