Headroom Cuts LLM Token Costs by Up to 95% — Without Changing Your Answers
The open-source context compression layer strips fat from tool outputs, logs, RAG chunks, and conversation history before they hit the model — preserving accuracy while slashing spend.
Token costs in agentic pipelines don't come from clever reasoning — they come from bloat. A single SRE debugging session can balloon to 65,000 tokens before the LLM sees its first instruction. Headroom attacks that problem directly: it sits between your agent and the LLM provider, compresses everything the model reads, and hands back originals on demand.
The project has hit 18.1k stars on GitHub Trending this week and ships as a Python library, a drop-in proxy, an MCP server, and a one-command agent wrapper.
What Gets Compressed, and How
Headroom's pipeline routes every chunk of incoming context through a ContentRouter that detects the content type, then dispatches to the right compressor:
- SmartCrusher — handles JSON payloads (tool outputs, API responses)
- CodeCompressor — AST-based compression for source files
- Kompress-base — a Hugging Face model for prose, logs, and RAG text
A CacheAligner stage stabilizes prompt prefixes so provider KV caches actually hit across calls — a subtle but meaningful multiplier on savings. The whole thing runs locally; your data never leaves the machine.
The standout mechanism is CCR (reversible compression). Originals are stored locally and never deleted. If the LLM needs the full text — say, to quote a specific line — it calls headroom_retrieve. Compression becomes a retrieval problem rather than a lossy transform.
Real-Workload Numbers
The README publishes savings on four agent workloads:
| Workload | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
The live demo shows a log compressed from 10,144 to 1,260 tokens with the same FATAL surfaced. The codebase-exploration workload at 47% is the outlier — highly structured, deduplicated source trees compress less aggressively than raw logs.
Accuracy holds up on standard benchmarks: GSM8K math reasoning stays flat at 0.870, TruthfulQA actually ticks up by +0.030, SQuAD v2 QA hits 97% recall at 19% compression, and BFCL tool-calling benchmark scores 97% at 32% compression.
Integration Modes
Headroom is designed to slot into whatever agentic stack you're already running:
# Install
pip install "headroom-ai[all]" # Python
npm install headroom-ai # TypeScript
# Wrap an existing agent (zero code changes)
headroom wrap claude
headroom wrap cursor
headroom wrap codex
# Drop-in OpenAI-compatible proxy
headroom proxy --port 8787
For inline use in LangChain, Agno, or Strands:
from headroom import compress
compressed = compress(messages) # drop-in before your LLM call
The MCP server exposes headroom_compress, headroom_retrieve, and headroom_stats tools, making it accessible to any MCP-compatible client.
There's also a cross-agent memory layer with auto-deduplication shared across Claude, Codex, and Gemini — and headroom learn, which mines failed agent sessions and writes corrections back to CLAUDE.md or AGENTS.md.
Worth Watching
The architecture is solid and the benchmark methodology is reproducible (the README points to python -m h... for verification). The 47–92% range is wide by design — compression ratios depend heavily on input entropy, and the authors show both ends honestly.
For teams running agentic pipelines at scale — think CI bots, SRE automation, or large RAG deployments — shaving 73–92% off context tokens is the kind of infrastructure win that pays for itself quickly. The local-first, reversible approach also sidesteps the usual anxiety about lossy preprocessing breaking downstream reasoning.
Discussion 0
Join the discussion
Sign in with GitHub to comment and vote.
No comments yet
Be the first to weigh in.