Context Optimization API

Context compression for LLM pipelines

HighSNR compresses long documents to fit your token budget before they reach your LLM, cutting costs without sacrificing answer quality.

✕ No AI involved · 🔒 Zero data retention · = Same input → same output

Where it fits

LLM calls

Compress before you send

Pass a long document and a token budget. Get back only the chunks that matter. Fewer tokens sent means lower cost, faster responses, and less hallucination from noise.

RAG Memory & Embeddings

Fewer vectors, less noise

Before embedding a large corpus, compress documents first. Fewer, higher-quality chunks mean less storage, faster retrieval, and less noise in your vector store.
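In a RAG ingestion pipeline, the compression step sits between loading and embedding. A minimal sketch, where `compress` stands in for a call to the optimize endpoint and `embed` for your embedding model (both names are illustrative, not part of HighSNR):

```python
def ingest(documents, compress, embed, budget_tokens=2000):
    """Compress each document to its most relevant chunks, then embed
    only those chunks instead of the full text."""
    vectors = []
    for doc in documents:
        for chunk in compress(doc, budget_tokens):  # e.g. selected_chunks from the API
            vectors.append((chunk, embed(chunk)))
    return vectors
```

The vector store then holds fewer, denser entries, which is where the storage and retrieval savings come from.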

How it works

✕

No AI inside

No model, no black box, no randomness. You know exactly what it does.

=

Deterministic

Same document, same budget, same output. Every time. No model drift, no surprises in production.

🔒

Zero retention

Your documents are never stored, logged, or used for training. Only counters and metadata are kept.

⚡

Fast

Sub-second for most documents. No local model download, no GPU required.

Benchmark

LongBench v1 · GPT-4o · n=200 per dataset · QA F1 score

Evaluated on one multi-hop and one single-hop QA dataset. Higher is better.

HotpotQA

Config                   50%     60%     70%     80%     Full doc
No hint                  65.29   66.34   68.08   70.70   –
With query hint          67.28   68.02   69.95   70.96   –
Full context (baseline)  –       –       –       –       69.71

At 80% budget, HighSNR beats full-context F1.

Actual token ratio (output / input) – HotpotQA

Target   Mean    Median   Min     Max
50%      55.9%   55.4%    41.6%   71.7%
60%      67.9%   67.3%    55.0%   83.9%
70%      79.8%   79.1%    69.5%   99.9%
80%      91.4%   90.8%    81.1%   100.0%

Qasper

Config                   50%     60%     70%     80%     Full doc
No hint                  35.51   38.16   41.36   45.37   –
With query hint          39.87   40.76   42.97   45.21   –
Full context (baseline)  –       –       –       –       47.22

At 80% budget, HighSNR retains 96% of full-context F1 on scientific QA.

Actual token ratio (output / input) – Qasper

Target   Mean    Median   Min     Max
50%      54.7%   54.4%    37.5%   69.5%
60%      66.4%   66.2%    47.3%   79.7%
70%      78.0%   77.6%    69.2%   92.0%
80%      89.9%   89.5%    79.4%   100.0%

Actual ratios run above the target because HighSNR never cuts a chunk mid-sentence. Chunks are selected whole: if the next chunk would overflow the budget it is skipped, so most outputs land just below the target. Short documents, where a single chunk can span the full budget, pull the mean slightly above the target percentage.
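The whole-chunk selection rule can be sketched as a greedy pass over pre-split chunks (an illustration of the behavior described above, not HighSNR's actual implementation; the word-count tokenizer is a stand-in):

```python
def select_chunks(chunks, budget_tokens, n_tokens):
    """Greedily keep whole chunks, in order, that still fit the budget;
    a chunk that would overflow the budget is skipped, never truncated."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks):
        cost = n_tokens(chunk)
        if used + cost <= budget_tokens:
            selected.append(i)
            used += cost
    return selected

# Crude word-count tokenizer as a stand-in for a real one.
word_count = lambda text: len(text.split())
```

With a 9-token budget over chunks of 5, 10, and 3 words, the 10-word chunk is skipped but the 3-word chunk is still taken, so the output lands just under budget rather than exactly at it.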

Latency

Live API calls · 0.5 vCPU / 1 GB · n=3,200

Tokens     Median     Mean
< 5k       770 ms     777 ms
5k–10k     1,102 ms   1,142 ms
10k–20k    1,792 ms   1,833 ms

API

One endpoint. Pass your document and a token budget. Get back the most relevant chunks.

POST /v1/optimize
curl https://api.high-snr.com/v1/optimize \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document": "your long document text...",
    "budget": { "value": 2000 },
    "context_hint": "what is the main finding?"
  }'
Response
{
  "selected_chunks": [
    "Most relevant passage from your document...",
    "Second most relevant passage..."
  ],
  "selected_chunk_indices": [2, 5]
}

document

The full text to compress. Plain string.

budget.value

Max tokens in the output. Integer token budget for the selected chunks.

context_hint

Optional query string. Biases selection toward relevant chunks.
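Putting the parameters together, a stdlib-only Python client for the call above might look like this (a sketch: the endpoint and field names are as documented, but the helper functions are our own, not an official SDK):

```python
import json
import os
import urllib.request

API_URL = "https://api.high-snr.com/v1/optimize"

def build_payload(document, budget_tokens, context_hint=None):
    """Assemble the JSON body: document, budget.value, optional context_hint."""
    payload = {"document": document, "budget": {"value": budget_tokens}}
    if context_hint is not None:
        payload["context_hint"] = context_hint
    return payload

def optimize(document, budget_tokens, context_hint=None):
    """POST the document and budget; return the parsed response containing
    selected_chunks and selected_chunk_indices."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(document, budget_tokens, context_hint)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Joining `selected_chunks` with newlines gives you the compressed context to place in your LLM prompt.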