Prompt Compression

Reduce token usage and AI costs by compressing prompts before they reach your LLM provider. Three methods available: TOON for JSON inputs, Compact for any text, and Combined for maximum compression.

What is Prompt Compression?

Prompt compression reduces the token count of your prompts before they reach your LLM provider, helping you save on API costs without sacrificing quality. LockLLM offers three compression methods that work in both the scan endpoint and proxy mode.

Key features:

  • Three methods: TOON (free, JSON-only, instant), Compact (any text, $0.0001 per use), and Combined (TOON + Compact, $0.0001 per use)
  • Opt-in via the X-LockLLM-Compression header (disabled by default)
  • Security scanning always runs on the original uncompressed text
  • Fail-open design: compression failures never block your request
  • Results cached for 30 minutes to avoid redundant processing

Compression Methods

TOON (Token-Oriented Object Notation)

Converts JSON data to a compact, token-efficient format that removes redundant syntax (braces, quotes, repeated keys) while remaining readable by LLMs.

  • Input: Valid JSON only (non-JSON input is returned unchanged)
  • Cost: FREE
  • Latency: Instant (local transformation, no external service)
  • Savings: 30-60% token reduction for JSON data
  • Best for: RAG pipelines with structured data, API responses, database records, configuration objects

How it works:

TOON removes structural overhead from JSON while preserving all data:

JSON input:
{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}

TOON output:
users[2]{name,age}:
Alice,30
Bob,25

The LLM can still understand the data, but uses significantly fewer tokens to process it.
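The transformation can be sketched for the common uniform-array case. This is an illustrative Python sketch, not LockLLM's implementation, and `toon_sketch` is a hypothetical helper name:

```python
import json

def toon_sketch(text: str) -> str:
    """Illustrative TOON-style encoding for a JSON object whose values
    are uniform arrays of flat objects. Not LockLLM's implementation."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return text  # non-JSON input is returned unchanged, mirroring TOON's behavior
    if not isinstance(data, dict):
        return text
    lines = []
    for key, rows in data.items():
        if not (isinstance(rows, list) and rows
                and all(isinstance(r, dict) for r in rows)):
            return text  # this sketch only handles uniform arrays of objects
        fields = list(rows[0].keys())
        # one header line carries the key, row count, and field names once
        lines.append(f"{key}[{len(rows)}]{{{','.join(fields)}}}:")
        for r in rows:
            lines.append(",".join(str(r[f]) for f in fields))
    return "\n".join(lines)

print(toon_sketch('{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}'))
# users[2]{name,age}:
# Alice,30
# Bob,25
```

The savings come from stating the field names once in the header instead of repeating them per object.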

Compact (ML-based)

Uses advanced ML-based token classification to identify and remove non-essential tokens from any text while preserving meaning and faithfulness.

  • Input: Any text (natural language, code, structured data, or mixed content)
  • Cost: $0.0001 per use
  • Latency: Up to 5 seconds (external ML service with fail-open timeout)
  • Savings: 30-70% token reduction depending on compression rate
  • Best for: Long documents, verbose prompts, cost optimization at scale

Configurable compression rate:

  • 0.3 - Aggressive compression (more tokens removed)
  • 0.5 - Balanced (default)
  • 0.7 - Conservative (preserves more detail)

Set the rate via the X-LockLLM-Compression-Rate header.

Combined (TOON + Compact)

Chains both compression methods for maximum token reduction. Applies TOON first to convert JSON to compact notation, then runs Compact on the result for further compression. For non-JSON input, TOON is skipped and only Compact runs.

  • Input: Any text (JSON gets double-compressed, non-JSON runs Compact only)
  • Cost: $0.0001 per use (same as Compact, since TOON is free)
  • Latency: Same as Compact (up to 5 seconds)
  • Savings: Greater than either method alone for JSON data
  • Best for: Large structured JSON contexts in RAG pipelines, JSON data where you want maximum token reduction

The compression rate setting applies to the Compact step of the combined process.
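The order of operations can be sketched as follows; `toon_fn` and `compact_fn` are stand-in callables for illustration, not LockLLM APIs:

```python
import json

def combined(text: str, toon_fn, compact_fn) -> str:
    """Combined mode: TOON runs first, but only if the input is valid
    JSON; Compact then runs on whatever the first step produced."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return compact_fn(text)        # non-JSON: Compact only
    return compact_fn(toon_fn(text))   # JSON: TOON, then Compact

# Stand-in functions make the ordering visible:
combined('{"a": 1}', lambda t: "TOON:" + t, lambda t: t + "|C")
# -> 'TOON:{"a": 1}|C'
combined("hello", lambda t: "TOON:" + t, lambda t: t + "|C")
# -> 'hello|C'
```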

Configuration

Headers

| Header | Values | Default | Description |
| --- | --- | --- | --- |
| X-LockLLM-Compression | toon, compact, combined | Not set (disabled) | Compression method to apply |
| X-LockLLM-Compression-Rate | 0.3 - 0.7 | 0.5 | Compression aggressiveness for compact and combined methods (lower = more compression) |

How It Works (Scan Endpoint)

  1. Send request with X-LockLLM-Compression header
  2. LockLLM scans the prompt for security threats (using the original text)
  3. The prompt is compressed using the selected method
  4. Response includes compression_result with the compressed text and metadata
  5. Use the compressed text in your own LLM calls
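The flow above can be sketched in Python using only the endpoint and fields documented on this page; `scan_with_compression` and `pick_prompt` are hypothetical helper names:

```python
import json
import urllib.request

SCAN_URL = "https://api.lockllm.com/v1/scan"

def scan_with_compression(api_key: str, text: str, method: str = "toon") -> dict:
    """POST to the scan endpoint with the compression header set."""
    req = urllib.request.Request(
        SCAN_URL,
        data=json.dumps({"input": text}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-LockLLM-Compression": method,
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def pick_prompt(scan_body: dict, original: str) -> str:
    """Prefer the compressed text; compression_result is omitted when
    compression was skipped or failed, so fall back to the original."""
    result = scan_body.get("compression_result")
    return result["compressed_input"] if result else original
```

The fallback in `pick_prompt` mirrors the fail-open design: a missing `compression_result` means you send the original text, never an error.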

How It Works (Proxy Mode)

  1. Send request with X-LockLLM-Compression header
  2. LockLLM scans the original prompt for security threats
  3. After all security checks pass, the prompt is compressed
  4. The compressed prompt is forwarded to your AI provider
  5. Response headers indicate compression method and ratio
  6. You pay less for tokens on the upstream provider

Important: Security scanning always happens on the original uncompressed text. Compression is applied only after scanning, ensuring attackers cannot use compression to bypass security detection.

Examples

Scan Endpoint - TOON

Compress JSON data for free using TOON format:

curl -X POST https://api.lockllm.com/v1/scan \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-LockLLM-Compression: toon" \
  -d '{
    "input": "{\"users\": [{\"name\": \"Alice\", \"age\": 30}, {\"name\": \"Bob\", \"age\": 25}]}"
  }'

Scan Endpoint - Compact

Compress any text using ML-based compression:

curl -X POST https://api.lockllm.com/v1/scan \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-LockLLM-Compression: compact" \
  -H "X-LockLLM-Compression-Rate: 0.4" \
  -d '{
    "input": "Please provide a detailed and comprehensive analysis of the following document including all relevant sections and subsections..."
  }'

Scan Endpoint - Combined

Apply TOON first, then Compact for maximum compression:

curl -X POST https://api.lockllm.com/v1/scan \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-LockLLM-Compression: combined" \
  -H "X-LockLLM-Compression-Rate: 0.5" \
  -d '{
    "input": "{\"users\": [{\"name\": \"Alice\", \"age\": 30, \"email\": \"[email protected]\"}, {\"name\": \"Bob\", \"age\": 25, \"email\": \"[email protected]\"}]}"
  }'

Scan Response (with compression)

{
  "request_id": "req_abc123",
  "safe": true,
  "label": 0,
  "confidence": 95,
  "injection": 5,
  "sensitivity": "medium",
  "compression_result": {
    "method": "toon",
    "compressed_input": "users[2]{name,age}:\nAlice,30\nBob,25",
    "original_length": 62,
    "compressed_length": 35,
    "compression_ratio": 0.56
  },
  "usage": {
    "requests": 1,
    "input_chars": 62
  }
}

Proxy Mode - JavaScript/TypeScript

import OpenAI from 'openai'

const openai = new OpenAI({
  apiKey: process.env.LOCKLLM_API_KEY,
  baseURL: 'https://api.lockllm.com/v1/proxy/openai',
  defaultHeaders: {
    'X-LockLLM-Compression': 'compact',
    'X-LockLLM-Compression-Rate': '0.5'
  }
})

// Prompts are automatically compressed before reaching OpenAI
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: longDocument }]
})

// Check compression headers in raw response:
// X-LockLLM-Compression-Method: compact
// X-LockLLM-Compression-Applied: true
// X-LockLLM-Compression-Ratio: 0.4500

Proxy Mode - Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('LOCKLLM_API_KEY'),
    base_url='https://api.lockllm.com/v1/proxy/openai',
    default_headers={
        'X-LockLLM-Compression': 'compact',
        'X-LockLLM-Compression-Rate': '0.5'
    }
)

# Prompts are automatically compressed before reaching OpenAI
response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': long_document}]
)
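To read the compression headers from Python, recent versions of openai-python expose raw responses via `client.chat.completions.with_raw_response.create(...)`, whose `.headers` mapping can be fed to a small helper like this (`compression_info` is a hypothetical name; the header names are the ones documented on this page):

```python
def compression_info(headers) -> dict:
    """Extract LockLLM's proxy compression headers from any mapping of
    response headers, e.g. the .headers of a with_raw_response result."""
    return {
        "method": headers.get("X-LockLLM-Compression-Method"),
        "applied": headers.get("X-LockLLM-Compression-Applied") == "true",
        "ratio": headers.get("X-LockLLM-Compression-Ratio"),
    }

compression_info({
    "X-LockLLM-Compression-Method": "compact",
    "X-LockLLM-Compression-Applied": "true",
    "X-LockLLM-Compression-Ratio": "0.4500",
})
# -> {'method': 'compact', 'applied': True, 'ratio': '0.4500'}
```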

Proxy Mode - TOON for JSON Data

import OpenAI from 'openai'

const openai = new OpenAI({
  apiKey: process.env.LOCKLLM_API_KEY,
  baseURL: 'https://api.lockllm.com/v1/proxy/openai',
  defaultHeaders: {
    'X-LockLLM-Compression': 'toon'  // Free JSON compression
  }
})

// JSON data in prompt is automatically compressed to TOON format
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{
    role: 'user',
    content: JSON.stringify(retrievedDocuments)
  }]
})

Response Fields (Scan Endpoint)

The compression_result object is included when compression is enabled and successfully applied:

| Field | Type | Description |
| --- | --- | --- |
| compression_result.method | string | Compression method used: "toon", "compact", or "combined" |
| compression_result.compressed_input | string | The compressed version of the input text |
| compression_result.original_length | number | Character count of the original input |
| compression_result.compressed_length | number | Character count after compression |
| compression_result.compression_ratio | number | Ratio of compressed to original length (0-1, lower = better compression) |

Note: compression_result is omitted when compression is disabled, when the input cannot be compressed (e.g., non-JSON input with TOON), or when compression fails.
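The ratio field follows directly from the two length fields; a minimal sketch (the two-decimal rounding here is illustrative, matching the sample scan response):

```python
def compression_ratio(original_length: int, compressed_length: int) -> float:
    # compressed / original, so lower values mean stronger compression
    return round(compressed_length / original_length, 2)

compression_ratio(62, 35)  # -> 0.56, matching the sample scan response above
```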

Response Headers (Proxy Mode)

When prompt compression is enabled, these headers are added to proxy responses:

| Header | Description |
| --- | --- |
| X-LockLLM-Compression-Method | Compression method used: "toon", "compact", or "combined" |
| X-LockLLM-Compression-Applied | "true" or "false": whether compression was successfully applied |
| X-LockLLM-Compression-Ratio | Compression ratio (only included when compression was applied) |

Pricing

| Method | Cost | When Charged |
| --- | --- | --- |
| TOON | FREE | Never |
| Compact | $0.0001 per use | Every time compact compression is used |
| Combined | $0.0001 per use | Every time combined compression is used |

  • TOON is free because it runs locally with no external service
  • Compact and Combined are charged on every use because they run on dedicated ML infrastructure
  • Combined costs the same as Compact because the TOON step is free
  • Compression charges apply regardless of whether the prompt is safe or unsafe
  • Compression charges are in addition to any scan detection or routing fees

When to Use Each Method

Use TOON when:

  • Your prompts contain JSON data (RAG contexts, API responses, database records)
  • You want zero-cost compression
  • You need instant compression with no added latency
  • Your data has uniform arrays of objects (where TOON savings are greatest)

Use Compact when:

  • Your prompts contain natural language text
  • You are sending long documents or verbose prompts
  • You want maximum token savings on any input type
  • The $0.0001 per-use cost is justified by upstream token savings

Use Combined when:

  • Your prompts contain JSON data and you want maximum compression beyond what TOON alone achieves
  • You have large structured JSON contexts in RAG pipelines
  • The additional $0.0001 cost is justified by further savings on upstream tokens
  • You want the benefits of both structural optimization (TOON) and content-level compression (Compact)

Skip compression when:

  • Prompts are already short (under 200 characters)
  • Token costs are not a concern
  • You need exact text preservation (both methods aim to preserve semantic meaning, but neither guarantees the original wording)

Combining with Other Features

Compression works alongside all other LockLLM features:

  • Security scanning: Always runs on original uncompressed text
  • Custom policies: Evaluated on original text before compression
  • PII detection and redaction: PII is detected and optionally stripped first, then compression is applied to the (potentially redacted) text
  • Smart routing: Routing decisions are made independently of compression
  • Abuse detection: Evaluated on original text
  • Response caching: Cache keys include compression settings

Processing order in proxy mode:

Security scan -> PII redaction -> Prompt compression -> Forward to provider

Caching

Compression results are cached for 30 minutes. Identical inputs with the same compression method and rate return cached results, avoiding redundant processing. This is especially beneficial for the Compact method which involves an external ML service call.

Cache keys include the compression method, rate, and a hash of the input text. Different rates or methods produce separate cache entries.
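A hypothetical sketch of such a key (LockLLM's real key format is internal), showing why different methods or rates always land in separate cache entries:

```python
import hashlib

def cache_key(method: str, rate: float, text: str) -> str:
    """Illustrative cache key combining method, rate, and an input hash.
    Any change to method, rate, or input yields a distinct key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"compression:{method}:{rate}:{digest}"
```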

FAQ

Does compression affect security scanning?

No. Security scanning always runs on the original uncompressed text. Compression is applied after all security checks pass. This ensures attackers cannot use compression to bypass security detection.

What happens if compression fails?

Compression uses fail-open design. If the compression service times out (5-second limit for Compact) or encounters an error, the original uncompressed text is forwarded to your provider. Your request is never blocked due to compression failures.

Can I use TOON on non-JSON text?

TOON only works on valid JSON input. If your input is not valid JSON, the original text is returned unchanged and the compression_result field is omitted from the response. No error is returned; TOON handles non-JSON input gracefully.

Is compression applied to system prompts?

In proxy mode, compression is applied only to the last user message in the conversation. System prompts and assistant messages are not compressed.
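In terms of an OpenAI-style messages array, the compression target is selected like this (an illustrative sketch, not LockLLM's code):

```python
def last_user_index(messages):
    """Return the index of the last user message, the only message
    compression targets in proxy mode; None if there is no user message.
    System and assistant messages pass through untouched."""
    for i in range(len(messages) - 1, -1, -1):
        if messages[i].get("role") == "user":
            return i
    return None

msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up with a long document..."},
]
last_user_index(msgs)  # -> 3, only this message would be compressed
```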

How much can I save with compression?

Savings depend on your input type and the compression method:

  • TOON on JSON: 30-60% token reduction (FREE)
  • Compact on natural language: 30-70% token reduction depending on the rate setting ($0.0001 per use)

For example, if you send 1,000 requests with an average of 2,000 tokens per prompt and Compact achieves 50% compression:

  • Token savings: ~1,000,000 tokens saved
  • Compact cost: $0.10 (1,000 x $0.0001)
  • Upstream savings: varies by model (could be $0.01-$75 per 1M tokens depending on provider)
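The arithmetic above can be reproduced with a small helper (`compact_savings` is a hypothetical name; the provider price used below is an assumed example, not a quoted rate):

```python
def compact_savings(num_requests: int, avg_prompt_tokens: int,
                    compression: float, price_per_1m_input: float):
    """Back-of-envelope break-even for Compact. price_per_1m_input is a
    hypothetical provider input price in dollars per 1M tokens."""
    tokens_saved = num_requests * avg_prompt_tokens * compression
    compact_cost = round(num_requests * 0.0001, 4)  # $0.0001 per use
    upstream_savings = tokens_saved / 1_000_000 * price_per_1m_input
    return tokens_saved, compact_cost, upstream_savings

# The example above: 1,000 requests, 2,000 tokens each, 50% compression,
# at an assumed $2.50 per 1M input tokens:
tokens, cost, saved = compact_savings(1000, 2000, 0.5, 2.50)
# tokens -> 1_000_000.0, cost -> 0.1, saved -> 2.5
```

Compression pays for itself whenever `saved` exceeds `cost`, which happens quickly for long prompts on all but the cheapest models.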

Can I combine TOON and Compact?

Yes. Use "combined" as the compression method. Combined mode applies TOON first (converting JSON to compact notation), then runs Compact on the TOON output for further token reduction. For non-JSON input, TOON is skipped and only Compact runs. The cost is $0.0001 per use (same as Compact alone).

Does compression affect response quality?

TOON produces a compact notation that LLMs parse reliably: the data structure and all values are fully preserved. Compact uses ML-based token classification to preserve meaning while removing redundant tokens. If quality is a concern, use a higher compression rate (closer to 0.7) with the Compact method.

Is compression available with all providers?

Yes. Compression works with all 17+ supported providers in proxy mode. In the scan endpoint, compression works independently of any provider.

Updated 8 days ago