Prompt Compression

Reduce token usage and AI costs by compressing prompts before they reach your LLM provider. Three methods available: TOON for JSON inputs, Compact for any text, and Combined for maximum compression.

What is Prompt Compression?

Prompt compression reduces the token count of your prompts before they reach your LLM provider, helping you save on API costs without sacrificing quality. LockLLM offers three compression methods that work in both the scan endpoint and proxy mode.

Key features:

  • Three methods: TOON (free, JSON-only, instant), Compact (any text, $0.0001 per use), and Combined (TOON + Compact, $0.0001 per use)
  • Opt-in via the X-LockLLM-Compression header (disabled by default)
  • Security scanning always runs on the original uncompressed text
  • Fail-open design: compression failures never block your request
  • Results cached for 30 minutes to avoid redundant processing

Compression Methods

TOON (Token-Oriented Object Notation)

Converts JSON data to a compact, token-efficient format that removes redundant syntax (braces, quotes, repeated keys) while remaining readable by LLMs.

  • Input: Valid JSON only (non-JSON input is returned unchanged)
  • Cost: FREE
  • Latency: Instant (local transformation, no external service)
  • Savings: 30-60% token reduction for JSON data
  • Best for: RAG pipelines with structured data, API responses, database records, configuration objects

How it works:

TOON removes structural overhead from JSON while preserving all data:

JSON input:
{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}

TOON output:
users[2]{name,age}:
Alice,30
Bob,25

The LLM can still understand the data, but uses significantly fewer tokens to process it.
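The transformation can be sketched for the common uniform-array case. This is an illustrative Python sketch, not LockLLM's implementation, and `toon_sketch` is a hypothetical helper name:

```python
import json

def toon_sketch(text: str) -> str:
    """Illustrative TOON-style encoding for a JSON object whose values
    are uniform arrays of flat objects. Not LockLLM's implementation."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return text  # non-JSON input is returned unchanged, mirroring TOON's behavior
    if not isinstance(data, dict):
        return text
    lines = []
    for key, rows in data.items():
        if not (isinstance(rows, list) and rows
                and all(isinstance(r, dict) for r in rows)):
            return text  # this sketch only handles uniform arrays of objects
        fields = list(rows[0].keys())
        # one header line carries the key, row count, and field names once
        lines.append(f"{key}[{len(rows)}]{{{','.join(fields)}}}:")
        for r in rows:
            lines.append(",".join(str(r[f]) for f in fields))
    return "\n".join(lines)

print(toon_sketch('{"users": [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]}'))
# users[2]{name,age}:
# Alice,30
# Bob,25
```

The savings come from stating the field names once in the header instead of repeating them per object.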

Compact (ML-based)

Uses advanced ML-based token classification to identify and remove non-essential tokens from any text while preserving meaning and faithfulness.

  • Input: Any text (natural language, code, structured data, or mixed content)
  • Cost: $0.0001 per use
  • Latency: Up to 5 seconds (external ML service with fail-open timeout)
  • Savings: 30-70% token reduction depending on compression rate
  • Best for: Long documents, verbose prompts, cost optimization at scale

Configurable compression rate:

  • 0.3 - Aggressive compression (more tokens removed)
  • 0.5 - Balanced (default)
  • 0.7 - Conservative (preserves more detail)

Set the rate via the X-LockLLM-Compression-Rate header.

Combined (TOON + Compact)

Chains both compression methods for maximum token reduction. Applies TOON first to convert JSON to compact notation, then runs Compact on the result for further compression. For non-JSON input, TOON is skipped and only Compact runs.

  • Input: Any text (JSON gets double-compressed, non-JSON runs Compact only)
  • Cost: $0.0001 per use (same as Compact, since TOON is free)
  • Latency: Same as Compact (up to 5 seconds)
  • Savings: Greater than either method alone for JSON data
  • Best for: Large structured JSON contexts in RAG pipelines, JSON data where you want maximum token reduction

The compression rate setting applies to the Compact step of the combined process.
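The order of operations can be sketched as follows; `toon_fn` and `compact_fn` are stand-in callables for illustration, not LockLLM APIs:

```python
import json

def combined(text: str, toon_fn, compact_fn) -> str:
    """Combined mode: TOON runs first, but only if the input is valid
    JSON; Compact then runs on whatever the first step produced."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return compact_fn(text)        # non-JSON: Compact only
    return compact_fn(toon_fn(text))   # JSON: TOON, then Compact

# Stand-in functions make the ordering visible:
combined('{"a": 1}', lambda t: "TOON:" + t, lambda t: t + "|C")
# -> 'TOON:{"a": 1}|C'
combined("hello", lambda t: "TOON:" + t, lambda t: t + "|C")
# -> 'hello|C'
```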

Configuration

Headers

| Header | Values | Default | Description |
| --- | --- | --- | --- |
| X-LockLLM-Compression | toon, compact, combined | Not set (disabled) | Compression method to apply |
| X-LockLLM-Compression-Rate | 0.3 - 0.7 | 0.5 | Compression aggressiveness for compact and combined methods (lower = more compression) |

How It Works (Scan Endpoint)

  1. Send request with X-LockLLM-Compression header
  2. LockLLM scans the prompt for security threats (using the original text)
  3. The prompt is compressed using the selected method
  4. Response includes compression_result with the compressed text and metadata
  5. Use the compressed text in your own LLM calls
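The flow above can be sketched in Python using only the endpoint and fields documented on this page; `scan_with_compression` and `pick_prompt` are hypothetical helper names:

```python
import json
import urllib.request

SCAN_URL = "https://api.lockllm.com/v1/scan"

def scan_with_compression(api_key: str, text: str, method: str = "toon") -> dict:
    """POST to the scan endpoint with the compression header set."""
    req = urllib.request.Request(
        SCAN_URL,
        data=json.dumps({"input": text}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "X-LockLLM-Compression": method,
        },
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def pick_prompt(scan_body: dict, original: str) -> str:
    """Prefer the compressed text; compression_result is omitted when
    compression was skipped or failed, so fall back to the original."""
    result = scan_body.get("compression_result")
    return result["compressed_input"] if result else original
```

The fallback in `pick_prompt` mirrors the fail-open design: a missing `compression_result` means you send the original text, never an error.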

How It Works (Proxy Mode)

  1. Send request with X-LockLLM-Compression header
  2. LockLLM scans the original prompt for security threats
  3. After all security checks pass, the prompt is compressed
  4. The compressed prompt is forwarded to your AI provider
  5. Response headers indicate compression method and ratio
  6. You pay less for tokens on the upstream provider

Important: Security scanning always happens on the original uncompressed text. Compression is applied only after scanning, ensuring attackers cannot use compression to bypass security detection.

Examples

Scan Endpoint - TOON

Compress JSON data for free using TOON format:

curl -X POST https://api.lockllm.com/v1/scan \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-LockLLM-Compression: toon" \
  -d '{
    "input": "{\"users\": [{\"name\": \"Alice\", \"age\": 30}, {\"name\": \"Bob\", \"age\": 25}]}"
  }'

Scan Endpoint - Compact

Compress any text using ML-based compression:

curl -X POST https://api.lockllm.com/v1/scan \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-LockLLM-Compression: compact" \
  -H "X-LockLLM-Compression-Rate: 0.4" \
  -d '{
    "input": "Please provide a detailed and comprehensive analysis of the following document including all relevant sections and subsections..."
  }'

Scan Endpoint - Combined

Apply TOON first, then Compact for maximum compression:

curl -X POST https://api.lockllm.com/v1/scan \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-LockLLM-Compression: combined" \
  -H "X-LockLLM-Compression-Rate: 0.5" \
  -d '{
    "input": "{\"users\": [{\"name\": \"Alice\", \"age\": 30, \"email\": \"[email protected]\"}, {\"name\": \"Bob\", \"age\": 25, \"email\": \"[email protected]\"}]}"
  }'

Scan Response (with compression)

{
  "request_id": "req_abc123",
  "safe": true,
  "label": 0,
  "confidence": 95,
  "injection": 5,
  "sensitivity": "medium",
  "compression_result": {
    "method": "toon",
    "compressed_input": "users[2]{name,age}:\nAlice,30\nBob,25",
    "original_length": 62,
    "compressed_length": 35,
    "compression_ratio": 0.56
  },
  "usage": {
    "requests": 1,
    "input_chars": 62
  }
}

Proxy Mode - JavaScript/TypeScript

import OpenAI from 'openai'

const openai = new OpenAI({
  apiKey: process.env.LOCKLLM_API_KEY,
  baseURL: 'https://api.lockllm.com/v1/proxy/openai',
  defaultHeaders: {
    'X-LockLLM-Compression': 'compact',
    'X-LockLLM-Compression-Rate': '0.5'
  }
})

// Prompts are automatically compressed before reaching OpenAI
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: longDocument }]
})

// Check compression headers in raw response:
// X-LockLLM-Compression-Method: compact
// X-LockLLM-Compression-Applied: true
// X-LockLLM-Compression-Ratio: 0.4500

Proxy Mode - Python

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get('LOCKLLM_API_KEY'),
    base_url='https://api.lockllm.com/v1/proxy/openai',
    default_headers={
        'X-LockLLM-Compression': 'compact',
        'X-LockLLM-Compression-Rate': '0.5'
    }
)

# Prompts are automatically compressed before reaching OpenAI
response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': long_document}]
)
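To read the compression headers from Python, recent versions of openai-python expose raw responses via `client.chat.completions.with_raw_response.create(...)`, whose `.headers` mapping can be fed to a small helper like this (`compression_info` is a hypothetical name; the header names are the ones documented on this page):

```python
def compression_info(headers) -> dict:
    """Extract LockLLM's proxy compression headers from any mapping of
    response headers, e.g. the .headers of a with_raw_response result."""
    return {
        "method": headers.get("X-LockLLM-Compression-Method"),
        "applied": headers.get("X-LockLLM-Compression-Applied") == "true",
        "ratio": headers.get("X-LockLLM-Compression-Ratio"),
    }

compression_info({
    "X-LockLLM-Compression-Method": "compact",
    "X-LockLLM-Compression-Applied": "true",
    "X-LockLLM-Compression-Ratio": "0.4500",
})
# -> {'method': 'compact', 'applied': True, 'ratio': '0.4500'}
```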

Proxy Mode - TOON for JSON Data

import OpenAI from 'openai'

const openai = new OpenAI({
  apiKey: process.env.LOCKLLM_API_KEY,
  baseURL: 'https://api.lockllm.com/v1/proxy/openai',
  defaultHeaders: {
    'X-LockLLM-Compression': 'toon'  // Free JSON compression
  }
})

// JSON data in prompt is automatically compressed to TOON format
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{
    role: 'user',
    content: JSON.stringify(retrievedDocuments)
  }]
})

Response Fields (Scan Endpoint)

The compression_result object is included when compression is enabled and successfully applied:

| Field | Type | Description |
| --- | --- | --- |
| compression_result.method | string | Compression method used: "toon", "compact", or "combined" |
| compression_result.compressed_input | string | The compressed version of the input text |
| compression_result.original_length | number | Character count of the original input |
| compression_result.compressed_length | number | Character count after compression |
| compression_result.compression_ratio | number | Ratio of compressed to original length (0-1, lower = better compression) |

Note: compression_result is omitted when compression is disabled, when the input cannot be compressed (e.g., non-JSON input with TOON), or when compression fails.
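The ratio field follows directly from the two length fields; a minimal sketch (the two-decimal rounding here is illustrative, matching the sample scan response):

```python
def compression_ratio(original_length: int, compressed_length: int) -> float:
    # compressed / original, so lower values mean stronger compression
    return round(compressed_length / original_length, 2)

compression_ratio(62, 35)  # -> 0.56, matching the sample scan response above
```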

Response Headers (Proxy Mode)

When prompt compression is enabled, these headers are added to proxy responses:

| Header | Description |
| --- | --- |
| X-LockLLM-Compression-Method | Compression method used: "toon", "compact", or "combined" |
| X-LockLLM-Compression-Applied | "true" or "false": whether compression was successfully applied |
| X-LockLLM-Compression-Ratio | Compression ratio (only included when compression was applied) |

Pricing

| Method | Cost | When Charged |
| --- | --- | --- |
| TOON | FREE | Never |
| Compact | $0.0001 per use | Every time compact compression is used |
| Combined | $0.0001 per use | Every time combined compression is used |

  • TOON is free because it runs locally with no external service
  • Compact and Combined are charged on every use because they run on dedicated ML infrastructure
  • Combined costs the same as Compact because the TOON step is free
  • Compression charges apply regardless of whether the prompt is safe or unsafe
  • Compression charges are in addition to any scan detection or routing fees

When to Use Each Method

Use TOON when:

  • Your prompts contain JSON data (RAG contexts, API responses, database records)
  • You want zero-cost compression
  • You need instant compression with no added latency
  • Your data has uniform arrays of objects (where TOON savings are greatest)

Use Compact when:

  • Your prompts contain natural language text
  • You are sending long documents or verbose prompts
  • You want maximum token savings on any input type
  • The $0.0001 per-use cost is justified by upstream token savings

Use Combined when:

  • Your prompts contain JSON data and you want maximum compression beyond what TOON alone achieves
  • You have large structured JSON contexts in RAG pipelines
  • The additional $0.0001 cost is justified by further savings on upstream tokens
  • You want the benefits of both structural optimization (TOON) and content-level compression (Compact)

Skip compression when:

  • Prompts are already short (under 200 characters)
  • Token costs are not a concern
  • You need exact text preservation (both methods aim to preserve semantic meaning, but neither guarantees the original wording)

Combining with Other Features

Compression works alongside all other LockLLM features:

  • Security scanning: Always runs on original uncompressed text
  • Custom policies: Evaluated on original text before compression
  • PII detection and redaction: PII is detected and optionally stripped first, then compression is applied to the (potentially redacted) text
  • Smart routing: Routing decisions are made independently of compression
  • Abuse detection: Evaluated on original text
  • Response caching: Cache keys include compression settings

Processing order in proxy mode:

Security scan -> PII redaction -> Prompt compression -> Forward to provider

Caching

Compression results are cached for 30 minutes. Identical inputs with the same compression method and rate return cached results, avoiding redundant processing. This is especially beneficial for the Compact method which involves an external ML service call.

Cache keys include the compression method, rate, and a hash of the input text. Different rates or methods produce separate cache entries.
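A hypothetical sketch of such a key (LockLLM's real key format is internal), showing why different methods or rates always land in separate cache entries:

```python
import hashlib

def cache_key(method: str, rate: float, text: str) -> str:
    """Illustrative cache key combining method, rate, and an input hash.
    Any change to method, rate, or input yields a distinct key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"compression:{method}:{rate}:{digest}"
```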

FAQ

Does compression affect security scanning?

No. Security scanning always runs on the original uncompressed text. Compression is applied after all security checks pass. This ensures attackers cannot use compression to bypass security detection.

What happens if compression fails?

Compression uses fail-open design. If the compression service times out (5-second limit for Compact) or encounters an error, the original uncompressed text is forwarded to your provider. Your request is never blocked due to compression failures.

Can I use TOON on non-JSON text?

TOON only works on valid JSON input. If your input is not valid JSON, the original text is returned unchanged and the compression_result field is omitted from the response. No error is returned; TOON handles non-JSON input gracefully.

Is compression applied to system prompts?

In proxy mode, compression is applied only to the last user message in the conversation. System prompts and assistant messages are not compressed.
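In terms of an OpenAI-style messages array, the compression target is selected like this (an illustrative sketch, not LockLLM's code):

```python
def last_user_index(messages):
    """Return the index of the last user message, the only message
    compression targets in proxy mode; None if there is no user message.
    System and assistant messages pass through untouched."""
    for i in range(len(messages) - 1, -1, -1):
        if messages[i].get("role") == "user":
            return i
    return None

msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up with a long document..."},
]
last_user_index(msgs)  # -> 3, only this message would be compressed
```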

How much can I save with compression?

Savings depend on your input type and the compression method:

  • TOON on JSON: 30-60% token reduction (FREE)
  • Compact on natural language: 30-70% token reduction depending on the rate setting ($0.0001 per use)

For example, if you send 1,000 requests with an average of 2,000 tokens per prompt and Compact achieves 50% compression:

  • Token savings: ~1,000,000 tokens saved
  • Compact cost: $0.10 (1,000 x $0.0001)
  • Upstream savings: varies by model (could be $0.01-$75 per 1M tokens depending on provider)
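The arithmetic above can be reproduced with a small helper (`compact_savings` is a hypothetical name; the provider price used below is an assumed example, not a quoted rate):

```python
def compact_savings(num_requests: int, avg_prompt_tokens: int,
                    compression: float, price_per_1m_input: float):
    """Back-of-envelope break-even for Compact. price_per_1m_input is a
    hypothetical provider input price in dollars per 1M tokens."""
    tokens_saved = num_requests * avg_prompt_tokens * compression
    compact_cost = round(num_requests * 0.0001, 4)  # $0.0001 per use
    upstream_savings = tokens_saved / 1_000_000 * price_per_1m_input
    return tokens_saved, compact_cost, upstream_savings

# The example above: 1,000 requests, 2,000 tokens each, 50% compression,
# at an assumed $2.50 per 1M input tokens:
tokens, cost, saved = compact_savings(1000, 2000, 0.5, 2.50)
# tokens -> 1_000_000.0, cost -> 0.1, saved -> 2.5
```

Compression pays for itself whenever `saved` exceeds `cost`, which happens quickly for long prompts on all but the cheapest models.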

Can I combine TOON and Compact?

Yes. Use "combined" as the compression method. Combined mode applies TOON first (converting JSON to compact notation), then runs Compact on the TOON output for further token reduction. For non-JSON input, TOON is skipped and only Compact runs. The cost is $0.0001 per use (same as Compact alone).

Does compression affect response quality?

TOON produces a compact notation that LLMs parse reliably: the data structure and all values are fully preserved. Compact uses ML-based token classification to preserve meaning while removing redundant tokens. If quality is a concern, use a higher compression rate (closer to 0.7) with the Compact method.

Is compression available with all providers?

Yes. Compression works with all 17+ supported providers in proxy mode. In the scan endpoint, compression works independently of any provider.

Updated 8 days ago