
Optimize for OpenAI Prompt Caching: Restructure entity extraction prompts for 50% cost reduction and faster indexing

Open · adorosario opened this issue 1 month ago • 2 comments

Summary

OpenAI introduced automatic prompt caching in October 2024 for GPT-4o, GPT-4o-mini, o1-preview, and o1-mini models. This feature provides a 50% discount on cached prompt tokens and faster processing times for prompts of 1,024 tokens or longer.

However, LightRAG's current prompt structure prevents effective caching during indexing, missing a significant opportunity to reduce costs and improve indexing latency.

The Problem

Current Prompt Structure

In lightrag/operate.py:2807-2820, the entity extraction system prompt embeds variable content (input_text) directly into the system message:

```python
entity_extraction_system_prompt = PROMPTS[
    "entity_extraction_system_prompt"
].format(**{**context_base, "input_text": content})
```

This creates a system prompt that looks like:

```
---Role--- (static, ~100 tokens)
---Instructions--- (static, ~400 tokens)
---Examples--- (static, ~800 tokens)
---Real Data to be Processed---
<Input>
Entity_types: [static during indexing run]
Text:

{input_text} ← THIS CHANGES FOR EVERY CHUNK ❌
```

Why This Prevents Caching

OpenAI's prompt caching works by caching the **longest shared prefix** of prompts. Since `input_text` is embedded at the end of the system prompt, every chunk creates a completely different system prompt string. There is no shared prefix across chunks, so **nothing gets cached**.

Reference

From the prompt template in `lightrag/prompt.py:11-69`:

```python
PROMPTS["entity_extraction_system_prompt"] = """---Role---
...
---Real Data to be Processed---
<Input>
Entity_types: [{entity_types}]
Text:

{input_text} # Variable content embedded in system prompt

"""

The Solution

Restructure Prompts for Caching

To leverage OpenAI's automatic prompt caching, the prompts should be restructured:

Optimal structure:

  • System message: Static instructions + examples + entity types (~1300 tokens, cacheable!)
  • User message: Just the variable input_text (~150 tokens per chunk)

This would allow the ~1300 token system message to be cached and reused for ALL chunks during an indexing run, with only the small user message varying.
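At the API level, the proposed split corresponds to requests like the sketch below. The client usage follows the standard OpenAI Python SDK; the model name, prompt text, and function name are placeholders rather than LightRAG's actual values:

```python
from openai import OpenAI  # standard OpenAI Python SDK

client = OpenAI()

# The static block (role, instructions, examples, entity types, ~1300 tokens)
# must be byte-identical across requests so it forms the shared, cacheable prefix.
STATIC_SYSTEM_PROMPT = "---Role---\n...static instructions, examples, entity types..."

def extract_chunk(chunk_text: str) -> str:
    """Send one chunk for extraction; only the user message varies per call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any caching-enabled model works
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cached after the first call
            {"role": "user", "content": chunk_text},              # ~150 tokens, varies per chunk
        ],
    )
    return response.choices[0].message.content
```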

Proposed Changes

  1. Split the system prompt template (lightrag/prompt.py):

    • Remove {input_text} from entity_extraction_system_prompt
    • Keep only the static instructions, examples, and entity types
  2. Modify the user prompt template:

    • Make entity_extraction_user_prompt contain the variable input_text
  3. Update the extraction logic (lightrag/operate.py):

    • Format system prompt once (without input_text)
    • Format user prompt with input_text for each chunk (see the sketch after the example templates below)

Example Restructured Template

```python
PROMPTS["entity_extraction_system_prompt"] = """---Role---
You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the input text.

---Instructions---
[... all the static instructions ...]

---Examples---
[... all the examples ...]

---Entity Types---
Entity_types: [{entity_types}]
"""

PROMPTS["entity_extraction_user_prompt"] = """---Task---
Extract entities and relationships from the following input text.

---Input Text---

{input_text}

---Output---
"""
```

Expected Impact

Cost Savings

For a typical indexing run of 8,000 chunks:

  • Current: ~1,450 tokens × 8,000 chunks = ~11.6M prompt tokens (all counted as new)
  • With caching: ~1,450 tokens (first chunk) + ~150 tokens × 7,999 chunks = ~1.2M new prompt tokens + ~10.4M cached tokens (50% discount)
  • Result: ~45% cost reduction on prompt tokens during indexing
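The ~45% figure follows directly from the 50% discount on the cached portion; a quick back-of-the-envelope check using the same rough token estimates as above (not measured values):

```python
# Rough estimates from the bullets above.
chunks = 8_000
full_prompt = 1_450     # static block + one chunk of input text
static_prefix = 1_300   # cacheable system message
chunk_part = 150        # variable user message per chunk

without_caching = full_prompt * chunks                 # ~11.6M tokens, all at full price
new_tokens = full_prompt + chunk_part * (chunks - 1)   # ~1.2M tokens at full price
cached_tokens = static_prefix * (chunks - 1)           # ~10.4M tokens at 50% discount

effective = new_tokens + 0.5 * cached_tokens
print(f"prompt-token cost reduction: {1 - effective / without_caching:.0%}")  # -> 45%
```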

Latency Improvements

  • Cached prompt tokens process significantly faster than new tokens
  • Reduces overall indexing time, especially for large document collections
  • More responsive during bulk upload operations

Automatic Activation

OpenAI's prompt caching is automatic for prompts of 1,024 tokens or more:

  • No API changes required beyond restructuring prompts
  • Works with existing GPT-4o, GPT-4o-mini, o1-preview, o1-mini models
  • Cache persists 5-10 minutes (max 1 hour), perfect for batch indexing
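Whether caching actually engages can be confirmed from the API response itself: the usage block reports how many prompt tokens were served from the cache. A minimal sketch; the static prompt and model name are placeholders, and the cached_tokens field is only populated on caching-enabled models and recent SDK versions:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for the static block; it must exceed the 1,024-token threshold
# and stay identical across requests for cached_tokens to grow past zero.
STATIC_SYSTEM_PROMPT = "---Role---\n...static instructions, examples, entity types..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": "...one chunk of input text..."},
    ],
)

# Expect ~0 cached tokens on the first chunk and roughly the static-prefix size
# on subsequent chunks sent within the cache retention window.
details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)
```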

Additional Benefits

This optimization would:

  • ✅ Reduce prompt-token costs during indexing by ~45% for OpenAI users
  • ✅ Improve indexing latency significantly
  • ✅ Make LightRAG more cost-effective for large-scale deployments
  • ✅ Require minimal code changes
  • ✅ Work automatically without user configuration

Affected Files

  • lightrag/prompt.py - Prompt templates
  • lightrag/operate.py - Entity extraction logic (lines ~2807-2850)

Thank you for considering this optimization! Happy to provide more details or assist with implementation if helpful.

adorosario · Nov 14 '25

Excellent analysis and proposal! This optimization is exactly what's needed for production deployments. The 50% cost reduction through prompt caching is significant, especially for large-scale indexing operations.

Technical observations:

  1. Cache Hit Rate: With the proposed structure, the ~1300 token system message will have a 100% cache hit rate across all chunks (after the first), which is optimal.

  2. Latency improvements: Beyond cost, cached prompts typically show 2-3x faster response times, which will substantially speed up indexing.

  3. Implementation considerations:

    • Ensure the system message stays identical across calls to maximize cache hits
    • Consider adding cache control headers to explicitly mark cacheable content
    • Monitor cache hit rates in production to validate the optimization
  4. Backward compatibility: The proposed changes maintain the same output format, so this should be a drop-in replacement.

Suggested additions:

  • Add a configuration flag to enable/disable caching for users on different API tiers
  • Include benchmark results showing before/after timing
  • Document the minimum token threshold (1024) for caching to trigger

I'd be happy to help implement this or review a PR. This is a high-value optimization that will benefit all users.

shanto12 · Nov 15 '25

Correct me if I'm wrong here, but it looks like the caching concern is overstated. OpenAI’s prompt caching doesn’t require system messages to be identical, only that the token prefix is identical.

"Cache hits are only possible for exact prefix matches within a prompt." See https://platform.openai.com/docs/guides/prompt-caching#extended-prompt-cache-retention

Since LightRAG puts all static instructions and examples before the {input_text}, the shared prefix is still the entire static block, and that portion will be cached even though the variable text is in the same message. Splitting the prompt into separate system/user messages is cleaner but not required for caching under OpenAI’s guidelines.

xtfocus · Nov 25 '25