[Feature] Prompt caching for Claude and Gemini, and complete message output in DSPy
What feature would you like to see?
The feature is mostly about improving prompt caching and debugging in DSPy. Here is an example script:
```python
import os

import dspy
import litellm

litellm.set_verbose = True
os.environ["LITELLM_LOG"] = "DEBUG"

# Configuration
config = {
    "model": {
        "max_tokens": 8192,
        "temperature": 0.2,
        "model_id": "litellm_proxy/openrouter/openai/gpt-4.1",
        "api_base": "https://llm.qure.ai/v1/",
        "litellm_api_key": os.environ["LITELLM_API_KEY"],  # Replace with your actual API key
    },
    "dspy": {
        "legal_agreement_analyzer": {
            "system": """You are an expert legal AI assistant specializing in contract analysis.
Given a complex legal agreement text, provide a comprehensive analysis of the document.
Focus on identifying the main purpose, parties involved, and overall structure of the agreement.
Your analysis should be clear, concise, and highlight the most important aspects of the document.""",
            "agreement": "The full text of the legal agreement to analyze",
            "analysis": "Comprehensive analysis of the legal agreement including main purpose, parties, and key structural elements",
        }
    },
}


# Simple DSPy Signature for Q&A
class SimpleQA(dspy.Signature):
    """You are an expert legal AI assistant specializing in contract analysis.
    Given a complex legal agreement text, provide a comprehensive analysis of the document.
    Focus on identifying the main purpose, parties involved, and overall structure of the agreement.
    Your analysis should be clear, concise, and highlight the most important aspects of the document."""

    context: str = dspy.InputField(desc="The full text of the legal agreement to analyze")
    question: str = dspy.InputField(desc="Question about the legal agreement")
    answer: str = dspy.OutputField(desc="Comprehensive analysis of the legal agreement")


# Simple Workflow Class
class SimpleWorkflow:
    def __init__(self):
        self.qa = dspy.Predict(SimpleQA)
        self.context = None

    def set_context(self, context):
        """Set the large context (like cached system message)"""
        self.context = context
        print(f"📄 Context set ({len(context)} characters)")

    def ask(self, question):
        """Ask a question about the context"""
        response = self.qa(context=self.context, question=question)
        return response.answer


# Setup Model
lm = dspy.LM(
    model=config["model"]["model_id"],
    api_base=config["model"]["api_base"],
    api_key=config["model"]["litellm_api_key"],
    max_tokens=config["model"]["max_tokens"],
    temperature=config["model"]["temperature"],
    cache=False,
    cache_in_memory=False,
)
dspy.configure(lm=lm)

# Initialize workflow
workflow = SimpleWorkflow()

# Large legal agreement context (like original notebook)
legal_text = """Here is the full text of a complex legal agreement""" * 400

# Set the large context (equivalent to cached system message)
workflow.set_context(legal_text)

# Ask questions (replicating original notebook flow)
print("\n🔍 Asking: What are the key terms and conditions in this agreement?")
response1 = workflow.ask("What are the key terms and conditions in this agreement?")
print("Response:", response1)
```
Based on this code, for the OpenAI API, the litellm debug output shows something like this:
"refusal": null, "role": "assistant", "annotations": null, "audio": null, "function_call": null, "tool_calls": null}}], "created": 1748522625, "model": "openai/gpt-4.1", "object": "chat.completion", "service_tier": null, "system_fingerprint": "fp_ccc6ec921d", "usage": {"completion_tokens": 458, "prompt_tokens": 4245, "total_tokens": 4703, "completion_tokens_details": {"accepted_prediction_tokens": null, "audio_tokens": null, "reasoning_tokens": 0, "rejected_prediction_tokens": null}, "prompt_tokens_details": {"audio_tokens": null, "cached_tokens": 4224}}, "provider": "OpenAI"}
So here you can see the cached tokens. The same cannot be said for Claude or Gemini, since those require messages to be in this format:
```python
{
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": "Here is the full text of a complex legal agreement" * 400,
        },
    ],
    "cache_control": {"type": "ephemeral"},
},
```
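For reference, the litellm prompt-caching docs (linked under Additional Context below) show roughly the following shape when calling Anthropic models directly. This is only a sketch based on those docs: the model name is illustrative and `cache_control` is placed on the content block as the docs show.

```python
import litellm

# Sketch based on the litellm prompt-caching docs; the model name is illustrative
# and `legal_text` is the large context from the script above.
response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": legal_text,                        # large, repeated context
                    "cache_control": {"type": "ephemeral"},    # marks this block as cacheable
                }
            ],
        },
        {"role": "user", "content": "What are the key terms and conditions in this agreement?"},
    ],
)
print(response.usage)  # cache hits should show up here (e.g. cached / cache-read input tokens)
```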
Is there any way to implement this in DSPy? Without it we cannot do prompt caching for those providers. Related to that: is there a hook we could implement to edit the messages list at the end, right before the call is made? I have implemented a function in the `BaseLM` class to see all the messages, since `inspect_history` only gives the messages as formatted text and that makes debugging very hard. Currently I view them by adding this to the `BaseLM` class:
```python
def inspect_history(self, n: int = 1):
    _inspect_history(self.history, n)

def inspect_entire_history(self):
    return self.history
```
and that returns this:
```python
{'prompt': None,
 'messages': [{'role': 'system',
               'content': <system msg not adding here since its too long>},
              {'role': 'user',
               'content': <user content>}],
 'kwargs': {},
 'response': ModelResponse(id='gen-1748522625-bdVnJTKRoZ358fSVMauZ', created=1748522625,
                           model='litellm_proxy/openai/gpt-4.1', object='chat.completion',
                           system_fingerprint='fp_ccc6ec921d',
                           choices=[Choices(finish_reason='stop', index=0, message=<message>)]),
 'outputs': <the output>,
 'usage': {'completion_tokens': 458,
           'prompt_tokens': 4245,
           'total_tokens': 4703,
           'completion_tokens_details': CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None),
           'prompt_tokens_details': PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=4224, text_tokens=None, image_tokens=None)},
 'cost': None,
 'timestamp': '2025-05-29T12:43:52.065077',
 'uuid': 'c9bbc4ca-bc8b-40f5-91c3-42fb64c8a83a',
 'model': 'litellm_proxy/openrouter/openai/gpt-4.1',
 'response_model': 'litellm_proxy/openai/gpt-4.1',
 'model_type': 'chat'}]
```
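Pulling the caching numbers and the raw messages out of that history entry (using the `inspect_entire_history` helper added above, which is my local workaround and not a DSPy API) then looks roughly like this:

```python
# Sketch only: relies on the local inspect_entire_history() helper shown above
# and on the history-entry layout printed above.
history = lm.inspect_entire_history()
last_call = history[-1]

print(last_call["messages"])                        # exact messages list sent to the provider
print(last_call["usage"]["prompt_tokens"])          # e.g. 4245
print(last_call["usage"]["prompt_tokens_details"])  # e.g. cached_tokens=4224 for OpenAI
```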
Would you like to contribute?
- [ ] Yes, I'd like to help implement this.
- [x] No, I just want to request it.
Additional Context
https://docs.litellm.ai/docs/completion/prompt_caching
There should be better ways of debugging than `inspect_history`, since right now I have to read through its output to figure out where the extra tokens are coming from and how to optimize them. My idea is something like this: DSPy formats the prompt and returns the entire messages JSON, the user can then edit that JSON if needed (depending on what other features the API call uses), and the edited messages are finally passed to a DSPy call to get the response:
```python
messages = dspy.LM(prompt)
# <the user can make changes to the messages JSON here>
response = dspy.call(messages)
```
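Concretely, a rough sketch of that flow with today's pieces might look like the following. This assumes `dspy.ChatAdapter().format(signature, demos, inputs)` returns the OpenAI-style messages list and that `dspy.LM` can be called with `messages=...` directly, as in recent DSPy versions; both may differ per version.

```python
# Sketch only: assumes ChatAdapter.format(...) and LM(messages=...) behave as in recent DSPy versions.
adapter = dspy.ChatAdapter()
messages = adapter.format(
    SimpleQA,
    demos=[],
    inputs={"context": legal_text, "question": "What are the key terms and conditions?"},
)

# The user edits the messages here, e.g. marking the big system prompt as cacheable
# for Claude/Gemini (cache_control placement per the litellm docs):
messages[0]["content"] = [
    {"type": "text", "text": messages[0]["content"], "cache_control": {"type": "ephemeral"}}
]

raw_outputs = lm(messages=messages)  # raw completions; output-field parsing is then up to the user
```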
@Akshay1-6180 Thanks for reporting the issue! I think you are looking for tracing support: https://dspy.ai/tutorials/observability/#tracing. Let us know if this helps with your use case!
Thanks, but ideally I would want to know the prompt I am sending beforehand, before it goes to the LLM, even with tracing. Currently, editing the messages list is very hard and requires me to make a lot of changes to the dspy package locally.
> before sending it to the llm even for tracing

This is also supported by MLflow tracing, which covers more than just the LLM request/response.
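A minimal setup for that, per the DSPy observability tutorial linked above (assuming a recent MLflow release where `mlflow.dspy.autolog` is available), would be roughly:

```python
import mlflow

mlflow.dspy.autolog()                                # traces DSPy calls, including the messages sent
mlflow.set_experiment("dspy-prompt-caching-debug")   # hypothetical experiment name

# ...run the workflow as usual; traces can then be browsed in the MLflow UI (`mlflow ui`)
```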