Deployment of a compiled program
Currently, compiled programs are not async, so they cannot be served efficiently from a Python server. It would be useful to merge the PRs that aim to add async support across the dspy library.
This could also involve adding nurseries (structured-concurrency task groups) so that an ensemble of requests can be awaited simultaneously.
Thanks @sutyum. Is the main target here serving queries in parallel?
Currently we do this with threading; DSPy is thread-safe. Does async offer additional benefits for you?
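For reference, here is a minimal sketch of that threading approach using only the standard library; compiled_program below is a stand-in stub, not a real DSPy module:

import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a compiled DSPy module; since DSPy modules are thread-safe,
# one instance can be shared across worker threads.
def compiled_program(question):
    time.sleep(1.0)  # simulates waiting on the LM provider
    return f"answer to: {question}"

questions = [f"question {i}" for i in range(16)]

# 16 IO-bound calls finish in roughly 2 seconds with 8 workers,
# instead of ~16 seconds sequentially.
with ThreadPoolExecutor(max_workers=8) as pool:
    predictions = list(pool.map(compiled_program, questions))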
Serving programs
Threads vs Asyncio
LM programs spend most of their execution time waiting for responses from other machines: they are IO-heavy rather than compute-heavy. Async IO tends to perform particularly well when a large chunk of execution time is spent waiting; rather than busy-waiting, an async executor can carry out other tasks in the meantime. There are also limits on how many OS threads can be created on a given CPU, which is far fewer than the number of requests one could serve if each request only started a lightweight asyncio task (a green thread).
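To illustrate the point, here is a minimal asyncio sketch with asyncio.sleep standing in for the network wait of an LM call; no DSPy or server code is involved:

import asyncio

async def lm_call(i):
    # Stand-in for an LM request: the task is suspended while "waiting",
    # so the event loop can run thousands of these on a single OS thread.
    await asyncio.sleep(1.0)
    return f"response {i}"

async def main():
    # 1000 concurrent "requests" complete in about 1 second total,
    # without creating 1000 OS threads.
    return await asyncio.gather(*(lm_call(i) for i in range(1000)))

results = asyncio.run(main())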
Compilation vs Execution
Also worth considering is the case of LM programs that run for a very long time (hours, days, months); such scenarios would benefit from other forms of distributed execution. For instance, SuperAGI, an agent orchestration project, uses a message broker to break the LM call DAG into a workflow, with each call executed in a distributed manner.
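As a rough sketch of the broker idea (ignoring dependencies between steps for brevity), the snippet below uses an in-process queue.Queue as a stand-in for a real message broker such as RabbitMQ or Redis; in production each worker could be a separate process or machine:

import queue
import threading

broker = queue.Queue()   # stand-in for a real message broker
results = {}

def lm_step(name, payload):
    # Placeholder for one LM call in the DAG.
    return f"{name} processed {payload}"

def worker():
    while True:
        task = broker.get()
        if task is None:  # shutdown signal
            break
        name, payload = task
        results[name] = lm_step(name, payload)
        broker.task_done()

workers = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for w in workers:
    w.start()

# Publish the nodes of the (flattened) LM call DAG as independent tasks.
for step in ["summarize", "extract_entities", "draft_answer"]:
    broker.put((step, "some input"))

broker.join()
for _ in workers:
    broker.put(None)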
Bring your own executor
It seems we are still on the lookout for a flexible execution model for such compiled programs. Just putting my thoughts here to continue the discussion on this open question.
Do we need this?
flowchart LR
D[Dspy program] --> C[DAG of compiled programs]
C --> E[Bring your own executor]
One idea that was brought up recently was to compile with dspy and execute with a separate executor system (such as langchain). This sort of approach could be useful to keep dspy focused on LM programming primitives and constructs rather than the various choices one can make for execution.
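To make the "DAG of compiled programs" box concrete, here is a hypothetical sketch of what such an exported artifact could look like: a plain data structure of named steps, each holding an optimized prompt template and its dependencies, that any executor (langchain, a workflow engine, a hand-rolled loop) could consume. None of these names are DSPy APIs:

from dataclasses import dataclass, field

@dataclass
class CompiledStep:
    # One node of the compiled program: an optimized prompt template
    # plus the upstream steps whose outputs it consumes.
    name: str
    prompt_template: str
    depends_on: list = field(default_factory=list)

# Hypothetical export of a two-step compiled program.
dag = [
    CompiledStep("retrieve_context", "Given the question: {question}\nList relevant facts."),
    CompiledStep(
        "answer",
        "Context: {retrieve_context}\nQuestion: {question}\nAnswer concisely.",
        depends_on=["retrieve_context"],
    ),
]

def execute(dag, inputs, call_lm):
    # call_lm is whatever executor you bring: an openai call, a langchain
    # chain, a task-queue submission, and so on.
    outputs = dict(inputs)
    for step in dag:  # assumes the list is already topologically sorted
        prompt = step.prompt_template.format(**outputs)
        outputs[step.name] = call_lm(prompt)
    return outputs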
This sounds super cool, similar to #338. I think the broader question is "how does DSPy fit into the productionization workflow", and that's something we can think more about to come up with an elegant approach.
Posting here so I get notified of updates. I'd be interested in getting compilation to run on something like Hamilton.
@CyrusOfEden Could sglang as the executor be all that we need?
@sutyum how do you imagine that working?
@CyrusOfEden Could sglang as the executor be all that we need?
That doesn't sound all that useful to me.
Deploying a "compiled" dspy program to me requires publishing a graph comprised of the optimized prompts generated. Then you can take that and convert it into whatever framework you want.
So you just take the compiled prompts from dspy and run them via, for example, the openai lib?
As a first target that would be great!
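A hedged sketch of that first target: treat the optimized prompt pulled out of a compiled program as a plain string (hard-coded below for illustration) and serve it with the openai client directly, with no DSPy code at serving time:

from openai import OpenAI

# Pretend this string was extracted from a compiled DSPy program's saved
# state (optimized instructions plus few-shot demos rendered out).
compiled_prompt = (
    "You are a concise QA assistant.\n"
    "Question: {question}\n"
    "Answer:"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def serve(question):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        messages=[{"role": "user", "content": compiled_prompt.format(question=question)}],
    )
    return response.choices[0].message.content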
So langchain is the way forward. Also, when we say a graph, how about using this: https://topoteretes.github.io/cognee/
cognee ("Deterministic LLM Outputs for AI Engineers"): an open-source framework for loading and structuring LLM context to create accurate and explainable AI solutions using knowledge graphs and vector stores.
Here's how LangChainPredict and LangChainModule could be enhanced to support streaming and tracing:
Streaming Support
class LangChainPredict(Predict):
    def forward(self, **kwargs):
        stream_output = kwargs.pop("stream_output", False)
        # prompt and signature are assumed to be built earlier in forward(),
        # exactly as in the existing implementation.
        if stream_output:
            # Ask the LangChain LLM to stream and wrap the chunk iterator.
            output = self.langchain_llm.invoke(prompt, streaming=True)
            return StreamedPrediction(output, signature=signature)
        else:
            output = self.langchain_llm.invoke(prompt)
            return Prediction.from_completions(output, signature=signature)
Changes:
- Add a stream_output argument to forward()
- Pass streaming=True to the LangChain LLM when stream_output is set
- Return StreamedPrediction instead of Prediction for streaming output
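StreamedPrediction is not an existing DSPy class; a minimal sketch of what it could look like, assuming the LangChain LLM returns an iterator of text chunks when streaming is enabled:

class StreamedPrediction:
    """Hypothetical wrapper that yields chunks as they arrive and
    accumulates the full completion for later field parsing."""

    def __init__(self, chunks, signature=None):
        self._chunks = chunks         # iterator of text chunks from the LLM
        self.signature = signature
        self.completion = ""

    def __iter__(self):
        for chunk in self._chunks:
            self.completion += chunk  # keep the full text for parsing
            yield chunk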
Tracing Support
import logging

logger = logging.getLogger(__name__)

class LangChainPredict(Predict):
    def forward(self, **kwargs):
        enable_tracing = kwargs.pop("enable_tracing", False)
        # prompt and signature are assumed to be built earlier in forward(),
        # as in the existing implementation; set_tracing()/get_trace() are
        # hypothetical hooks on the wrapped LangChain LLM.
        if enable_tracing:
            # Enable tracing in LangChain
            self.langchain_llm.set_tracing(True)
        output = self.langchain_llm.invoke(prompt)
        if enable_tracing:
            # Access and log the trace
            trace = self.langchain_llm.get_trace()
            logger.debug(f"LangChain Trace: {trace}")
        return Prediction.from_completions(output, signature=signature)
Changes:
- Add an enable_tracing argument to forward()
- Enable tracing on the LangChain LLM when enable_tracing is set
- Retrieve and log the trace after invoking the LLM
The LangChainModule class can expose these same options and pass them through to its underlying LangChainPredict instances.
With these enhancements, DSPy programs using LangChain components will be able to leverage streaming and tracing capabilities, enabling better observability and interactivity in production deployments.
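A hypothetical usage example, assuming the flags above are plumbed through LangChainModule as described (module construction is elided; both kwargs are proposed, not existing, options):

# module is assumed to be a LangChainModule built on the proposed LangChainPredict.
for chunk in module(question="What is DSPy?", stream_output=True):
    print(chunk, end="", flush=True)   # stream tokens to the client as they arrive

traced = module(question="What is DSPy?", enable_tracing=True)  # trace is logged at DEBUG level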
so langchain is the way forward
I think it's just 'a' way, not 'the' way. ;)
so you just take compiled prompts from dspy and run them via for example openai lib ?
Dspy supports several other objects in its graphs, which I think makes this a little trickier. How do you encapsulate a retrieval model in the compiled prompts, for example?
Posting here so I get notified on updates! I would love it if we could get compilation running on something like Dagster.
@sarora-roivant wanna send me a DM on LinkedIn? URL in bio
Just sent you a connection request.