Best way to trace cached LLM and tool calls
Hi!
In our application/framework, we have a feature that caches HTTP calls to external APIs (such as OpenAI or various APIs used by tools). It does so by monkey-patching HTTP clients, such as requests, so that when a request is made it caches the response, and the next time it retrieves the response directly from the cache.
We want such behavior to be visible in Lunary traces, e.g. when an LLM call is made and the response is taken from cache instead of the API, an additional item is appended to the trace:
Is it possible to achieve this by calling track_event() from the cache-managing function? How would it determine its parent?
Can it, e.g., retrieve the ID of the parent from the headers of the HTTP request it's caching? We are using Lunary wrappers for all the entities that generate the HTTP calls we are caching, so there should always be a parent scope ID; the question is how to get it to the caching layer.
In fact, we'll be splitting the caching logic into a separate package soon, hoping for wider adoption, so it would be great to have the right pattern of integration with Lunary there from the start.
Here is the code where we think we can insert a track_event call: https://github.com/ShoggothAI/motleycrew/blob/main/motleycrew/caching/http_cache.py#L161
Entry point in case you need to run it: https://github.com/ShoggothAI/motleycrew/blob/main/examples/single_llama_index.py
Hi @whimo and @ZmeiGorynych,
I have updated our Python SDK to provide the latest run ID. You can use lunary.run_ctx.get() to retrieve it.
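For reference, here is a minimal sketch of how a caching layer might read that ID from inside a monitored call (the helper name and the LookupError fallback are assumptions, not part of the Lunary API):

import lunary

def current_run_id():
    # run_ctx is a context variable maintained by the Lunary SDK; inside a
    # monitored LLM or tool call it holds the ID of the run in flight.
    try:
        return lunary.run_ctx.get()
    except LookupError:
        # No active run, e.g. the HTTP call happened outside any Lunary wrapper.
        return None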
Regarding the implementation for cached calls on Lunary, we have decided to take a different approach. Instead of adding a child, we are considering displaying it like this:
(Please note that this design is not final)
In the latest version of the SDK, you can pass a metadata dictionary to track_event. To indicate a cached call, you can call track_event(..., metadata={"cache": True}) for the LLM end event. You can try this solution by upgrading lunary-py to version 1.0.10.
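A minimal sketch of that call, assuming you already emit the start/end events yourself (the run ID below is just a placeholder for whatever your instrumentation uses):

import uuid
import lunary

run_id = str(uuid.uuid4())  # placeholder; reuse the ID of your existing LLM run

lunary.track_event("llm", "start", run_id)
# ... call the model, or serve the response from cache ...
lunary.track_event("llm", "end", run_id, metadata={"cache": True})  # mark it as cached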
With this change, you will be able to see the metadata on the right:
If this approach works for you, we can implement the frontend today with a similar design as shown on the first screenshot above. Please let me know your thoughts on this.
This looks great, thank you! The only limitation I can think of is that we also support Langchain agents and use Lunary natively with them (no track_event calls). How shall we pass the metadata in this case?
Can you pass the metadata where you do the langchain invoke? It will be passed down to lunary automatically
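For example, something along these lines should work (the model class is illustrative; the key part is that metadata set in the standard LangChain config is forwarded to the Lunary callback handler):

from langchain_openai import ChatOpenAI  # illustrative model; any Runnable works
from lunary import LunaryCallbackHandler

llm = ChatOpenAI()

# The callback handler traces the run; metadata from the config is attached to it.
response = llm.invoke(
    "Hello!",
    config={"callbacks": [LunaryCallbackHandler()], "metadata": {"cache": True}},
)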
I've added the UI for cached calls:
Please let me know if you can make it work, especially with Langchain.
Thank you! We'll try it and get back to you
Hi @hughcrt, we ran into a problem with this approach which we didn't realize earlier. When an LLM call is made, we don't know beforehand if the cache will be hit, and moreover the agent (or LLM wrapper) has no info on whether the response was taken from cache. And I suspect that even if we could pass this info up the call stack, it would be painful and unreliable to implement it for every agent or LLM interaction library.
So we think that reporting cache hits directly from the caching manager via track_event() would be much more coherent and straightforward to implement. Regarding this,
- We still like the pretty UI you implemented for indicating caching. Is it possible to marry it with this approach, maybe by reporting an event of a special type and merging it with its parent under the hood? Of course, we can still do without it, but your suggestions on which particular event type we should report are very welcome.
- It seems a little hack-y to report the caching event start and immediately report its end via two separate track_event() calls. Again, it's not really a problem, but I believe it would be useful to support zero-time events.
Hi @whimo,
You only need to send the metadata in the track_event corresponding to the llm end event. You would know at that time if it was cached, right?
We could easily add support for zero-time events, but that would still require two calls: one for llm start, and one to update this event to say it was a cached event.
In my opinion, it's more straightforward to use llm end for that, especially because cache hits still take some time (0.01s in our example).
What do you think?
"You only need to send the metadata in the track_event corresponding to the llm end event. You would know at that time if it was cached, right?"
Technically we do know that, but I know of no simple way to pass this knowledge to the point where we fire the llm end or tool end event. The callbacks receive limited data that is framework-specific and AFAIK does not include low-level request info. Also, we would have to manually support this for the whole variety of tools and LLMs that we cache.
The approach that I propose is much more universal. For example,
| track_event(llm start)
|| <request to e.g. OpenAI API happens> // <- we don't control this part!
||| <cache hit occurs>
|||| track_event(cache start, parent=lunary.run_ctx.get())
|||| track_event(cache end) // would be great to merge these two calls into one
||| <cached response is returned>
|| <cached response is returned> // <- we don't control this part either
| track_event(llm end)
This would work regardless of the framework used and whether the events are fired manually or automatically e.g. via Langchain integration.
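Concretely, the caching manager's side of this could look roughly like the sketch below (the "tool"/"cache" naming and the parent_run_id / name keyword arguments are our assumptions about how such a child event would be reported):

import uuid
import lunary

def report_cache_hit():
    # ID of the enclosing llm/tool run, taken from Lunary's context variable.
    parent_run_id = lunary.run_ctx.get()
    run_id = str(uuid.uuid4())

    # Two calls today; a zero-time event would let us collapse them into one.
    lunary.track_event("tool", "start", run_id, parent_run_id=parent_run_id, name="cache")
    lunary.track_event("tool", "end", run_id)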
Hi @whimo, got it, I think we can do something like:
| track_event(llm start)
|| <request to e.g. OpenAI API happens> // <- we don't control this part!
||| <cache hit occurs>
|||| track_event(cache start, parent=lunary.run_ctx.get())
||| <cached response is returned>
|| <cached response is returned> // <- we don't control this part either
| track_event(llm end)
I'll let you know once the SDK has been updated with those changes.
You can now use lunary.track_event("llm", "update", lunary.run_ctx.get(), metadata={"cache": True}). You don't need to update the SDK; we handle everything on the latest version of the server.
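For instance, the cache manager's hit path can wrap this in a small helper (sketch only):

import lunary

def mark_run_as_cached():
    # run_ctx holds the ID of the llm (or tool) run whose HTTP request was
    # intercepted; the "update" event attaches the cache flag to that run.
    lunary.track_event("llm", "update", lunary.run_ctx.get(), metadata={"cache": True})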
Please reopen the issue if you have any problem making it work.
🔥 Looks like what we needed, thank you!
Hi @hughcrt, we ran into one more problem, so I'm reopening this as you instructed.
Callback handlers use their own event queue (defined here). When we call track_event from the cache manager, it uses Lunary's global queue by default (defined here).
So the start and end events go into the callback handler's queue, and the update events with cache=True go into the global one. It seems like the update events are not linked correctly to their parents because they end up in different queues (everything works if we track the start/end events manually using the global queue).
So we need to figure out how to use the same queue as the callback handler for reporting cache events. I see two possible ways:
- To make LunaryCallbackHandler use the global queue (either by reusing it here or by allowing a queue to be provided as a constructor argument).
- To store the queue in run_ctx so it can be accessed from our cache manager.
Please share your thoughts on this.
Oops, seems like I don't have the rights to reopen the issue, so I just hope you'll receive a notification 😄
Hi @whimo, I will have a look and get back to you with a solution 🙂
Hi @hughcrt, have you figured anything out? :)
Hi @whimo, sorry for the delay. In version 1.0.20, the callback handler now uses the global event queue. I will get back to you in a bit, when the update using context variables is ready.
@whimo, in version 1.0.21, the event queue is in a context var, so it should solve your problem. Let me know if that works for you 🙂
Hi @hughcrt, everything seems to be working as expected, thank you for your help!