Best way to trace cached LLM and tool calls
Hi!
In our application/framework, we have a feature that caches HTTP calls to external APIs (such as OpenAI or various APIs used by tools). It does so by monkey-patching HTTP clients, such as requests, so that when a request is made it caches the response, and the next time it retrieves the response directly from the cache.
We want such behavior to be visible in Lunary traces, e.g. when an LLM call is made and the response is taken from cache instead of the API, an additional item is appended to the trace:
Is it possible to achieve this by calling track_event() from the cache-managing function? How would it determine its parent?
Can it, e.g., retrieve the ID of the parent from the headers of the HTTP request it's caching? We are using Lunary wrappers for all the entities that generate the HTTP calls we are caching, so there should always be a parent scope ID; the question is how to get it to the caching layer.
In fact, we'll be splitting the caching logic into a separate package soon, hoping for wider adoption, so it would be great to have the right pattern of integration with Lunary there from the start.
Here is the code where we think we can insert a track_event call: https://github.com/ShoggothAI/motleycrew/blob/main/motleycrew/caching/http_cache.py#L161
Entry point in case you need to run it: https://github.com/ShoggothAI/motleycrew/blob/main/examples/single_llama_index.py
Hi @whimo and @ZmeiGorynych,
I have updated our Python SDK to provide the latest run ID. You can use lunary.run_ctx.get() to retrieve it.
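For reference, here is a minimal sketch of how a caching layer might read that ID from inside a monitored call (the helper name and the LookupError fallback are assumptions, not part of the Lunary API):

import lunary

def current_run_id():
    # run_ctx is a context variable maintained by the Lunary SDK; inside a
    # monitored LLM or tool call it holds the ID of the run in flight.
    try:
        return lunary.run_ctx.get()
    except LookupError:
        # No active run, e.g. the HTTP call happened outside any Lunary wrapper.
        return None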
Regarding the implementation for cached calls on Lunary, we have decided to take a different approach. Instead of adding a child, we are considering displaying it like this:
(Please note that this design is not final)
In the latest version of the SDK, you can pass a metadata dictionary to track_event. To indicate a cached call, you can call track_event(..., metadata={"cache": True}) for the LLM end event. You can try this solution by upgrading lunary-py to version 1.0.10.
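A minimal sketch of that call, assuming you already emit the start/end events yourself (the run ID below is just a placeholder for whatever your instrumentation uses):

import uuid
import lunary

run_id = str(uuid.uuid4())  # placeholder; reuse the ID of your existing LLM run

lunary.track_event("llm", "start", run_id)
# ... call the model, or serve the response from cache ...
lunary.track_event("llm", "end", run_id, metadata={"cache": True})  # mark it as cached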
With this change, you will be able to see the metadata on the right:
If this approach works for you, we can implement the frontend today with a similar design as shown on the first screenshot above. Please let me know your thoughts on this.
This looks great, thank you! The only limitation I can think of is that we also support Langchain agents and use Lunary natively with them (no track_event calls). How shall we pass the metadata in this case?
Can you pass the metadata where you do the langchain invoke? It will be passed down to lunary automatically
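For example, something along these lines should work (the model class is illustrative; the key part is that metadata set in the standard LangChain config is forwarded to the Lunary callback handler):

from langchain_openai import ChatOpenAI  # illustrative model; any Runnable works
from lunary import LunaryCallbackHandler

llm = ChatOpenAI()

# The callback handler traces the run; metadata from the config is attached to it.
response = llm.invoke(
    "Hello!",
    config={"callbacks": [LunaryCallbackHandler()], "metadata": {"cache": True}},
)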
I've added the UI for cached calls:
Please let me know if you can make it work, especially with Langchain.
Thank you! We'll try it and get back to you
Hi @hughcrt, we ran into a problem with this approach which we didn't realize earlier. When an LLM call is made, we don't know beforehand if the cache will be hit, and moreover the agent (or LLM wrapper) has no info on whether the response was taken from cache. And I suspect that even if we could pass this info up the call stack, it would be painful and unreliable to implement it for every agent or LLM interaction library.
So we think that reporting cache hits directly from the caching manager via track_event() would be much more coherent and straightforward to implement. Regarding this,
- We still like the pretty UI you implemented for indicating caching. Is it possible to marry it with this approach, maybe by reporting an event of a special type and merging it with its parent under the hood? Of course, we can still do without it, but your suggestions on which particular event type we should report are very welcome.
- It seems a little hack-y to report the caching event start and immediately report its end via two separate track_event() calls. Again, it's not really a problem, but I believe it would be useful to support zero-time events.
Hi @whimo,
You only need to send the metadata in the track_event corresponding to the llm end event. You would know at that time if it was cached, right?
We could easily add support for zero-time events, but that would still require two calls: one for llm start, and one to update this event to say it was a cached event.
In my opinion, it's more straightforward to use llm end for that, especially because cache hits still take some time (0.01s in our example).
What do you think?
"You only need to send the metadata in the track_event corresponding to the llm end event. You would know at that time if it was cached, right?"
Technically we do know that, but I know of no simple way to pass this knowledge to the point where we fire the llm end or tool end event. The callbacks receive limited data that is framework-specific and AFAIK does not include low-level request info. Also, we would have to manually support this for the whole variety of tools and LLMs that we cache.
The approach that I propose is much more universal. For example,
| track_event(llm start)
|| <request to e.g. OpenAI API happens> // <- we don't control this part!
||| <cache hit occurs>
|||| track_event(cache start, parent=lunary.run_ctx.get())
|||| track_event(cache end) // would be great to merge these two calls into one
||| <cached response is returned>
|| <cached response is returned> // <- we don't control this part either
| track_event(llm end)
This would work regardless of the framework used and whether the events are fired manually or automatically e.g. via Langchain integration.
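Concretely, the caching manager's side of this could look roughly like the sketch below (the "tool"/"cache" naming and the parent_run_id / name keyword arguments are our assumptions about how such a child event would be reported):

import uuid
import lunary

def report_cache_hit():
    # ID of the enclosing llm/tool run, taken from Lunary's context variable.
    parent_run_id = lunary.run_ctx.get()
    run_id = str(uuid.uuid4())

    # Two calls today; a zero-time event would let us collapse them into one.
    lunary.track_event("tool", "start", run_id, parent_run_id=parent_run_id, name="cache")
    lunary.track_event("tool", "end", run_id)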
Hi @whimo, got it, I think we can do something like:
| track_event(llm start)
|| <request to e.g. OpenAI API happens> // <- we don't control this part!
||| <cache hit occurs>
|||| track_event(cache start, parent=lunary.run_ctx.get())
||| <cached response is returned>
|| <cached response is returned> // <- we don't control this part either
| track_event(llm end)
I'll let you know once the SDK has been updated with those changes.
You can now use lunary.track_event("llm", "update", lunary.run_ctx.get(), metadata={"cache": True}). You don't need to update the SDK; we handle everything on the latest version of the server.
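For instance, the cache manager's hit path can wrap this in a small helper (sketch only):

import lunary

def mark_run_as_cached():
    # run_ctx holds the ID of the llm (or tool) run whose HTTP request was
    # intercepted; the "update" event attaches the cache flag to that run.
    lunary.track_event("llm", "update", lunary.run_ctx.get(), metadata={"cache": True})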
Please reopen the issue if you have any problem making it work.
🔥 Looks like what we needed, thank you!
Hi @hughcrt, we ran into one more problem, so I'm reopening this as you instructed.
Callback handlers use their own event queue (defined here). When we call track_event from the cache manager, it uses Lunary's global queue by default (defined here).
So the start and end events go into the callback handler's queue, and the update events with cache=True go into the global one. It seems like the update events are not linked correctly to their parents because they end up in different queues (everything works if we track the start/end events manually using the global queue).
So we need to figure out how to use the same queue as the callback handler for reporting cache events. I see two possible ways:
- To make LunaryCallbackHandler use the global queue (either by reusing it here or by allowing a queue to be provided as a constructor argument).
- To store the queue in run_ctx so it can be accessed from our cache manager.
Please share your thoughts on this.
Oops, seems like I don't have the rights to reopen the issue, so I just hope you'll receive a notification 😄
Hi @whimo, I will have a look and get back to you with a solution 🙂
Hi @hughcrt, have you figured anything out? :)
Hi @whimo, sorry for the delay. In version 1.0.20, the callback handler now uses the global event queue. I will get back to you in a bit, when the update using context variables is ready.
@whimo, in version 1.0.21, the event queue is in a context var, so it should solve your problem. Let me know if that works for you 🙂
Hi @hughcrt, everything seems to be working as expected, thank you for your help!