
Make attachment content loading properly asyncio-friendly

Open simonw opened this issue 6 months ago • 13 comments

In digging through the code I have a nasty suspicion that attachments are not fully asyncio safe - I think sometimes when an async model is executing a prompt it may make a blocking call to attachment.base64_content() which calls attachment.content_bytes() which calls a blocking httpx.get(). Fixing this will be tricky because it will involve fixing a bunch of existing plugins.

Originally posted by @simonw in https://github.com/simonw/llm/issues/1142#issuecomment-2926455362

simonw avatar Jun 01 '25 04:06 simonw

Here's the problem, illustrated using the default OpenAI plugin:

https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L751-L756

That's a blocking call to build_messages() which does this in two places:

https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L538-L539

https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L581-L585

Which uses this:

https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L430-L435

Which calls this in llm.Attachment:

https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/models.py#L86-L98

... and that's a blocking call!
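For anyone following along, the synchronous path looks roughly like this - a simplified sketch of the behaviour described above, not the exact code in llm/models.py:

```python
import base64
import httpx


class Attachment:
    # Simplified sketch of the current synchronous behaviour, not the real class.
    def __init__(self, type=None, path=None, url=None, content=None):
        self.type = type
        self.path = path
        self.url = url
        self.content = content

    def content_bytes(self):
        content = self.content
        if not content:
            if self.path:
                with open(self.path, "rb") as fp:
                    content = fp.read()
            elif self.url:
                # This is the blocking network call - if an async model reaches
                # here from build_messages() it stalls the whole event loop.
                content = httpx.get(self.url).content
            self.content = content
        return content

    def base64_content(self):
        return base64.b64encode(self.content_bytes()).decode("utf-8")
```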

simonw avatar Jun 01 '25 04:06 simonw

I see two potential fixes here.

  1. Go through that plugin and all the other plugins, find their equivalent of that build_messages() routine, find the places where they call attachment.base64_content(), and turn those into await calls against e.g. an await attachment.abase64_content() method or similar (see the sketch after this list). This is tough and involves a lot of plugin edits, including several plugins I do not own myself.
  2. Prior to calling await model.execute(prompt, ..., conversation, ...), do something async to both the prompt and the conversation objects that causes all of their attachments to be asynchronously pre-fetched, so that it's not a blocking call when we later access attachment.content_bytes().
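Here's roughly what option 1 means at a plugin call site - a hypothetical abase64_content() method, and a message shape only loosely based on the OpenAI plugin:

```python
async def build_messages(prompt):
    # Hypothetical async rewrite of a plugin's build_messages() equivalent:
    # every base64_content() call becomes an awaited abase64_content() call.
    messages = []
    for attachment in prompt.attachments:
        base64_data = await attachment.abase64_content()
        messages.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:{attachment.type};base64,{base64_data}"},
            }
        )
    return messages
```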

That second option is tempting - it requires far fewer code modifications to plugins (actually no modifications to plugins at all) and can be achieved entirely in core.

There's one huge catch. Some models (such as OpenAI) don't actually need to have the content fetched at all - they can pass the url field on an attachment directly to the model via the API, which saves us from having to do any fetching of content ourselves. But they can only do that for some attachment types (maybe OpenAI used to be able to do it for images but not for PDFs, though I think they fixed that).

I'm not 100% confident it's possible to pre-fetch URL-based attachments prior to calling await model.execute(...) and get it entirely right - and getting it wrong means wasting a whole lot of memory if the attachments are large.
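For illustration, option 2 could be a helper in core along these lines - a hedged sketch that assumes attachments cache their bytes on a content attribute, as the sync code appears to:

```python
import asyncio
import httpx


async def prefetch_attachments(attachments):
    # Option 2 sketch: before calling await model.execute(...), fetch every
    # URL-based attachment concurrently so that later content_bytes() calls
    # are cache hits instead of blocking network IO.
    async with httpx.AsyncClient() as client:

        async def fetch(attachment):
            if attachment.url and not attachment.content:
                response = await client.get(attachment.url)
                response.raise_for_status()
                attachment.content = response.content

        await asyncio.gather(*(fetch(a) for a in attachments))
```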

simonw avatar Jun 01 '25 04:06 simonw

Maybe that second option could work with an extra rule: There's an upper size limit on the amount of content that can be pre-fetched for an asyncio attachment, potentially a configurable one. If you want to deal with really large files you need to use an underlying model that supports URLs, or you need to switch to sync-LLM and run that in a threadpool from your asyncio application. OR you need to have your own separate code that fetches attachments from their URLs and writes them to disk.
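That size-limit rule could bolt onto the pre-fetch sketch as a simple guard - the 20MB figure below is just a placeholder, not a decided default:

```python
import httpx

MAX_PREFETCH_BYTES = 20 * 1024 * 1024  # placeholder limit, would be configurable


async def fetch_with_limit(client: httpx.AsyncClient, attachment, limit=MAX_PREFETCH_BYTES):
    # Check Content-Length before downloading; anything over the cap is refused,
    # pointing the caller at the URL-capable model / sync-in-a-threadpool options.
    head = await client.head(attachment.url)
    length = int(head.headers.get("content-length") or 0)
    if length > limit:
        raise ValueError(
            f"{attachment.url} is {length} bytes, over the {limit} byte pre-fetch limit"
        )
    response = await client.get(attachment.url)
    response.raise_for_status()
    attachment.content = response.content
```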

simonw avatar Jun 01 '25 04:06 simonw

We are already committed to sucking a LOT of information into memory anyway, for those APIs that require us to base64 encode the entire file and send that to the API.

Some APIs are aware of this limitation already. Gemini has a 20MB size limit on files; if you want to process anything larger than that you have to upload the file separately using their files API and then send a reference to that file to their API. I've not yet figured out how to build that feature for llm-gemini:

  • https://github.com/simonw/llm-gemini/issues/19

simonw avatar Jun 01 '25 04:06 simonw

So the fix here could be to have the models themselves declare a supports_attachment_urls = True property which, if set, causes the asyncio mode to not bother trying to pre-download content from attachment URLs before calling the execute() method.

Or another option: there's an await model.prepare_attachments(prompt, conversation) method which is called automatically before await model.execute(...) - that prepare method knows if the attachments need to be fetched asynchronously or not (maybe via a supports_attachment_urls thing, or custom logic).

If something goes wrong - a call is made to attachment.content_bytes() that needs to download content when the attachment knows it is running in asyncio context - it could raise a noisy error.

This feels like it might just work.
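Put together, the model-level version might look something like this - names taken from the comments above, none of which exist in llm today:

```python
class ExampleAsyncModel:
    # Hypothetical model: declares whether its API accepts attachment URLs directly.
    supports_attachment_urls = False

    async def prepare_attachments(self, attachments):
        # Called by core before await model.execute(...). Models that can pass
        # URLs straight through skip the download; everything else pre-fetches,
        # so the existing sync build_messages() path no longer blocks.
        if self.supports_attachment_urls:
            return
        await prefetch_attachments(attachments)  # sketch from the earlier comment
```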

simonw avatar Jun 01 '25 04:06 simonw

Getting a second opinion from Gemini 2.5 Flash:

llm -f /tmp/llm.txt \
  -m gemini-2.5-flash-preview-05-20 \
  -f issue:https://github.com/simonw/llm/issues/1143 \
  -s 'Think hard about this problem and if the proposed solution makes sense, suggest alternatives, then suggest an implementation plan for if the proposed solution could work'

https://gist.github.com/simonw/6d007917183eebd32f7062c8746e3285 - 5 cents

Not a very useful answer - it didn't seem to understand my proposal to avoid having to rewrite build_messages() etc.

But it did make me think about an option 3: any time I need to fetch an attachment while in asyncio mode, do that in a thread pool. That's not a terrible option, to be honest. I should probably be doing that for reading multi-MB files off disk anyway.
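The thread pool version is tiny - something like:

```python
import asyncio


async def abase64_content(attachment):
    # Option 3 sketch: run the existing blocking method in a worker thread so the
    # event loop isn't stalled by httpx.get() or by reading a multi-MB file off disk.
    return await asyncio.to_thread(attachment.base64_content)
```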

simonw avatar Jun 01 '25 04:06 simonw

Flash also reminded me that there's another blocking call: the resolve_type() thing. Even if I'm passing a URL directly to a model that supports URLs, I may still need to do a head() request to check the content type, if it's an API that requires the content type rather than guessing it itself: https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/models.py#L72-L85
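An async version of that check could look roughly like this - assuming the attachment stores its content type on a type attribute, as the sync resolve_type() appears to:

```python
import httpx


async def aresolve_type(attachment):
    # Hedged sketch: even when the URL is passed straight through to the model,
    # an API that needs an explicit content type may force a HEAD request - which
    # also must not block the event loop.
    if attachment.type:
        return attachment.type
    async with httpx.AsyncClient() as client:
        response = await client.head(attachment.url)
    return response.headers.get("content-type")
```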

simonw avatar Jun 01 '25 04:06 simonw

Easiest way to write this is probably to have an await attachment.prepare() method which is called in asyncio context prior to .execute().

simonw avatar Jun 01 '25 04:06 simonw

The thread thing wouldn't help on its own, because an operation in a thread only unblocks the event loop if the caller uses await asyncio.to_thread(fn) - so I would still need to re-engineer build_messages() etc. to be async functions.

If I can get it to work, the approach where the attachments are prepared prior to calling execute() would definitely be a lot less disruptive.

simonw avatar Jun 01 '25 04:06 simonw

If I wasn't so keen to avoid rewriting a bunch of plugins, what would the solution look like?

I think it might be to have an AsyncAttachment class (similar to how there's AsyncResponse to accompany Response).

If your model is an async model, all attachments passed to it on both the prompt and the conversation are actually AsyncAttachments - you have to call await attachment.resolve_type() and await attachment.content_bytes() on them from your build_messages() equivalent methods.

Implementation detail: I could have a method on the regular llm.Attachment() class called something like attachment.to_async() which simply returns an AsyncAttachment with the same path / URL / etc properties.
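A rough sketch of what that might look like - same field names as llm.Attachment, but nothing here is implemented:

```python
import base64
import httpx


class AsyncAttachment:
    # The expensive accessors become coroutines that a plugin's build_messages()
    # equivalent has to await.
    def __init__(self, type=None, path=None, url=None, content=None):
        self.type = type
        self.path = path
        self.url = url
        self.content = content

    async def resolve_type(self):
        # Path-based type sniffing omitted from this sketch.
        if self.type:
            return self.type
        async with httpx.AsyncClient() as client:
            response = await client.head(self.url)
        self.type = response.headers.get("content-type")
        return self.type

    async def content_bytes(self):
        if self.content is None:
            if self.path:
                with open(self.path, "rb") as fp:
                    self.content = fp.read()
            elif self.url:
                async with httpx.AsyncClient() as client:
                    response = await client.get(self.url)
                    response.raise_for_status()
                self.content = response.content
        return self.content

    async def base64_content(self):
        return base64.b64encode(await self.content_bytes()).decode("utf-8")


def to_async(attachment):
    # Hypothetical attachment.to_async(): copy the path / URL / etc properties
    # across without doing any network IO.
    return AsyncAttachment(
        type=attachment.type,
        path=attachment.path,
        url=attachment.url,
        content=attachment.content,
    )
```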

simonw avatar Jun 01 '25 04:06 simonw

Maybe I could even move in that direction - implement the workaround version first, but also have the AsyncAttachment mechanism and coach people to upgrade their plugins to use it.

Plugins could have a supports_async_attachments = False property which they flip to True when they want to start getting AsyncAttachment instances passed to them.
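Core could then pick which class to hand out based on that flag - a minimal sketch, using the to_async() helper from the previous comment:

```python
def attachments_for_model(model, attachments):
    # Plugins that set supports_async_attachments = True start receiving
    # AsyncAttachment instances; everything else keeps getting plain Attachments
    # (with the pre-fetch workaround applied on their behalf).
    if getattr(model, "supports_async_attachments", False):
        return [to_async(a) for a in attachments]
    return list(attachments)
```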

simonw avatar Jun 01 '25 04:06 simonw

Getting a second opinion on that from o4-mini:

llm -f /tmp/llm.txt \
  -m o4-mini \
  -o reasoning_effort high \
  -f issue:https://github.com/simonw/llm/issues/1143 \
  -s 'Think hard about this problem and if the proposed solution makes sense, suggest alternatives, then suggest an implementation plan for if the proposed solution could work'

https://gist.github.com/simonw/de64a8d90c2718fb79368fd7a50746d6 - not a bad answer - 16 cents.

simonw avatar Jun 01 '25 05:06 simonw

For the moment, I am going to mention this in the documentation as a restriction and link to this issue.

simonw avatar Jun 01 '25 19:06 simonw

@simonw I've added a proposal for this in https://github.com/simonw/llm/pull/1195.

Instead of implementing async versions of all the non-async methods, we can use await asyncio.to_thread(sync_function), which should be safe for network IO. It can be tested using pyleak, which helps detect blocked event loops.

deepankarm avatar Jul 02 '25 08:07 deepankarm