Make attachment content loading properly asyncio-friendly
In digging through the code I have a nasty suspicion that attachments are not fully asyncio-safe - I think sometimes when an async model is executing a prompt it may make a blocking call to attachment.base64_content(), which calls attachment.content_bytes(), which calls a blocking httpx.get(). Fixing this will be tricky because it will involve fixing a bunch of existing plugins.
Originally posted by @simonw in https://github.com/simonw/llm/issues/1142#issuecomment-2926455362
Here's the problem, illustrated using the default OpenAI plugin:
https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L751-L756
That's a blocking call to build_messages() which does this in two places:
https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L538-L539
https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L581-L585
Which uses this:
https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/default_plugins/openai_models.py#L430-L435
Which calls this in llm.Attachment:
https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/models.py#L86-L98
... and that's a blocking call!
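To make the shape of the problem concrete, here's a stripped-down sketch (not the actual llm source, just an illustration of the pattern) of an async execute() ending up doing blocking network I/O on the event loop:

```python
import httpx


class Attachment:
    def __init__(self, url):
        self.url = url

    def content_bytes(self):
        # Synchronous network fetch - fine from sync code, but if it runs
        # inside a coroutine it stalls the entire event loop.
        return httpx.get(self.url).content


class AsyncModel:
    async def execute(self, prompt, attachments):
        # build_messages()-style logic: each attachment is downloaded
        # synchronously, blocking every other task while it runs.
        return [attachment.content_bytes() for attachment in attachments]
```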
I see two potential fixes here.
- Go through that plugin and all other plugins, find their equivalent of that build_messages() routine, find the bits where they do attachment.base64_content() and turn those into await calls, then build e.g. an await attachment.abase64_content() method or similar. This is tough and involves a lot of plugin edits, including several I do not own myself.
- Prior to calling await model.execute(prompt, ..., conversation, ...), do something async to both the prompt and the conversation objects that causes all of their attachments to be asynchronously pre-fetched, such that it's not a blocking call when we later access attachment.content_bytes().
That second option is tempting - it requires far fewer code modifications to plugins (actually no modifications to plugins at all) and can be achieved entirely in core.
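A minimal sketch of what that pre-fetch step might look like in core - assuming the Prompt/Conversation attribute names, and assuming Attachment caches downloaded bytes on a content attribute that content_bytes() checks first; the prefetch_attachments helper itself is hypothetical:

```python
import asyncio

import httpx


async def prefetch_attachments(prompt, conversation):
    # Hypothetical helper: collect every attachment on the prompt and the
    # prior conversation, then download their content concurrently so later
    # calls to attachment.content_bytes() just return cached bytes.
    attachments = list(prompt.attachments)
    if conversation is not None:
        for response in conversation.responses:
            attachments.extend(response.prompt.attachments)

    async with httpx.AsyncClient() as client:

        async def fetch(attachment):
            if attachment.url and attachment.content is None:
                response = await client.get(attachment.url)
                response.raise_for_status()
                attachment.content = response.content  # cache for content_bytes()

        await asyncio.gather(*(fetch(a) for a in attachments))
```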
There's one huge catch. Some models (such as OpenAI's) don't actually need to have the content fetched at all - they can pass the url field on an attachment directly to the model via the API, which saves us from having to do any fetching of content ourselves. But they can only do that for some attachment types (maybe OpenAI used to be able to do it for images but not for PDFs, though I think they fixed that).
I'm not 100% confident it's possible to pre-fetch URL-based attachments prior to calling await model.execute(...) and get it entirely right - and getting it wrong means wasting a whole lot of memory if the attachments are large.
Maybe that second option could work with an extra rule: there's an upper size limit on the amount of content that can be pre-fetched for an asyncio attachment, potentially a configurable one. If you want to deal with really large files you need to use an underlying model that supports URLs, or you need to switch to the sync LLM and run it in a thread pool from your asyncio application, or you need to have your own separate code that fetches attachments from their URLs and writes them to disk.
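One way that cap could be enforced during the pre-fetch - fetch_with_limit and the 20MB default here are hypothetical, illustrative choices:

```python
import httpx


async def fetch_with_limit(url: str, max_bytes: int = 20 * 1024 * 1024) -> bytes:
    # Stream the download and refuse to buffer more than max_bytes in memory.
    chunks, total = [], 0
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            async for chunk in response.aiter_bytes():
                total += len(chunk)
                if total > max_bytes:
                    raise ValueError(
                        f"Attachment at {url} exceeds the {max_bytes} byte pre-fetch limit"
                    )
                chunks.append(chunk)
    return b"".join(chunks)
```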
We are already committed to sucking a LOT of information into memory anyway, for those APIs that require us to base64 encode the entire file and send that to the API.
Some APIs are aware of this limitation already. Gemini has a 20MB size limit on files; if you want to process anything larger than that you have to upload the file separately using their files API and then send a reference to that file to their API. I've not yet figured out how to build that feature for llm-gemini:
- https://github.com/simonw/llm-gemini/issues/19
So the fix here could be to have the models themselves declare a supports_attachment_urls = True property which, if set, causes the asyncio mode to not bother trying to pre-download content from attachment URLs before calling the execute() method.
Or another option: there's an await model.prepare_attachments(prompt, conversation) method which is called automatically before await model.execute(...) - that prepare method knows if the attachments need to be fetched asynchronously or not (maybe via a supports_attachment_urls thing, or custom logic).
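Roughly, that hook could look like this - prepare_attachments and supports_attachment_urls are the proposed names from this issue, not shipped APIs, and prefetch_attachments is the hypothetical helper sketched above:

```python
class AsyncModel:
    # Proposed flag: True means the API accepts URLs directly, so core
    # should skip pre-downloading attachment content.
    supports_attachment_urls = False

    async def prepare_attachments(self, prompt, conversation):
        # Default behaviour; a model with custom needs could override this.
        if not self.supports_attachment_urls:
            await prefetch_attachments(prompt, conversation)


# Core would then call, in order:
#     await model.prepare_attachments(prompt, conversation)
#     await model.execute(prompt, ..., conversation, ...)
```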
If something goes wrong - a call is made to attachment.content_bytes() that needs to download content when the attachment knows it is running in asyncio context - it could raise a noisy error.
This feels like it might just work.
Getting a second opinion from Gemini 2.5 Flash:
```
llm -f /tmp/llm.txt \
  -m gemini-2.5-flash-preview-05-20 \
  -f issue:https://github.com/simonw/llm/issues/1143 \
  -s 'Think hard about this problem and if the proposed solution makes sense, suggest alternatives, then suggest an implementation plan for if the proposed solution could work'
```
https://gist.github.com/simonw/6d007917183eebd32f7062c8746e3285 - 5 cents
Not a very useful answer - it didn't seem to understand my proposal to avoid having to rewrite build_messages() etc.
But it did make me think about an option 3: any time I need to fetch an attachment in asyncio mode, do that in a thread pool. That's not a terrible option to be honest. I should probably be doing that for reading multi-MB files off disk anyway.
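In that option the fetch itself just gets pushed off the event loop - a sketch, where content_bytes_async is a hypothetical wrapper name:

```python
import asyncio


async def content_bytes_async(attachment):
    # Run the existing blocking content_bytes() (disk read or httpx.get())
    # in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(attachment.content_bytes)
```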
Flash also reminded me that there's another blocking call: the resolve_type() thing. Even if I'm passing a URL directly to a model that supports URLs I may still need to do a head() request to check the content type, if that's an API that requires the content type rather than guessing it itself: https://github.com/simonw/llm/blob/fd2f0ffbcd93176811ccc6fc0897b67f6bc570f3/llm/models.py#L72-L85
Easiest way to write this is probably to have an await attachment.prepare() method which is called in asyncio context prior to .execute().
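A sketch of that prepare() idea, covering both the content fetch and the resolve_type() HEAD request - the method name, the simplified fields and the content-type handling are all assumptions, not the real implementation:

```python
from dataclasses import dataclass
from typing import Optional

import httpx


@dataclass
class Attachment:
    # Simplified stand-in for llm.Attachment's fields
    url: Optional[str] = None
    path: Optional[str] = None
    type: Optional[str] = None
    content: Optional[bytes] = None

    async def prepare(self):
        # Hypothetical: called from asyncio context before execute(), so the
        # sync accessors (resolve_type(), content_bytes()) never need to block.
        if not self.url:
            return
        async with httpx.AsyncClient() as client:
            if self.type is None:
                # Async replacement for the blocking HEAD request in resolve_type()
                head = await client.head(self.url)
                self.type = head.headers.get("content-type")
            if self.content is None:
                response = await client.get(self.url)
                response.raise_for_status()
                self.content = response.content
```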
The thread thing wouldn't help, because an operation in a thread only unblocks the event loop if you await asyncio.to_thread(fn) - so I would still need to re-engineer build_messages() etc to be async functions.
If I can get it to work, the approach where the attachments are prepared prior to calling execute() would definitely be a lot less disruptive.
If I wasn't so keen to avoid rewriting a bunch of plugins, what would the solution look like?
I think it might be to have an AsyncAttachment class (similar to how there's AsyncResponse to accompany Response).
If your model is an async model, all attachments passed to it on both the prompt and the conversation are actually AsyncAttachments - you have to call await attachment.resolve_type() and await attachment.content_bytes() on them from your build_messages() equivalent methods.
Implementation detail: I could have a method on the regular llm.Attachment() class called something like attachment.to_async() which simply returns an AsyncAttachment with the same path / URL / etc properties.
Maybe I could even move in that direction - implement the workaround version first, but also have the AsyncAttachment mechanism and coach people to upgrade their plugins to use it.
Plugins could have a supports_async_attachments = False property which they flip to True when they want to start getting AsyncAttachment instances passed to them.
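Sketching what that might look like - AsyncAttachment, to_async() and supports_async_attachments are the proposed names from above, none of which exist yet:

```python
import httpx


class AsyncAttachment:
    def __init__(self, type=None, path=None, url=None, content=None):
        self.type = type
        self.path = path
        self.url = url
        self.content = content

    async def resolve_type(self):
        if self.type is None and self.url:
            async with httpx.AsyncClient() as client:
                head = await client.head(self.url)
                self.type = head.headers.get("content-type")
        return self.type

    async def content_bytes(self):
        # Path-based attachments could be read via asyncio.to_thread() here;
        # this sketch only covers the URL case.
        if self.content is None and self.url:
            async with httpx.AsyncClient() as client:
                response = await client.get(self.url)
                response.raise_for_status()
                self.content = response.content
        return self.content


# On the existing sync class:
#
#     def to_async(self):
#         return AsyncAttachment(self.type, self.path, self.url, self.content)
#
# Core would only hand AsyncAttachment instances to plugins that declare
# supports_async_attachments = True.
```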
Getting a second opinion on that from o4-mini:
```
llm -f /tmp/llm.txt \
  -m o4-mini \
  -o reasoning_effort high \
  -f issue:https://github.com/simonw/llm/issues/1143 \
  -s 'Think hard about this problem and if the proposed solution makes sense, suggest alternatives, then suggest an implementation plan for if the proposed solution could work'
```
https://gist.github.com/simonw/de64a8d90c2718fb79368fd7a50746d6 - not a bad answer - 16 cents.
For the moment, I am going to mention this in the documentation as a restriction and link to this issue.
@simonw I've added a proposal for this in https://github.com/simonw/llm/pull/1195.
Instead of implementing an async version of every non-async method, we can use await asyncio.to_thread(sync_function), which should be safe for network IO. It can be tested using pyleak, which helps detect blocked event loops.
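For example, a rough sketch of that idea (not the actual diff in the PR; the class name and signature are simplified): the async model's execute() hands the plugin's existing synchronous build_messages() off to a worker thread:

```python
import asyncio


class AsyncChat:
    async def execute(self, prompt, stream, response, conversation=None):
        # build_messages() and the attachment fetches inside it stay
        # synchronous; running them via asyncio.to_thread() keeps the
        # event loop unblocked without rewriting any plugin code.
        messages = await asyncio.to_thread(
            self.build_messages, prompt, conversation
        )
        ...
```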