Proposal: Embedding Loader plugins
Currently there are fragment loaders which allow you to create one or more fragments through a custom prefix "protocol." It seems like similar functionality could be implemented for embeddings: let me llm embed "url:https://foo.bar" or llm embed "readwise:articles" to pull in my bookmarks from Readwise Reader. Combining tool use with llm embed (see also https://github.com/daturkel/llm-tools-rag) has me very bullish on the possibilities of the embeddings db.
If you're on board, I'm happy to talk design and take a stab at implementing it.
This is a really interesting idea - why shouldn't llm embed accept URLs?
One option could be to directly port over the existing "fragments" concept to llm-embed - so anything you can access using llm -f ... would work with llm embed -f ... as well.
This would save on code and documentation and would mean that the existing fragment plugins all work too.
One point of potential confusion with this is that llm -f ... plugins can actually return multiple fragments. If that happens should those be concatenated together into a single embedding input or should they be treated as separate embedding requests? If separate, what should their stored IDs be?
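To make those two options concrete, here is a rough sketch in plain Python. Nothing here is llm's actual API: `fragment_hash`, `as_single_input` and `as_separate_inputs` are hypothetical helpers, and using a SHA-256 hex digest of the content as the stored ID is only an assumption for illustration.

```python
import hashlib


def fragment_hash(text: str) -> str:
    # Hypothetical ID scheme: hex SHA-256 digest of the content.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def as_single_input(fragments: list[str]) -> list[tuple[str, str]]:
    # Option 1: concatenate everything into one embedding input.
    combined = "\n".join(fragments)
    return [(fragment_hash(combined), combined)]


def as_separate_inputs(fragments: list[str]) -> list[tuple[str, str]]:
    # Option 2: one embedding request per fragment, each with its own ID.
    return [(fragment_hash(text), text) for text in fragments]
```

With option 1 a loader that returns two fragments produces one `(id, content)` pair; with option 2 it produces two, and each needs an ID that stays stable across runs.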
I'd welcome a prototype of llm embed -f ... -f ... that emulates the existing prompt fragments mechanism, complete with plugin support etc. It may be as simple as copying this across to llm embed:
https://github.com/simonw/llm/blob/abc4f473f44163b8f9d6017ac4166dee5b49538c/llm/cli.py#L420-L426
And then using this function somewhere: https://github.com/simonw/llm/blob/abc4f473f44163b8f9d6017ac4166dee5b49538c/llm/cli.py#L95-L163
Happy to give this a shot. I agree that in theory the interface for fragments and embeddings can really be the same thing, and that fragments (and their plugins) can be an all purpose interface for pulling content into llm.
Is there any importance to what the id of an embedding is? In cases where it's ambiguous, is there any harm to using the fragment hashes?
The purpose of the id is so you can say e.g. "find related content to ID 5" and get back a list of IDs that you can then look up in whatever other data store you are using.
Using content hashes for IDs isn't a terrible idea here.
Embeddings are often more useful if you store the content somewhere. The whole point of fragments is to avoid storing duplicate content in the database, so maybe this table grows a nullable fragment_id column that can be a foreign key to a fragment if the embedding was created using one?
https://github.com/simonw/llm/blob/abc4f473f44163b8f9d6017ac4166dee5b49538c/docs/embeddings/python-api.md#L199-L209
(Doesn't make sense for multiple fragments though, might need a new many-to-many table from fragments to embeddings for that.)
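For illustration only, the two shapes could look something like this in SQLite. The table and column names are simplified assumptions rather than the real llm schema, and `embedding_fragments` is a hypothetical name for the many-to-many table.

```python
import sqlite3


def create_schema(db: sqlite3.Connection) -> None:
    # Simplified, assumed schema: not the actual llm tables.
    db.executescript(
        """
        CREATE TABLE fragments (
            id INTEGER PRIMARY KEY,
            hash TEXT UNIQUE,
            content TEXT
        );
        CREATE TABLE embeddings (
            collection_id INTEGER,
            id TEXT,
            embedding BLOB,
            -- NULL when the embedding was not created from a fragment
            fragment_id INTEGER REFERENCES fragments(id),
            PRIMARY KEY (collection_id, id)
        );
        -- Alternative for embeddings built from several fragments
        CREATE TABLE embedding_fragments (
            collection_id INTEGER,
            embedding_id TEXT,
            fragment_id INTEGER REFERENCES fragments(id),
            position INTEGER,
            PRIMARY KEY (collection_id, embedding_id, fragment_id)
        );
        """
    )
```

The nullable column covers the one-fragment case; the join table is only needed if a single embedding can come from more than one fragment.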
That existing content_hash column exists to help us avoid re-embedding content that we have already stored a vector for, since API calls to embed content have a cost.
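Roughly, the idea is the following. This is a standalone sketch, not the code in llm itself: `HashCheckedStore` and `embed_fn` are made-up names, the store is just a dict, and MD5 is used here purely as an illustrative content hash.

```python
import hashlib


def content_hash(content) -> bytes:
    # Illustrative hash of the input; accepts str or bytes.
    if isinstance(content, str):
        content = content.encode("utf-8")
    return hashlib.md5(content).digest()


class HashCheckedStore:
    """Toy in-memory store that skips re-embedding unchanged content."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # stand-in for a paid embedding API call
        self.rows = {}  # id -> (content_hash, vector)
        self.api_calls = 0

    def embed(self, id, content):
        digest = content_hash(content)
        existing = self.rows.get(id)
        if existing is not None and existing[0] == digest:
            # Same ID and same content: reuse the stored vector.
            return existing[1]
        self.api_calls += 1
        vector = self.embed_fn(content)
        self.rows[id] = (digest, vector)
        return vector
```

Embedding the same ID with unchanged content a second time returns the stored vector without calling `embed_fn` again; changing the content triggers a fresh call.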
I'm not convinced it's worth having a fragment_id column on embeddings to reference a fragment rather than dumping that content in the existing content or content_blob columns. The embeddings database isn't actually designed to always be the same as the logging database - that's why the schema is documented separately on https://llm.datasette.io/en/stable/embeddings/python-api.html#sql-schema as opposed to being included on https://llm.datasette.io/en/stable/logging.html#sql-schema
I think the simplest version of this just adds support for a single -f/--fragment option to llm embed, uses the existing resolve_fragments() function to resolve that (which handles URLs and fragment loader plugins), and then throws an error if what comes back is either an attachment or more than one fragment. If it's a single fragment, that gets treated as regular input and stored in the content or content_blob column.
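The validation step might look something like this. It deliberately does not call the real resolve_fragments(): `FakeFragment` and `FakeAttachment` are stand-in types I made up so the sketch runs on its own, and `single_fragment_content` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class FakeFragment:
    # Stand-in for a resolved fragment (assumed shape).
    content: str
    source: str


@dataclass
class FakeAttachment:
    # Stand-in for a resolved attachment (assumed shape).
    path: str


def single_fragment_content(resolved: list) -> str:
    # Reject attachments outright: they cannot be embedded as text here.
    if any(isinstance(item, FakeAttachment) for item in resolved):
        raise ValueError("Attachments cannot be used as embedding input")
    # Require exactly one fragment so there is one unambiguous input.
    if len(resolved) != 1:
        raise ValueError(
            f"Expected exactly one fragment, got {len(resolved)}"
        )
    # The single fragment is then treated like regular input text.
    return resolved[0].content
```

The returned string would then go down the same path as content passed to llm embed directly.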
@daturkel absolutely go ahead and have a go at this.
And since you're spending a bunch of time in embeddings at the moment, I'd love to get your thoughts on this issue I just filed:
- #1085
Just started cooking on this here. Sadly -f is already taken for embed.
Should embed-multi also be supported?