Proposal: Embedding Loader plugins

Open daturkel opened this issue 6 months ago • 9 comments

Currently there are fragment loaders which allow you to create one or more fragments through a custom prefix "protocol." It seems like similar functionality could be implemented for embeddings: let me llm embed "url:https://foo.bar" or llm embed "readwise:articles" to pull in my bookmarks from Readwise Reader. Combining tool use with llm embed (see also https://github.com/daturkel/llm-tools-rag) has me very bullish on the possibilities of the embeddings db.
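A rough sketch of how a prefix "protocol" could be dispatched for embedding input. Everything here is hypothetical: `LOADERS`, `resolve_embed_input` and the two stand-in loaders are illustrative names, not part of llm, and the loaders return placeholder strings rather than fetching anything.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a prefix to a loader function.
# These loaders are placeholders; real ones would fetch a URL, call an API, etc.
LOADERS: Dict[str, Callable[[str], str]] = {
    "url": lambda arg: f"<contents fetched from {arg}>",
    "readwise": lambda arg: f"<Readwise items for {arg}>",
}


def resolve_embed_input(value: str) -> str:
    """Return the text to embed for a raw command-line argument."""
    # Split on the first colon only, so "url:https://foo.bar" keeps its URL intact.
    prefix, sep, argument = value.partition(":")
    loader = LOADERS.get(prefix) if sep else None
    if loader is None:
        # No registered prefix: treat the whole argument as literal text.
        return value
    return loader(argument)
```

Text that merely contains a colon, such as "note: remember this", falls through unchanged because "note" is not a registered prefix.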

If you're on board, I'm happy to talk design and take a stab at implementing it.

daturkel avatar May 25 '25 20:05 daturkel

This is a really interesting idea - why shouldn't llm embed accept URLs?

One option could be to directly port over the existing "fragments" concept to llm-embed - so anything you can access using llm -f ... would work with llm embed -f ... as well.

This would save on code and documentation and would mean that the existing fragment plugins all work too.

One point of potential confusion with this is that llm -f ... plugins can actually return multiple fragments. If that happens should those be concatenated together into a single embedding input or should they be treated as separate embedding requests? If separate, what should their stored IDs be?
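To make the two options concrete, here is a minimal sketch that treats fragments as plain strings. The function names, the newline separator and the choice of SHA-256 hex digests as IDs are all assumptions for illustration, not existing llm behavior.

```python
import hashlib
from typing import Dict, List


def as_single_input(fragments: List[str]) -> str:
    """Option 1: join every fragment into one embedding input."""
    return "\n".join(fragments)


def as_separate_inputs(fragments: List[str]) -> Dict[str, str]:
    """Option 2: one embedding input per fragment, keyed by a content hash."""
    return {
        hashlib.sha256(text.encode("utf-8")).hexdigest(): text
        for text in fragments
    }
```

With hash-derived IDs, two fragments with identical content collapse into a single entry, which may or may not be the desired behavior.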

simonw avatar May 25 '25 20:05 simonw

I'd welcome a prototype of llm embed -f ... -f ... that emulates the existing prompt fragments mechanism, complete with plugin support etc. It may be as simple as copying this across to llm embed:

https://github.com/simonw/llm/blob/abc4f473f44163b8f9d6017ac4166dee5b49538c/llm/cli.py#L420-L426

And then using this function somewhere: https://github.com/simonw/llm/blob/abc4f473f44163b8f9d6017ac4166dee5b49538c/llm/cli.py#L95-L163

simonw avatar May 25 '25 20:05 simonw

Happy to give this a shot. I agree that in theory the interface for fragments and embeddings can really be the same thing, and that fragments (and their plugins) can be an all-purpose interface for pulling content into llm.

Is there any importance to what the id of an embedding is? In cases where it's ambiguous, is there any harm to using the fragment hashes?

daturkel avatar May 25 '25 20:05 daturkel

The purpose of the id is so you can say e.g. "find related content to ID 5" and get back a list of IDs that you can then look up in whatever other data store you are using.

Using content hashes for IDs isn't a terrible idea here.

Embeddings are often more useful if you store the content somewhere. The whole point of fragments is to avoid storing duplicate content in the database, so maybe this table grows a nullable fragment_id column that can be a foreign key to a fragment if the embedding was created using one?

https://github.com/simonw/llm/blob/abc4f473f44163b8f9d6017ac4166dee5b49538c/docs/embeddings/python-api.md#L199-L209

(Doesn't make sense for multiple fragments though, might need a new many-to-many table from fragments to embeddings for that.)
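A sketch of what such a many-to-many table might look like, using an in-memory SQLite database. The tables below are simplified stand-ins rather than the real llm schema, and `embedding_fragments`, `position` and `fragments_for` are hypothetical names.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins, not the real llm tables.
    CREATE TABLE fragments (
        id INTEGER PRIMARY KEY,
        content TEXT
    );
    CREATE TABLE embeddings (
        collection_id INTEGER,
        id TEXT,
        embedding BLOB,
        PRIMARY KEY (collection_id, id)
    );
    -- Hypothetical join table: one row per (embedding, fragment) pair.
    CREATE TABLE embedding_fragments (
        collection_id INTEGER,
        embedding_id TEXT,
        fragment_id INTEGER REFERENCES fragments(id),
        position INTEGER,
        PRIMARY KEY (collection_id, embedding_id, fragment_id),
        FOREIGN KEY (collection_id, embedding_id)
            REFERENCES embeddings(collection_id, id)
    );
    """
)

# One embedding built from two fragments.
conn.executemany(
    "INSERT INTO fragments (id, content) VALUES (?, ?)",
    [(1, "part one"), (2, "part two")],
)
conn.execute(
    "INSERT INTO embeddings (collection_id, id, embedding) VALUES (1, 'doc', x'00')"
)
conn.executemany(
    "INSERT INTO embedding_fragments VALUES (1, 'doc', ?, ?)",
    [(1, 0), (2, 1)],
)


def fragments_for(collection_id: int, embedding_id: str) -> List[str]:
    """Return the fragment contents behind one embedding, in stored order."""
    rows = conn.execute(
        """
        SELECT f.content
        FROM embedding_fragments ef
        JOIN fragments f ON f.id = ef.fragment_id
        WHERE ef.collection_id = ? AND ef.embedding_id = ?
        ORDER BY ef.position
        """,
        (collection_id, embedding_id),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `fragments_for(1, "doc")` then recovers the source text without the embeddings table storing a second copy of it.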

simonw avatar May 25 '25 20:05 simonw

That existing content_hash column exists to help us avoid re-embedding content that we have already stored a vector for, since API calls to embed content have a cost.
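The idea can be illustrated with a small in-memory stand-in. `HashCheckedStore` is a hypothetical class, the SHA-256 hash is an assumption, and the `embed` callable passed in stands for whatever paid API call produces the vector.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class HashCheckedStore:
    """In-memory stand-in for a collection that remembers content hashes."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._by_hash: Dict[bytes, Vector] = {}
        self.api_calls = 0  # counts how often the expensive call actually ran

    def embed(self, text: str) -> Vector:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        cached = self._by_hash.get(digest)
        if cached is not None:
            # Same content seen before: reuse the stored vector, no API call.
            return cached
        self.api_calls += 1
        vector = self._embed(text)
        self._by_hash[digest] = vector
        return vector
```

Embedding the same text twice through this store triggers only one call to the underlying function.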

simonw avatar May 25 '25 20:05 simonw

I'm not convinced it's worth having a fragment_id column on embeddings to reference a fragment rather than dumping that content in the existing content or content_blob columns. The embeddings database isn't actually designed to always be the same as the logging database - that's why the schema is documented separately on https://llm.datasette.io/en/stable/embeddings/python-api.html#sql-schema as opposed to being included on https://llm.datasette.io/en/stable/logging.html#sql-schema

simonw avatar May 25 '25 20:05 simonw

I think the simplest version of this just adds support for a single -f/--fragment option to llm embed, uses the existing resolve_fragments() function to resolve that (which handles URLs and fragment loader plugins), and then throws an error if what comes back is either an attachment or more than one fragment. If it's a single fragment, that gets treated as regular input and stored in the content or content_blob column.
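The validation step might look something like the sketch below. It does not call resolve_fragments() itself: `FakeFragment` and `FakeAttachment` are hypothetical stand-ins for whatever that function returns, and rejecting an empty result is an extra assumption not stated above.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class FakeFragment:
    """Stand-in for a resolved text fragment."""
    text: str


@dataclass
class FakeAttachment:
    """Stand-in for a resolved attachment (image, audio, ...)."""
    path: str


def single_fragment_text(
    resolved: List[Union[FakeFragment, FakeAttachment]],
) -> str:
    """Return the text of exactly one fragment, or raise ValueError."""
    if any(isinstance(item, FakeAttachment) for item in resolved):
        raise ValueError("Attachments cannot be used as embedding input")
    if len(resolved) != 1:
        raise ValueError(f"Expected exactly one fragment, got {len(resolved)}")
    return resolved[0].text
```

The returned string would then be handled like any other text passed to llm embed.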

simonw avatar May 25 '25 20:05 simonw

@daturkel absolutely go ahead and have a go at this.

And since you're spending a bunch of time in embeddings at the moment, I'd love to get your thoughts on this issue I just filed:

  • #1085

simonw avatar May 25 '25 20:05 simonw

Just started cooking on this here. Sadly -f is already taken for embed.

Should embed-multi also be supported?

daturkel avatar May 25 '25 23:05 daturkel