Simon Willison
Maybe the solution is something like this:

- `id` - the ID of the document
- `strategy` - the optional chunking strategy, e.g. paragraphs or sections or sentences
- `chunk`...
The alternative to the above would be just adding `strategy` and `chunk_index` columns to the existing `embeddings` table. I'm likely WAY over-thinking the cost of continuing to use a string...
Here's that table right now:

```sql
CREATE TABLE "embeddings" (
    [collection_id] INTEGER REFERENCES [collections]([id]),
    [id] TEXT,
    [embedding] BLOB,
    [content] TEXT,
    [content_hash] BLOB,
    [metadata] TEXT,
    [updated] INTEGER,
    PRIMARY KEY ([collection_id], [id])
)
```
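The "just add columns" alternative could be sketched like this (the column names beyond `strategy` and `chunk_index`, and the expanded primary key, are my guesses at the shape, not a settled design):

```python
import sqlite3

# Hypothetical extension of the existing embeddings table: the same
# columns plus strategy and chunk_index, both folded into the primary
# key so one document can have many chunks per strategy.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE "embeddings" (
        [collection_id] INTEGER REFERENCES [collections]([id]),
        [id] TEXT,
        [strategy] TEXT,       -- e.g. 'paragraphs', 'sections', 'sentences'
        [chunk_index] INTEGER, -- position of the chunk within the document
        [embedding] BLOB,
        [content] TEXT,
        [content_hash] BLOB,
        [metadata] TEXT,
        [updated] INTEGER,
        PRIMARY KEY ([collection_id], [id], [strategy], [chunk_index])
    )
    """
)
columns = [row[1] for row in db.execute("PRAGMA table_info(embeddings)")]
print(columns)
```

One catch with this shape: `strategy` being "optional" means NULLs in a primary key column, which runs straight into the compound-key NULL problem discussed below.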
Yes, I'm over-thinking this schema. If a user cares that much about the space taken up by those IDs they can themselves use shorter IDs and implement their own lookup...
Ran an experiment here: https://chat.openai.com/share/3f53008c-0d45-438b-a801-44ad88990f25 If one of the columns in a compound primary key can contain null, it's possible for two rows to have the same primary key. I...
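That SQLite quirk is easy to reproduce locally; here's a minimal standalone check (the table and column names are made up for the demo):

```python
import sqlite3

# In an ordinary (rowid) SQLite table, PRIMARY KEY columns may be NULL,
# and two NULLs never compare equal -- so the "unique" key happily
# admits what look like duplicate rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE demo (a TEXT, b TEXT, PRIMARY KEY (a, b))")
db.execute("INSERT INTO demo VALUES ('doc-1', NULL)")
db.execute("INSERT INTO demo VALUES ('doc-1', NULL)")  # no constraint error
count = db.execute("SELECT count(*) FROM demo").fetchone()[0]
print(count)  # 2
```

Declaring the key columns `NOT NULL` (or using a `WITHOUT ROWID` table, which enforces that) closes the loophole.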
A really simple chunker I could include by default would be `lines` - it splits on newlines and discards any empty ones. A chunker function gets fed text and returns...
Initial prototype thoughts:

```diff
diff --git a/llm/default_plugins/chunkers.py b/llm/default_plugins/chunkers.py
new file mode 100644
index 0000000..23fa750
--- /dev/null
+++ b/llm/default_plugins/chunkers.py
@@ -0,0 +1,13 @@
+from llm import hookimpl
+
+
+def lines(text):
...
```
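The truncated function body would presumably look something like this (a sketch based on the description above; returning a generator rather than a list is my assumption about the chunker contract):

```python
def lines(text):
    """Chunker: split text on newlines, discarding empty lines.

    A guess at the truncated body above -- yields chunks lazily,
    assuming chunker functions are allowed to be generators.
    """
    for line in text.split("\n"):
        if line.strip():
            yield line


print(list(lines("one\n\ntwo\nthree\n")))  # ['one', 'two', 'three']
```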
If I do get chunking working, the obvious related feature is a search that's "smarter" than the current `llm similar` command, by being chunk-aware. I'm not sure what this would...
There's a related feature here that I might want to roll into this database schema: the ability to attach embeddings to a document that aren't actually from its content at...
I had trouble with that prompt. I really want a newline-delimited list of questions, but:

```bash
cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a...
```