Simon Willison
Maybe the solution is something like this:

- `id` - the ID of the document
- `strategy` - the optional chunking strategy, e.g. paragraphs or sections or sentences
- `chunk`...
The alternative to the above would be just adding `strategy` and `chunk_index` columns to the existing `embeddings` table. I'm likely WAY over-thinking the cost of continuing to use a string...
Here's that table right now:

```sql
CREATE TABLE "embeddings" (
    [collection_id] INTEGER REFERENCES [collections]([id]),
    [id] TEXT,
    [embedding] BLOB,
    [content] TEXT,
    [content_hash] BLOB,
    [metadata] TEXT,
    [updated] INTEGER,
    PRIMARY KEY ([collection_id], [id])
)
```
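The "just add columns" alternative could be sketched like this (the column names beyond `strategy` and `chunk_index`, and the expanded primary key, are my guesses at the shape, not a settled design):

```python
import sqlite3

# Hypothetical extension of the existing embeddings table: the same
# columns plus strategy and chunk_index, both folded into the primary
# key so one document can have many chunks per strategy.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE "embeddings" (
        [collection_id] INTEGER REFERENCES [collections]([id]),
        [id] TEXT,
        [strategy] TEXT,       -- e.g. 'paragraphs', 'sections', 'sentences'
        [chunk_index] INTEGER, -- position of the chunk within the document
        [embedding] BLOB,
        [content] TEXT,
        [content_hash] BLOB,
        [metadata] TEXT,
        [updated] INTEGER,
        PRIMARY KEY ([collection_id], [id], [strategy], [chunk_index])
    )
    """
)
columns = [row[1] for row in db.execute("PRAGMA table_info(embeddings)")]
print(columns)
```

One catch with this shape: `strategy` being "optional" means NULLs in a primary key column, which runs straight into the compound-key NULL problem discussed below.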
Yes, I'm over-thinking this schema. If a user cares that much about the space taken up by those IDs they can themselves use shorter IDs and implement their own lookup...
Ran an experiment here: https://chat.openai.com/share/3f53008c-0d45-438b-a801-44ad88990f25 If one of the columns in a compound primary key can contain null, it's possible for two rows to have the same primary key. I...
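That SQLite quirk is easy to reproduce locally; here's a minimal standalone check (the table and column names are made up for the demo):

```python
import sqlite3

# In an ordinary (rowid) SQLite table, PRIMARY KEY columns may be NULL,
# and two NULLs never compare equal -- so the "unique" key happily
# admits what look like duplicate rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE demo (a TEXT, b TEXT, PRIMARY KEY (a, b))")
db.execute("INSERT INTO demo VALUES ('doc-1', NULL)")
db.execute("INSERT INTO demo VALUES ('doc-1', NULL)")  # no constraint error
count = db.execute("SELECT count(*) FROM demo").fetchone()[0]
print(count)  # 2
```

Declaring the key columns `NOT NULL` (or using a `WITHOUT ROWID` table, which enforces that) closes the loophole.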
A really simple chunker I could include by default would be `lines` - it splits on newlines and discards any empty ones. A chunker function gets fed text and returns...
Initial prototype thoughts:

```diff
diff --git a/llm/default_plugins/chunkers.py b/llm/default_plugins/chunkers.py
new file mode 100644
index 0000000..23fa750
--- /dev/null
+++ b/llm/default_plugins/chunkers.py
@@ -0,0 +1,13 @@
+from llm import hookimpl
+
+
+def lines(text):
...
```
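The truncated function body would presumably look something like this (a sketch based on the description above; returning a generator rather than a list is my assumption about the chunker contract):

```python
def lines(text):
    """Chunker: split text on newlines, discarding empty lines.

    A guess at the truncated body above -- yields chunks lazily,
    assuming chunker functions are allowed to be generators.
    """
    for line in text.split("\n"):
        if line.strip():
            yield line


print(list(lines("one\n\ntwo\nthree\n")))  # ['one', 'two', 'three']
```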
If I do get chunking working, the obvious related feature is a search that's "smarter" than the current `llm similar` command, by being chunk-aware. I'm not sure what this would...
There's a related feature here that I might want to roll into this database schema: the ability to attach embeddings to a document that aren't actually from its content at...
I had trouble with that prompt. I really want a newline-delimited list of questions, but:

```bash
cat docs/python-api.md | llm -s 'Questions that are answered by this document, as a...
```