Duplicate the --save feature from openai-to-sqlite similar
https://github.com/simonw/openai-to-sqlite/blob/361d98a7f260a1420e6e698481f298848b922253/README.md#saving-similarity-calculations-to-the-database
This is the feature that can be used to save calculated similarity scores to the database. I use it to serve related TILs on my TILs site: https://til.simonwillison.net/llms/openai-embeddings-related-content
openai-to-sqlite similar embeddings-bjcp-2021.db \
    --all --save
And this feature too:
openai-to-sqlite similar embeddings-bjcp-2021.db \
    '23G Gose' '01A American Light Lager' \
    --save \
    --recalculate-for-matches \
    --count 20
The similarities table is pretty simple: https://til.simonwillison.net/tils/similarities
CREATE TABLE [similarities] (
    [id] TEXT,
    [other_id] TEXT,
    [score] FLOAT,
    PRIMARY KEY ([id], [other_id])
);
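As a rough illustration of what saving similarity scores into a table of this shape involves, here is a minimal sketch using only the Python standard library. The `cosine_similarity` and `save_similarities` helpers and the toy vectors are hypothetical, written for this example; they are not the code used by openai-to-sqlite.

```python
import math
import sqlite3


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def save_similarities(conn, embeddings, count=10):
    # Hypothetical helper: embeddings is a dict of id -> vector.
    # Stores the top `count` most similar other IDs for every ID.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS similarities (
            id TEXT,
            other_id TEXT,
            score FLOAT,
            PRIMARY KEY (id, other_id)
        )
        """
    )
    for id_, vector in embeddings.items():
        scored = [
            (other_id, cosine_similarity(vector, other))
            for other_id, other in embeddings.items()
            if other_id != id_
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        conn.executemany(
            "INSERT OR REPLACE INTO similarities (id, other_id, score) "
            "VALUES (?, ?, ?)",
            [(id_, other_id, score) for other_id, score in scored[:count]],
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}  # made-up vectors
save_similarities(conn, toy, count=1)
rows = conn.execute(
    "SELECT id, other_id FROM similarities ORDER BY id"
).fetchall()
print(rows)
```

Serving related content then becomes a single lookup against the saved rows, ordered by score, rather than a fresh similarity calculation on every request.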
For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.
Maybe similarity score functions should be provided by plugins, and stored in a scoring_functions table with an integer primary key (as a foreign key from similarities) plus a text column that stores the path to the function - so if it's in core it's llm.scoring.cosine_similarity but if it's from some plugin it's llm_manhattan.manhattan.
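One way the "path to the function" idea could work is to store a dotted path and resolve it with importlib at run time. This is only a sketch under assumptions: the `resolve_function` helper and the `path` column name are hypothetical, and the standard library's `math.dist` stands in for a real scoring function so that the example runs without llm or any plugin installed.

```python
import importlib
import sqlite3


def resolve_function(path):
    # Hypothetical helper: split "package.module.function" into a module
    # path and an attribute name, import the module, return the attribute
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


conn = sqlite3.connect(":memory:")
# Assumed schema: integer primary key plus a text column for the path
conn.execute(
    "CREATE TABLE scoring_functions (id INTEGER PRIMARY KEY, path TEXT UNIQUE)"
)
# math.dist is a stand-in here, not a proposed scoring function
conn.execute("INSERT INTO scoring_functions (path) VALUES (?)", ("math.dist",))

path = conn.execute(
    "SELECT path FROM scoring_functions WHERE id = 1"
).fetchone()[0]
score_fn = resolve_function(path)
distance = score_fn([0.0, 0.0], [3.0, 4.0])
print(path, distance)
```

A row in similarities would then carry the integer ID of the scoring function that produced it, so scores from different functions could sit side by side.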
The same mechanism could work for chunking functions too, see:
- #220
> For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.
Since I have a migrations system in place I can ignore that idea for the moment and add it in the future if appropriate.
I'm going to implement --save and --print and --recalculate-for-matches but not --table.
I need to land this first, since it has a migration in already:
- #254
The migration for this will be:
@embeddings_migrations()
def m006_similarities(db):
    db["similarities"].create({
        "collection_id": int,
        "id": str,
        "other_id": str,
        "score": float,
    }, pk=("collection_id", "id", "other_id"))
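To show what the three-column compound primary key buys, here is a small standard-library sketch. The CREATE TABLE statement is an assumed approximation of what this migration would produce, and the IDs and scores are made up. The same pair of IDs can hold a separate score in each collection, while re-saving a pair within one collection replaces the earlier score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed equivalent of the schema created by the migration
conn.execute(
    """
    CREATE TABLE similarities (
        collection_id INTEGER,
        id TEXT,
        other_id TEXT,
        score FLOAT,
        PRIMARY KEY (collection_id, id, other_id)
    )
    """
)

rows = [
    (1, "23G", "01A", 0.81),  # collection 1, first calculation
    (2, "23G", "01A", 0.42),  # same pair in a different collection
    (1, "23G", "01A", 0.85),  # recalculated: replaces the first row
]
conn.executemany("INSERT OR REPLACE INTO similarities VALUES (?, ?, ?, ?)", rows)

stored = conn.execute(
    "SELECT collection_id, score FROM similarities ORDER BY collection_id"
).fetchall()
print(stored)
```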
The compound primary keys make this a bit harder, since sqlite-utils and Datasette don't really support those for foreign keys yet. Already filed one bug:
- https://github.com/simonw/sqlite-utils/issues/594
This was getting a bit fiddly, so I decided to drop it from 0.10.