llm Duplicate the --save feature from openai-to-sqlite similar

https://github.com/simonw/openai-to-sqlite/blob/361d98a7f260a1420e6e698481f298848b922253/README.md#saving-similarity-calculations-to-the-database

This is the feature that can be used to save calculated similarity scores to the database. I use it to serve related TILs on my TILs site: https://til.simonwillison.net/llms/openai-embeddings-related-content

openai-to-sqlite similar embeddings-bjcp-2021.db \
  --all --save

And this feature too:

openai-to-sqlite similar embeddings-bjcp-2021.db \
  '23G Gose' '01A American Light Lager' \
  --save \
  --recalculate-for-matches \
  --count 20

Sep 05 '23 01:09 simonw

The similarities table is pretty simple: https://til.simonwillison.net/tils/similarities

CREATE TABLE [similarities] (
   [id] TEXT,
   [other_id] TEXT,
   [score] FLOAT,
   PRIMARY KEY ([id], [other_id])
);
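A rough sketch of how a table like this can serve related content, using only the standard library sqlite3 module. The row IDs, scores and the related() helper are made up for illustration; they are not taken from the TIL site's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same schema as the similarities table shown above
conn.execute(
    """
    CREATE TABLE [similarities] (
       [id] TEXT,
       [other_id] TEXT,
       [score] FLOAT,
       PRIMARY KEY ([id], [other_id])
    )
    """
)
# Hypothetical IDs and scores, purely illustrative
conn.executemany(
    "INSERT INTO similarities (id, other_id, score) VALUES (?, ?, ?)",
    [
        ("til-a", "til-b", 0.91),
        ("til-a", "til-c", 0.84),
        ("til-a", "til-d", 0.79),
        ("til-b", "til-a", 0.91),
    ],
)


def related(conn, id_, limit=2):
    """Return the highest-scoring (other_id, score) pairs for one row."""
    return conn.execute(
        "SELECT other_id, score FROM similarities "
        "WHERE id = ? ORDER BY score DESC LIMIT ?",
        (id_, limit),
    ).fetchall()


top = related(conn, "til-a")
```

Because the scores are precalculated, the page only needs this one indexed lookup rather than comparing embeddings at request time.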

For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.

Sep 05 '23 01:09 simonw

Maybe similarity score functions should be provided by plugins, and stored in a scoring_functions table with an integer primary key (as a foreign key from similarities) plus a text column that stores the path to the function - so if it's in core it's llm.scoring.cosine_similarity but if it's from some plugin it's llm_manhattan.manhattan.
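Here's a minimal sketch of that idea, assuming a scoring_functions table with just id and path columns and a hypothetical resolve() helper that imports a function from its dotted path. Neither exists in llm today. Since llm.scoring.cosine_similarity and llm_manhattan.manhattan are only proposed names, the demo registers the standard library's math.dist instead.

```python
import importlib
import sqlite3


def resolve(path):
    """Import and return the function named by a dotted path (hypothetical helper)."""
    module_path, _, name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), name)


conn = sqlite3.connect(":memory:")
# Assumed schema: integer primary key plus the dotted path to the function
conn.execute(
    "CREATE TABLE scoring_functions (id INTEGER PRIMARY KEY, path TEXT UNIQUE)"
)
# math.dist stands in for a real scoring function here
conn.execute("INSERT INTO scoring_functions (path) VALUES (?)", ("math.dist",))

function_id, path = conn.execute(
    "SELECT id, path FROM scoring_functions"
).fetchone()
score = resolve(path)([0.0, 0.0], [3.0, 4.0])
```

A similarities row would then store function_id as its foreign key, and the path column would identify which function produced a given score.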

The same mechanism could work for chunking functions too, see:

#220

Sep 05 '23 01:09 simonw

> For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.

Since I have a migrations system in place I can ignore that idea for the moment and add it in the future if appropriate.

Sep 12 '23 01:09 simonw

I'm going to implement --save and --print and --recalculate-for-matches but not --table.

Sep 12 '23 01:09 simonw

I need to land this first, since it has a migration in already:

#254

Sep 12 '23 01:09 simonw

The migration for this will be:

@embeddings_migrations()
def m006_similarities(db):
    db["similarities"].create({
        "collection_id": int,
        "id": str,
        "other_id": str,
        "score": float,
    }, pk=("collection_id", "id", "other_id"))
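As a sketch of what saving into that table might look like: this uses the standard library sqlite3 module rather than sqlite-utils so it runs on its own, and the save_similarities() helper, the plain-Python cosine_similarity() and the tiny vectors are all assumptions for illustration, not the planned implementation.

```python
import sqlite3
from itertools import permutations
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def save_similarities(conn, collection_id, vectors):
    """Score every ordered pair of IDs and upsert the results (hypothetical helper)."""
    rows = [
        (collection_id, id_, other_id,
         cosine_similarity(vectors[id_], vectors[other_id]))
        for id_, other_id in permutations(vectors, 2)
    ]
    with conn:
        # The compound primary key lets OR REPLACE overwrite stale scores
        conn.executemany(
            "INSERT OR REPLACE INTO similarities "
            "(collection_id, id, other_id, score) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
# Hand-written equivalent of the table the migration above would create
conn.execute(
    """
    CREATE TABLE similarities (
        collection_id INTEGER,
        id TEXT,
        other_id TEXT,
        score FLOAT,
        PRIMARY KEY (collection_id, id, other_id)
    )
    """
)
# Made-up two-dimensional vectors standing in for real embeddings
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
saved = save_similarities(conn, 1, vectors)
```

Running save_similarities() a second time for the same collection replaces the existing rows instead of duplicating them, which is the behavior I'd want from --save.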

Sep 12 '23 01:09 simonw

The compound primary keys make this a bit harder, since sqlite-utils and Datasette don't really support those for foreign keys yet. Already filed one bug:

https://github.com/simonw/sqlite-utils/issues/594

Sep 12 '23 04:09 simonw

This was getting a bit fiddly, so I decided to drop it from 0.10.

Sep 12 '23 04:09 simonw
