
Duplicate the --save feature from openai-to-sqlite similar

Open simonw opened this issue 10 months ago • 8 comments

https://github.com/simonw/openai-to-sqlite/blob/361d98a7f260a1420e6e698481f298848b922253/README.md#saving-similarity-calculations-to-the-database

This is the feature that can be used to save calculated similarity scores to the database. I use it to serve related TILs on my TILs site: https://til.simonwillison.net/llms/openai-embeddings-related-content

openai-to-sqlite similar embeddings-bjcp-2021.db \
  --all --save

And this feature too:

openai-to-sqlite similar embeddings-bjcp-2021.db \
  '23G Gose' '01A American Light Lager' \
  --save \
  --recalculate-for-matches \
  --count 20

simonw avatar Sep 05 '23 01:09 simonw

The similarities table is pretty simple: https://til.simonwillison.net/tils/similarities

CREATE TABLE [similarities] (
   [id] TEXT,
   [other_id] TEXT,
   [score] FLOAT,
   PRIMARY KEY ([id], [other_id])
);

For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.
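As a rough sketch of what saving scores per collection could look like, the following uses only the standard library. The `save_similarities` helper, the in-memory `vectors` dictionary and the sample values are hypothetical stand-ins for illustration, not part of llm or openai-to-sqlite.

```python
import math
import sqlite3


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def save_similarities(conn, collection_id, vectors, count=10):
    """Store the top `count` matches for every id in `vectors`.

    `vectors` maps id -> list of floats. It is a hypothetical stand-in
    for embeddings loaded from a collection.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS similarities (
            collection_id INTEGER,
            id TEXT,
            other_id TEXT,
            score FLOAT,
            PRIMARY KEY (collection_id, id, other_id)
        )
        """
    )
    for id_, vector in vectors.items():
        # Score every other item, then keep the highest-scoring ones
        scored = sorted(
            (
                (cosine_similarity(vector, other), other_id)
                for other_id, other in vectors.items()
                if other_id != id_
            ),
            reverse=True,
        )[:count]
        # INSERT OR REPLACE lets a later run overwrite stale scores
        conn.executemany(
            "INSERT OR REPLACE INTO similarities VALUES (?, ?, ?, ?)",
            [(collection_id, id_, other_id, score) for score, other_id in scored],
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
save_similarities(conn, 1, vectors, count=1)
rows = conn.execute(
    "SELECT id, other_id FROM similarities ORDER BY id"
).fetchall()
print(rows)  # [('a', 'b'), ('b', 'a'), ('c', 'b')]
```

Putting collection_id first in the primary key means the same id can appear in several collections without clashing.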

simonw avatar Sep 05 '23 01:09 simonw

Maybe similarity score functions should be provided by plugins and recorded in a scoring_functions table: an integer primary key (referenced as a foreign key from similarities) plus a text column that stores the dotted path to the function. A function in core would be llm.scoring.cosine_similarity, while one from a plugin might be llm_manhattan.manhattan.
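A minimal sketch of how a stored dotted path could be turned back into a callable, assuming a two-column scoring_functions table. The `resolve_function` helper is hypothetical, and `math.dist` from the standard library stands in for a real scoring function because neither path named above exists yet.

```python
import importlib
import sqlite3


def resolve_function(path):
    """Import and return the callable named by a dotted path like 'pkg.mod.func'."""
    module_path, _, name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), name)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scoring_functions (id INTEGER PRIMARY KEY, path TEXT UNIQUE)"
)
# math.dist (Euclidean distance) is only a placeholder for illustration
conn.execute("INSERT INTO scoring_functions (path) VALUES (?)", ("math.dist",))

# Look the path up by its integer key, then import the function it names
path = conn.execute(
    "SELECT path FROM scoring_functions WHERE id = 1"
).fetchone()[0]
score = resolve_function(path)
print(score([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Storing the path rather than the function keeps the table readable and lets a missing plugin fail at import time with a clear error.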

The same mechanism could work for chunking functions too, see:

  • #220

simonw avatar Sep 05 '23 01:09 simonw

> For llm I think I need to at the very least add a collection_id column. But maybe it should support saving multiple different types of score too? I'm going to grow beyond cosine similarity at some point.

Since I have a migrations system in place I can ignore that idea for the moment and add it in the future if appropriate.

simonw avatar Sep 12 '23 01:09 simonw

I'm going to implement --save and --print and --recalculate-for-matches but not --table.

simonw avatar Sep 12 '23 01:09 simonw

I need to land this first, since it has a migration in already:

  • #254

simonw avatar Sep 12 '23 01:09 simonw

The migration for this will be:

@embeddings_migrations()
def m006_similarities(db):
    db["similarities"].create({
        "collection_id": int,
        "id": str,
        "other_id": str,
        "score": float,
    }, pk=("collection_id", "id", "other_id"))

simonw avatar Sep 12 '23 01:09 simonw

The compound primary keys make this a bit harder, since sqlite-utils and Datasette don't really support those for foreign keys yet. Already filed one bug:

  • https://github.com/simonw/sqlite-utils/issues/594
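
For context, SQLite itself accepts compound foreign keys; the difficulty described here is in the tooling around it. The sketch below uses plain `sqlite3` and a stripped-down embeddings table whose (collection_id, id) primary key is an assumption made for illustration, not a copy of the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
conn.executescript(
    """
    -- Hypothetical, simplified stand-in for the embeddings table
    CREATE TABLE embeddings (
        collection_id INTEGER,
        id TEXT,
        PRIMARY KEY (collection_id, id)
    );
    CREATE TABLE similarities (
        collection_id INTEGER,
        id TEXT,
        other_id TEXT,
        score FLOAT,
        PRIMARY KEY (collection_id, id, other_id),
        FOREIGN KEY (collection_id, id)
            REFERENCES embeddings (collection_id, id),
        FOREIGN KEY (collection_id, other_id)
            REFERENCES embeddings (collection_id, id)
    );
    INSERT INTO embeddings VALUES (1, 'a'), (1, 'b');
    INSERT INTO similarities VALUES (1, 'a', 'b', 0.99);
    """
)

# A row pointing at an embedding that does not exist should be rejected
try:
    conn.execute("INSERT INTO similarities VALUES (1, 'a', 'missing', 0.5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```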

simonw avatar Sep 12 '23 04:09 simonw

This was getting a bit fiddly, so I decided to drop it from 0.10.

simonw avatar Sep 12 '23 04:09 simonw