lance
Top-level index concept with stable ID
Introduce index UUIDs that do not change during indexing and stay stable for a given (index_type, index_name, column_name).
Background
- Right now, "indices" in the Lance manifest refer to segments (pieces) of an index. For example, creating a vector index creates one index segment. If you then add data and index it incrementally with a delta index, you have two "indices" that together make up the full vector index. That full vector index has no proper entity in Lance. It is just the aggregation of all index segments that share the same name within the same table.
- A related problem: because an `Index` is attached to a single file, there is no such thing as an empty index. It is impossible to add an index to a newly created empty table. This strikes many users as odd, as they expect to be able to create a table, add an index, and then start inserting data.
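To make the current situation concrete, here is a minimal sketch of how a logical index can only be recovered by grouping segment entries on their shared name. `SegmentEntry` and `group_segments_by_name` are hypothetical names for illustration and are not Lance APIs; the real manifest entries carry more fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentEntry:
    """Hypothetical stand-in for one manifest "index" entry (really a segment)."""
    uuid: str
    name: str
    column: str


def group_segments_by_name(entries):
    """Reassemble logical indexes by grouping segment entries on their name."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.name].append(entry)
    return dict(grouped)


manifest_entries = [
    SegmentEntry(uuid="seg-1", name="vec_idx", column="embedding"),  # initial index
    SegmentEntry(uuid="seg-2", name="vec_idx", column="embedding"),  # delta index
]

# Two entries, one logical index: nothing but the name ties them together.
logical = group_segments_by_name(manifest_entries)

# With no segments there is nothing to group, so an "empty index" cannot exist.
empty_table = group_segments_by_name([])
```

In this model the logical index has no identity of its own, which is the gap the proposal is meant to close.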
Indexes
We probably need a terminology change:
- `Index` -> `IndexSegment`
- `Index` becomes a top-level index configuration
Users should be able to specify an index configuration up front, which will be saved into the `Index` entry. The `Index` will have a UUID in addition to a name, so that it can be differentiated from previous versions with the same name.
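One possible shape for this, as a rough sketch rather than a proposed schema: a top-level `Index` that owns its configuration and a stable UUID, plus a list of `IndexSegment` entries that may be empty. All field names here (`index_type`, `params`, `fragment_ids`, `add_segment`) are assumptions made for illustration. Giving a recreated index of the same name a fresh UUID is one reading of "differentiated from previous versions".

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class IndexSegment:
    """One physical piece of an index (what the manifest calls an index today)."""
    uuid: UUID
    fragment_ids: list  # hypothetical: fragments covered by this segment


@dataclass
class Index:
    """Top-level index configuration with a stable identity (illustrative only)."""
    name: str
    index_type: str
    column: str
    params: dict = field(default_factory=dict)
    uuid: UUID = field(default_factory=uuid4)  # assigned once, never changed
    segments: list = field(default_factory=list)  # may be empty

    def add_segment(self, fragment_ids):
        """Record a new segment; the top-level UUID is left untouched."""
        segment = IndexSegment(uuid=uuid4(), fragment_ids=list(fragment_ids))
        self.segments.append(segment)
        return segment


# The index can exist before any data is indexed.
idx = Index(name="vec_idx", index_type="IVF_PQ", column="embedding")
id_before = idx.uuid
had_no_segments = idx.segments == []

# Initial indexing and a later delta index both add segments.
idx.add_segment([0, 1])
idx.add_segment([2])

# Dropping and recreating an index with the same name yields a different UUID.
replacement = Index(name="vec_idx", index_type="IVF_PQ", column="embedding")
```

Here the segments gain and lose members over time while `idx.uuid` stays fixed, so anything that needs to refer to "the index" has a stable handle.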
There are some rules to work out about what is allowed for indexing. BTree indices don't really require any up-front training, and would be a good candidate to demonstrate how we can create an index without any data and update it incrementally. However, anything with IVF requires some training or algorithm to create and split clusters. That needs special care and design.
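The difference can be illustrated with two toy classes. They are not Lance's BTree or IVF implementations: `SortedScalarIndex` uses a sorted list as a stand-in for a BTree, and `ClusteredVectorIndex.train` simply accepts precomputed centroids instead of running a clustering algorithm. The point is only that the first is valid with zero rows, while the second cannot place a row until it has clusters.

```python
import bisect


class SortedScalarIndex:
    """BTree-like stand-in: usable while empty, no training step."""

    def __init__(self):
        self._entries = []  # sorted (value, row_id) pairs

    def insert(self, value, row_id):
        bisect.insort(self._entries, (value, row_id))

    def lookup(self, value):
        # (value,) sorts before every (value, row_id) pair with the same value.
        start = bisect.bisect_left(self._entries, (value,))
        rows = []
        for entry_value, row_id in self._entries[start:]:
            if entry_value != value:
                break
            rows.append(row_id)
        return rows


class ClusteredVectorIndex:
    """IVF-like stand-in: rows cannot be assigned until centroids exist."""

    def __init__(self):
        self.centroids = None
        self.partitions = {}

    def train(self, centroids):
        # Assumption: centroids are supplied by the caller; real training
        # would derive them from data.
        self.centroids = [tuple(c) for c in centroids]
        self.partitions = {i: [] for i in range(len(self.centroids))}

    def insert(self, vector, row_id):
        if self.centroids is None:
            raise RuntimeError("index must be trained before rows can be assigned")
        nearest = min(
            range(len(self.centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, self.centroids[i])),
        )
        self.partitions[nearest].append(row_id)
        return nearest


# The scalar index starts empty and is updated incrementally.
scalar = SortedScalarIndex()
scalar.insert(7, row_id=0)
scalar.insert(3, row_id=1)
scalar.insert(7, row_id=2)

# The clustered index rejects inserts until it has been trained.
vectors = ClusteredVectorIndex()
try:
    vectors.insert((9.0, 9.0), row_id=0)
    rejected_before_training = False
except RuntimeError:
    rejected_before_training = True

vectors.train([(0.0, 0.0), (10.0, 10.0)])
assigned_partition = vectors.insert((9.0, 9.0), row_id=0)
```

An empty IVF-style index would therefore need some rule for when and how clusters get created, which is the design question raised above.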