Top-level index concept with stable ID

Open • rpgreen opened this issue 9 months ago • 1 comment

Introduce UUIDs for indexes. A UUID should not change during indexing and should stay stable for a given (index_type, index_name, column_name).

rpgreen avatar May 09 '24 11:05 rpgreen

Background

  • Right now, "indices" in the Lance manifest refer to segments (pieces) of an index. For example, creating a vector index creates one index segment. If you then add data and incrementally index it with a delta index, you have two "indices" that together make up the full vector index. That full vector index has no proper entity in Lance. It is just the aggregation of all index segments that share the same name within the same table.
  • A related problem: because an Index is attached to a single file, there is no such thing as an empty index. It is impossible to add an index to a newly created empty table. This strikes many users as odd, as they expect to be able to create a table, add an index, and then start inserting data.

Indexes

We probably need a terminology change:

  • Index -> IndexSegment
  • Index becomes a top-level index configuration

Users should be able to specify an index configuration up front, which will be saved into the Index entry. The Index will have a UUID in addition to a name, so that it can be differentiated from previous versions of the same name.
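As a rough illustration of the proposed split, here is a minimal sketch of the two entities. Every name in it (`Index`, `IndexSegment`, `index_id`, `params`, `fragment_ids`, `add_segment`) is hypothetical and chosen for this example only. None of it is Lance's actual manifest schema or API.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexSegment:
    """One physical piece of an index (what the manifest calls an "index" today)."""
    segment_id: uuid.UUID
    fragment_ids: tuple[int, ...]  # hypothetical: data fragments this segment covers


@dataclass
class Index:
    """Hypothetical top-level entry holding the up-front index configuration."""
    name: str
    column: str
    index_type: str
    params: dict = field(default_factory=dict)
    # Assigned once at creation; adding segments never changes it.
    index_id: uuid.UUID = field(default_factory=uuid.uuid4)
    segments: list[IndexSegment] = field(default_factory=list)

    def add_segment(self, fragment_ids) -> IndexSegment:
        seg = IndexSegment(uuid.uuid4(), tuple(fragment_ids))
        self.segments.append(seg)
        return seg


# An index can exist before any data is written: it simply has no segments.
idx = Index(name="vec_idx", column="vector", index_type="IVF_PQ")
original_id = idx.index_id

# The initial build and a later delta index attach to the same Index entry.
idx.add_segment([0, 1])
idx.add_segment([2])

# Dropping and recreating an index under the same name yields a new UUID.
recreated = Index(name="vec_idx", column="vector", index_type="IVF_PQ")
```

In this sketch the logical index is an explicit object, so an empty index is just an `Index` with an empty `segments` list, and the UUID is what distinguishes `recreated` from `idx` even though the names match.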

There are some rules to work out about what is allowed for indexing. BTree indices don't really require any up-front training, so they would be a good candidate to demonstrate creating an index without any data and then updating it incrementally. However, anything with IVF requires some training, or an algorithm to create and split clusters. That needs special care and design.
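To make that distinction concrete, here is a toy sketch of such a rule. `SortedIndex` is a plain sorted list standing in for a BTree, and `TRAINING_FREE` and `create_empty_index` are names made up for this example. It does not reflect how Lance implements either index type.

```python
import bisect

# Assumption for this sketch: only these index types can be created without data.
TRAINING_FREE = {"BTREE"}


class SortedIndex:
    """Toy stand-in for a BTree index: sorted (value, row_id) pairs."""

    def __init__(self):
        self.entries = []

    def insert(self, value, row_id):
        # Incremental update: keep the entries sorted as rows arrive.
        bisect.insort(self.entries, (value, row_id))

    def lookup(self, value):
        # (value,) sorts before every (value, row_id), so this finds the first match.
        start = bisect.bisect_left(self.entries, (value,))
        rows = []
        for v, row_id in self.entries[start:]:
            if v != value:
                break
            rows.append(row_id)
        return rows


def create_empty_index(index_type):
    """Create an index with no data, if the index type allows it."""
    if index_type not in TRAINING_FREE:
        raise ValueError(f"{index_type} needs training data before it can be created")
    return SortedIndex()


# Create the index on an empty table, then insert rows afterwards.
btree = create_empty_index("BTREE")
btree.insert(5, row_id=0)
btree.insert(3, row_id=1)
btree.insert(5, row_id=2)
print(btree.lookup(5))
```

Calling `create_empty_index("IVF_PQ")` raises `ValueError` here, which stands in for the open design question: an IVF index would need either deferred training or some way to create and split clusters as data arrives.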

wjones127 avatar May 09 '24 16:05 wjones127