
`load_indices` should include index config

Open westonpace opened this issue 11 months ago • 4 comments

Ideally, the output from `load_indices` would let users know both:

  • Whether the index is an IvfPq index or a Btree index
  • What parameters were used to train the index

Today I can roughly guess the first one from the column type, but once we add more vector index types this will no longer be possible.

I currently have no way of getting the parameters. Being able to retrieve them would be very useful because users may forget these settings and want to examine them later (e.g. because they've learned more about vector indices and now want to know whether they need to rebuild their index).
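To make the request concrete, here is a rough sketch of the kind of record a stable API could return for each index. Everything in it is hypothetical: `IndexConfig`, its field names, and the `"IVF_PQ"` label are illustrative only and are not part of Lance.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class IndexConfig:
    """Hypothetical per-index record; not an existing Lance type."""

    name: str
    # Distinguishes e.g. a vector index from a scalar (btree) index.
    index_type: str
    # Parameters used to train the index, keyed by parameter name.
    params: dict[str, Any] = field(default_factory=dict)


# Illustrative values only.
example = IndexConfig(
    name="a_idx",
    index_type="IVF_PQ",
    params={"metric_type": "l2", "num_partitions": 2, "num_sub_vectors": 1, "nbits": 8},
)
```

A record like this would answer both questions above (which kind of index, and how it was trained) without depending on the column type.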

westonpace avatar Mar 07 '24 15:03 westonpace

I thought the parameters were in dataset.stats.index_stats?

```
(Pdb) from pprint import pprint
(Pdb) pprint(dataset.stats.index_stats(index_name))
{'index_type': 'IVF',
 'indices': [{'centroids': [[0.5594622492790222,
                             ...,
                             0.5300236940383911,
                             0.5513307452201843]],
              'index_type': 'IVF',
              'metric_type': 'l2',
              'num_partitions': 2,
              'partitions': [{'size': 238}, {'size': 274}],
              'sub_index': {'dimension': 32,
                            'index_type': 'PQ',
                            'metric_type': 'l2',
                            'nbits': 8,
                            'num_sub_vectors': 1},
              'uri': '/private/var/folders/09/h28jzzv164n6bn4ldrhhm73m0000gn/T/pytest-of-willjones/pytest-27/test_count_index_rows0/test/_indices/ef525f0b-4c87-42d9-9ace-3e2437b10c71/index.idx',
              'uuid': 'ef525f0b-4c87-42d9-9ace-3e2437b10c71'}],
 'name': 'a_idx',
 'num_indexed_fragments': 1,
 'num_indexed_rows': 512,
 'num_indices': 1,
 'num_unindexed_fragments': 0,
 'num_unindexed_rows': 0}
```
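As a minimal sketch, assuming the stats dict keeps the shape shown in the output above (which is not guaranteed), the training parameters could be pulled out with plain dict access. `summarize_index_params` and `sample_stats` are hypothetical names used only for illustration, and the sample is a trimmed copy of that output.

```python
from typing import Any


def summarize_index_params(stats: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect training parameters from each entry under 'indices'."""
    summaries = []
    for entry in stats.get("indices", []):
        sub = entry.get("sub_index", {})
        summaries.append(
            {
                "index_type": entry.get("index_type"),
                "metric_type": entry.get("metric_type"),
                "num_partitions": entry.get("num_partitions"),
                "sub_index_type": sub.get("index_type"),
                "nbits": sub.get("nbits"),
                "num_sub_vectors": sub.get("num_sub_vectors"),
            }
        )
    return summaries


# Trimmed copy of the output above (centroids, uri, and uuid omitted).
sample_stats = {
    "index_type": "IVF",
    "indices": [
        {
            "index_type": "IVF",
            "metric_type": "l2",
            "num_partitions": 2,
            "partitions": [{"size": 238}, {"size": 274}],
            "sub_index": {
                "dimension": 32,
                "index_type": "PQ",
                "metric_type": "l2",
                "nbits": 8,
                "num_sub_vectors": 1,
            },
        }
    ],
    "name": "a_idx",
}

summary = summarize_index_params(sample_stats)
```

Using `.get` throughout means a missing key yields `None` rather than an exception, which seems prudent for a dict whose layout may change.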

wjones127 avatar Mar 12 '24 16:03 wjones127

> I thought the parameters were in dataset.stats.index_stats?

@wjones127

They are, but the statistics are experimental and unstable. Since index parameters are stable, we should have a stable way of retrieving them.

I'm mainly filing this because I want to be able to load the index config in LanceDB, and I'm not sure we want to expose raw stats there.

westonpace avatar Mar 12 '24 16:03 westonpace

I hope we can make them more stable soon. IIRC the main impetus for exposing them was to let users retrieve and reuse the IVF centroids.

wjones127 avatar Mar 12 '24 20:03 wjones127

We also have some use cases where we need to check the metric_type. Agreed, would be nice to have a stable way of getting it.
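Until a stable accessor exists, a check like this could be sketched against the stats dict, assuming each entry under `indices` carries a `metric_type` key as in the output earlier in the thread. The helper names `metric_types` and `uses_metric` are hypothetical, and the dict layout is unstable, so treat this as a stopgap.

```python
def metric_types(stats: dict) -> set:
    """Return the distinct metric_type values found under 'indices'."""
    return {
        entry["metric_type"]
        for entry in stats.get("indices", [])
        if "metric_type" in entry
    }


def uses_metric(stats: dict, expected: str) -> bool:
    """True only if every index entry reports exactly the expected metric."""
    return metric_types(stats) == {expected}


# Minimal stand-in for the stats dict shown earlier in the thread.
stats_example = {"indices": [{"index_type": "IVF", "metric_type": "l2"}]}
is_l2 = uses_metric(stats_example, "l2")
```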

albertlockett avatar Mar 12 '24 20:03 albertlockett