lance
`load_indices` should include index config
Ideally, the output from `load_indices` would let me know both:
- Whether the index is an IvfPq index or a Btree index
- What parameters were used to train the index
Today I can roughly guess the first one from the column's type, but once we add more vector index types this will no longer be possible.
I have no way today of getting the parameters. This can be very useful because users may forget these things and want to examine them (e.g. because they've learned more about vector indices and now they want to know if they need to rebuild their index or not).
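To make the request concrete, here is a rough sketch of the kind of information I'd like each entry to carry. `IndexConfig`, its fields, the column name `"a"`, and `needs_rebuild` are all hypothetical names invented for illustration; none of them are part of Lance today.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class IndexConfig:
    """Hypothetical shape for one load_indices entry (not an existing Lance type)."""

    name: str
    columns: List[str]
    index_type: str  # e.g. "IVF_PQ" or "BTREE"
    params: Dict[str, Any] = field(default_factory=dict)  # training parameters


def needs_rebuild(cfg: IndexConfig, wanted: Dict[str, Any]) -> bool:
    """Return True if any wanted training parameter differs from the stored one."""
    return any(cfg.params.get(key) != value for key, value in wanted.items())


# Illustrative values only; "a" is an assumed column name.
cfg = IndexConfig(
    name="a_idx",
    columns=["a"],
    index_type="IVF_PQ",
    params={"num_partitions": 2, "num_sub_vectors": 1, "metric_type": "l2"},
)
```

With something like this, a user could compare the stored parameters against the ones they now want and decide whether a rebuild is worthwhile.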
I thought the parameters were in `dataset.stats.index_stats`?
(Pdb) from pprint import pprint
(Pdb) pprint(dataset.stats.index_stats(index_name))
{'index_type': 'IVF',
 'indices': [{'centroids': [[0.5594622492790222,
                             ...,
                             0.5300236940383911,
                             0.5513307452201843]],
              'index_type': 'IVF',
              'metric_type': 'l2',
              'num_partitions': 2,
              'partitions': [{'size': 238}, {'size': 274}],
              'sub_index': {'dimension': 32,
                            'index_type': 'PQ',
                            'metric_type': 'l2',
                            'nbits': 8,
                            'num_sub_vectors': 1},
              'uri': '/private/var/folders/09/h28jzzv164n6bn4ldrhhm73m0000gn/T/pytest-of-willjones/pytest-27/test_count_index_rows0/test/_indices/ef525f0b-4c87-42d9-9ace-3e2437b10c71/index.idx',
              'uuid': 'ef525f0b-4c87-42d9-9ace-3e2437b10c71'}],
 'name': 'a_idx',
 'num_indexed_fragments': 1,
 'num_indexed_rows': 512,
 'num_indices': 1,
 'num_unindexed_fragments': 0,
 'num_unindexed_rows': 0}
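As a stopgap, the index kind and training parameters can be picked out of a dictionary with this shape. This is only a sketch: `summarize_index_stats` is a hypothetical helper, and it assumes the keys stay exactly as in the output above, which is not a documented contract. The centroids are left out of the sample because they are elided in the output.

```python
def summarize_index_stats(stats):
    """Extract index kind and training parameters from an index_stats-style dict.

    Assumes the key layout shown in the pprint output above.
    """
    summaries = []
    for idx in stats.get("indices", []):
        sub = idx.get("sub_index") or {}
        kind = idx.get("index_type", "unknown")
        if sub.get("index_type"):
            # Combine the outer and sub-index types, e.g. "IVF" + "PQ" -> "IVF_PQ".
            kind = f"{kind}_{sub['index_type']}"
        params = {k: idx[k] for k in ("metric_type", "num_partitions") if k in idx}
        params.update(
            {k: sub[k] for k in ("dimension", "nbits", "num_sub_vectors") if k in sub}
        )
        summaries.append(
            {"uuid": idx.get("uuid"), "index_type": kind, "params": params}
        )
    return summaries


# Sample copied from the output above, minus the elided centroids and the uri.
sample_stats = {
    "index_type": "IVF",
    "indices": [
        {
            "index_type": "IVF",
            "metric_type": "l2",
            "num_partitions": 2,
            "partitions": [{"size": 238}, {"size": 274}],
            "sub_index": {
                "dimension": 32,
                "index_type": "PQ",
                "metric_type": "l2",
                "nbits": 8,
                "num_sub_vectors": 1,
            },
            "uuid": "ef525f0b-4c87-42d9-9ace-3e2437b10c71",
        }
    ],
    "name": "a_idx",
}

summary = summarize_index_stats(sample_stats)
```

In the real call, `sample_stats` would be the return value of `dataset.stats.index_stats(index_name)`.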
I thought the parameters were in dataset.stats.index_stats?
@wjones127
They are, but the statistics are experimental and unstable. Since index parameters are stable, we should have a stable way of retrieving them.
I'm mainly filing this because I want to be able to load the index config in LanceDB, and I'm not sure we want to expose raw stats in LanceDB.
I hope we can make them more stable soon. IIRC the main impetus for exposing them is making it so users can retrieve and re-use the IVF centroids.
We also have some use cases where we need to check the `metric_type`. Agreed, it would be nice to have a stable way of getting it.