lance
lance copied to clipboard
perf: use v2 to write the sorted IVF partition files
The old v1 approach created a row group for every partition (including empty partitions). This approach converts the input into a List<Struct<...>> array where each row is a partition of data. Empty partitions are not included in the array.
This yields significant performance benefits. I'm not 100% sure if the benefit is changing to v2 or simply changing the format in which we write the data.
Testing locally I see a bunch of failures from test_create_ivf_hnsw_with_empty_partition. I suspect the issue is that the call to shuffle_dataset is running more quickly than it did before and this is making it possible to write to more files which is causing the open files error to be more prevalent but I don't actually know.
Testing locally I see a bunch of failures from
test_create_ivf_hnsw_with_empty_partition. I suspect the issue is that the call toshuffle_datasetis running more quickly than it did before and this is making it possible to write to more files which is causing the open files error to be more prevalent but I don't actually know.
shuffle_dataset would be called only once during the entire indexing progress.
is that a recall assertion failure? I fixed such an issue before
This yields significant performance benefits. I'm not 100% sure if the benefit is changing to v2 or simply changing the format in which we write the data.
We never got a chance to look at this, but Rob was claiming just the act of concatenating batches improved performance, but I was skeptical. Maybe there's something to that?
https://github.com/lancedb/lance/pull/2384#discussion_r1611963145
We never got a chance to look at this, but Rob was claiming just the act of concatenating batches improved performance, but I was skeptical. Maybe there's something to that?
Yeah, I have to concatenate for v2 (v2 can write multiple arrays but I can't make a list array from multiple arrays) so that might explain it.
Codecov Report
Attention: Patch coverage is 77.49004% with 113 lines in your changes missing coverage. Please review.
Please upload report for BASE (
main@9703e50). Learn more about missing BASE report. Report is 7 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2492 +/- ##
=======================================
Coverage ? 79.61%
=======================================
Files ? 208
Lines ? 59425
Branches ? 59425
=======================================
Hits ? 47314
Misses ? 9387
Partials ? 2724
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 79.61% <77.49%> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Closing as this will be tackled piece by piece in other PRs.