Filtering file: Parquet and Feather twice as fast
I'm looking forward to moving some workloads to Lance, so I ran a test of filtering performance. (For retrieving data by row number, Lance is way better than Parquet or Feather.)
When filtering a dataset, I got these results:
- Lance avg processing time 33 seconds
- Feather avg processing time 14 seconds
- Parquet avg processing time 19 seconds
Is it expected that the other formats are about twice as fast for filtering?
Here is the benchmark code on Google Colab: https://colab.research.google.com/drive/1iG1MXJV-9hrqm4YcO_MzR0Qt67sMFfEj?usp=sharing
It's benchmark 2 in the notebook.
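For reference, here's a minimal sketch of the kind of timing comparison benchmark 2 runs (the actual code is in the notebook above). The file paths, the `pickup_minute` column, and the filter value are placeholders, and the exact Lance filter syntax may differ between versions:

```python
import time
import lance
import pyarrow.dataset as pads

def avg_time(fn, repeats=3):
    # Average wall-clock time over a few repeats.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Same data stored in both formats (placeholder paths).
parquet_ds = pads.dataset("taxi.parquet", format="parquet")
lance_ds = lance.dataset("taxi.lance")

parquet_s = avg_time(lambda: parquet_ds.to_table(filter=pads.field("pickup_minute") == 30))
lance_s = avg_time(lambda: lance_ds.to_table(filter="pickup_minute = 30"))

print(f"parquet avg: {parquet_s:.2f}s, lance avg: {lance_s:.2f}s")
```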
Thanks Luca. This really helps to get concrete numbers here.
We haven't implemented stats-based pruning in Lance yet, which would make the filtering path faster. Also, what's the result set size after filtering? If you're looking for, say, 10 rows out of 1M, Lance will let you read much less data. But if you're retrieving 100K out of 1M, that's much closer to a regular scan.
Lastly, not sure if there's some version mismatch, but when I tried to run the Colab notebook I got an error:
Thanks @lucazanna !
Lance is currently optimized for filtering on large blob columns. For example, if you have a dataset like `<image: binary, attr: int>`, running a query like `SELECT image FROM .. WHERE attr > 10 AND attr < 50` is much faster than Parquet, thanks to the scans it saves.
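For illustration, here is a rough sketch of what that kind of query could look like through the Python API, assuming a dataset with the `image`/`attr` schema above (the exact filter syntax may vary by version):

```python
import lance

# Hypothetical dataset with schema <image: binary, attr: int64>.
ds = lance.dataset("images.lance")

# Project only the column you need and push the predicate down; Lance can
# then avoid reading the large image blobs for rows that don't match.
table = ds.to_table(
    columns=["image"],
    filter="attr > 10 AND attr < 50",
)
print(table.num_rows)
```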
We have not yet optimized filtering / selecting small columns, as your benchmark shows. There are a few techniques we can use to speed this up:
- Better compression and encoding (#352 and other compression schemes that don't sacrifice random access)
- Row group level filtering (#11)
- Partition pruning (#458)
Since Parquet uses similar techniques, we expect Lance to eventually reach filter performance comparable to Parquet.
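As a point of reference for the stats-based pruning mentioned above, Parquet already stores per-row-group min/max statistics that readers use to skip data. A small sketch of inspecting them with pyarrow (placeholder file and column names):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("taxi.parquet")
col_idx = pf.schema_arrow.get_field_index("pickup_minute")

for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(col_idx).statistics
    # A reader evaluating `pickup_minute = 30` can skip any row group whose
    # [min, max] range does not contain 30. Statistics may be None if the
    # writer did not emit them.
    if stats is not None:
        print(rg, stats.min, stats.max)
```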
Hi @changhiskhan and @eddyxu ,
thank you for your reply.
For the code, you are right. I had changed the code without saving it. It should have been `.sink_parquet()` instead of `write_parquet()`. Now it's fixed.
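For context, the difference between the two Polars calls (assuming the notebook uses Polars, as the function names suggest): `write_parquet` is the eager API on a materialized `DataFrame`, while `sink_parquet` executes a `LazyFrame` and streams the result to disk. A minimal sketch:

```python
import polars as pl

# Eager: the whole DataFrame is held in memory before writing.
df = pl.DataFrame({"a": range(1_000_000)})
df.write_parquet("eager.parquet")

# Lazy/streaming: the query plan is executed and sunk straight to disk
# without collecting the full result first.
pl.scan_parquet("eager.parquet").filter(pl.col("a") > 10).sink_parquet("lazy.parquet")
```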
When you talk about large blob columns, I'm guessing you mean columns with high cardinality (a high number of possible values)? Indeed, in the example, the `pickup_minute` column only has 60 possible values.
So, for example, if I used it on latitude and longitude values (to filter for points within 100 metres of a given point), should I get better performance from Lance? (similar to the `attr` example)
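Something like this rough sketch (hypothetical `lat`/`lon` columns, approximating the 100 metre radius with a degree-based bounding box rather than a true distance filter; exact filter syntax may differ by version):

```python
import math
import lance

center_lat, center_lon = 40.7580, -73.9855  # example point
dlat = 100 / 111_000  # ~111 km per degree of latitude
dlon = 100 / (111_000 * math.cos(math.radians(center_lat)))

ds = lance.dataset("trips.lance")  # placeholder path
nearby = ds.to_table(
    columns=["lat", "lon"],
    filter=(
        f"lat > {center_lat - dlat} AND lat < {center_lat + dlat} AND "
        f"lon > {center_lon - dlon} AND lon < {center_lon + dlon}"
    ),
)
print(nearby.num_rows)
```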
Noted on the improvements you have planned for filtering. You are building an exciting project, so I'm trying to better understand what it can already do and what it will be able to do in the future.
Hey, @lucazanna , good questions.
By large blobs, we mean that a cell of data is likely larger than 1KB, e.g., an image, lidar points, or a big JSON blob string.
Keeping this open for us to re-run benchmarks once the partitioning and stats work is complete