
Filtering file: Parquet and Feather twice as fast


I'm looking forward to moving some workloads to Lance, so I ran a test of its filtering performance. (For retrieving data by row number, Lance is far better than Parquet or Feather.)

When filtering a dataset, I got this performance:

  • Lance avg processing time 33 seconds
  • Feather avg processing time 14 seconds
  • Parquet avg processing time 19 seconds

Is it expected that the other formats are about twice as fast for filtering?

Here is the benchmark code on Google Colab: https://colab.research.google.com/drive/1iG1MXJV-9hrqm4YcO_MzR0Qt67sMFfEj?usp=sharing

It's benchmark 2 in the notebook.

lucazanna avatar Mar 28 '23 21:03 lucazanna

Thanks Luca. This is very helpful for getting concrete numbers here.

We haven't implemented stats-based pruning in Lance yet, which would make the filtering path faster. Also, what's the result-set size after filtering? If you're looking for, say, 10 rows out of 1M, Lance will let you read much less data. But if you're retrieving 100K out of 1M, that's much closer to a regular scan.
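Roughly what I mean, as a sketch (this assumes the pylance `lance.dataset()` / `to_table(filter=...)` API; the paths are placeholders and `pickup_minute` is borrowed from your benchmark):

```python
import lance
import pyarrow.dataset as pads

# Highly selective filter: only the matching rows need to be materialized,
# so the gap between "rows returned" and "rows scanned" matters a lot.
lance_ds = lance.dataset("taxi.lance")  # placeholder path
small_result = lance_ds.to_table(filter="pickup_minute = 7")

# Same predicate through pyarrow's Parquet reader for comparison.
pq_ds = pads.dataset("taxi.parquet", format="parquet")
small_result_pq = pq_ds.to_table(filter=pads.field("pickup_minute") == 7)
```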

Lastly, not sure if there's a version mismatch, but when I tried to run the Colab notebook I got an error: [screenshot attached]

changhiskhan avatar Mar 28 '23 21:03 changhiskhan

Thanks @lucazanna !

Lance is currently optimized for filtering on large blob columns. For example, with a dataset like <image: binary, attr: int>, a query like SELECT image FROM .. WHERE attr > 10 AND attr < 50 is much faster than Parquet because of the scans it saves on the blob column.
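A rough sketch of that access pattern with the Python API (the dataset path and schema here are made up for illustration; `to_table()` with a column projection and a filter string is what I'm assuming):

```python
import lance

# <image: binary, attr: int> style dataset; "images.lance" is a placeholder.
dataset = lance.dataset("images.lance")

# Project only `image` and push the predicate on `attr` down, so the large
# binary column is read only for rows that pass the filter.
table = dataset.to_table(
    columns=["image"],
    filter="attr > 10 AND attr < 50",
)
```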

We have not optimized filtering / selecting small columns yet, as shown in your benchmark. There are a few techniques we can use to speed it up:

  1. Better compression and encodings (#352 and some other compressions that don't sacrifice random access)
  2. Row group level filtering (#11)
  3. Partition pruning (#458)

Since Parquet uses similar techniques, we expect Lance to eventually reach similar filter performance to Parquet.
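For reference, this is roughly the per-row-group metadata a Parquet reader consults when it prunes; a small pyarrow sketch (the file name is a placeholder):

```python
import pyarrow.parquet as pq

# Parquet stores per-row-group min/max statistics, so a reader can skip
# any row group whose [min, max] range cannot satisfy the predicate.
meta = pq.ParquetFile("data.parquet").metadata
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(0).statistics
    if stats is not None:
        print(i, stats.min, stats.max)
```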

eddyxu avatar Mar 28 '23 21:03 eddyxu

Hi @changhiskhan and @eddyxu ,

thank you for your reply.

For the code, you are right. I had changed the code without saving it. It should have been .sink_parquet() instead of .write_parquet(). Now it's fixed.
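For anyone reading along, the difference between the two Polars calls is roughly this (just a sketch; the file names and the filter column are placeholders):

```python
import polars as pl

# Lazy/streaming: scan, filter, and stream the result straight to Parquet
# without materializing the full DataFrame in memory.
pl.scan_csv("trips.csv").filter(pl.col("fare_amount") > 0).sink_parquet("trips.parquet")

# Eager: the whole DataFrame sits in memory before being written out.
pl.read_csv("trips.csv").write_parquet("trips_eager.parquet")
```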

When you talk about large blob columns, I'm guessing you mean columns with high cardinality (a high number of possible values)? Indeed, in the example the pickup_minute column only has 60 possible values.

So, for example, if I used it on latitude and longitude values (to filter rows within 100 metres of a given point), should I expect better performance from Lance? (similar to the attr example)
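Concretely, something like this is what I have in mind (just a sketch; the dataset path and the pickup_latitude / pickup_longitude column names are assumptions, and the degree offset is only a rough stand-in for 100 metres):

```python
import lance

LAT, LON = 40.7580, -73.9855   # example point
DELTA = 0.001                  # ~100 m in latitude; longitude scale varies with latitude

ds = lance.dataset("trips.lance")  # placeholder path
nearby = ds.to_table(
    filter=(
        f"pickup_latitude > {LAT - DELTA} AND pickup_latitude < {LAT + DELTA} AND "
        f"pickup_longitude > {LON - DELTA} AND pickup_longitude < {LON + DELTA}"
    )
)
```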

Noted on the improvements you have planned for filtering. You're building an exciting project, so I'm trying to better understand what it can already do and what it will be able to do in the future.

lucazanna avatar Mar 29 '23 14:03 lucazanna

Hey, @lucazanna , good questions.

By large blobs, we mean that a single cell of data is likely larger than 1 KB, e.g., an image, lidar points, or a big JSON blob string.

eddyxu avatar Mar 29 '23 23:03 eddyxu

Keeping this open so we can re-run the benchmarks once the partitioning and stats work is complete.

changhiskhan avatar Jul 02 '23 22:07 changhiskhan