LazyVector support for Parquet Reader
Description
Currently the Parquet reader does not produce `LazyVector`, as `isTopLevel()` is always false for Parquet column readers, and `advanceFieldReader` is not implemented (this one may not be needed). This results in a large performance gap between the DWRF/ORC and Parquet readers. It would be nice if we got `LazyVector` working for the Parquet reader.
@Yuhta Thanks Jimmy! Yes, I've been thinking of doing that. Just curious, what is the perf gap between DWRF and Parquet like in your tests, and on what workload?
I have not measured, but `LazyVector` mostly benefits when you have a non-pushdown filter (i.e. a remaining expression) on some smaller key columns, with a very high filtering rate (>99.9%), and large lazy payload columns (nested row/array/map); then we can avoid reading the majority of the payload column content. A typical example is some deterministic random sampling (e.g. `where hash(id) % 1000 = 0`).
Got it, thanks @Yuhta! We plan to add it later this year.
@Yuhta Is there any documentation on what exactly `LazyVector` is? I wanted to understand how it decreases the data read.
I saw the comment here. From the comments it looks like, if there is filter pushdown in a scan, lazy loading can reduce some loading into memory. However, I am confused about how this works in the context of scans: does it mean that we read less data from the source, or that we read the same data from the source and just "load" it (convert from ORC/Parquet to the internal Velox format) lazily based on filters?
@jaystarshot In most cases we do the same amount of IO, but we save the decoding time and memory used to convert the file format into Velox vectors.
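
To make that concrete, here is a minimal, self-contained sketch of the deferred-decoding idea. It is not the actual Velox `LazyVector`/loader API; the `LazyColumn` class and its callback are hypothetical and only illustrate why the IO stays the same while the decoding cost scales with the rows that survive the filter.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a lazily materialized column: the encoded bytes
// are already in memory (IO already done), but decoding into values happens
// only when load() is called, and only for the requested rows.
class LazyColumn {
 public:
  using Loader =
      std::function<std::vector<int64_t>(const std::vector<int32_t>& rows)>;

  explicit LazyColumn(Loader loader) : loader_(std::move(loader)) {}

  // Decode just the surviving rows on first access.
  const std::vector<int64_t>& load(const std::vector<int32_t>& rows) {
    if (!loaded_) {
      values_ = loader_(rows);
      loaded_ = true;
    }
    return values_;
  }

 private:
  Loader loader_;
  bool loaded_ = false;
  std::vector<int64_t> values_;
};

int main() {
  // Pretend these are the encoded bytes of a large payload column.
  std::vector<int64_t> encodedPayload(1'000'000, 42);

  LazyColumn payload([&](const std::vector<int32_t>& rows) {
    std::vector<int64_t> out;
    out.reserve(rows.size());
    for (auto row : rows) {
      out.push_back(encodedPayload[row]);  // "decode" only these rows
    }
    return out;
  });

  // A highly selective filter on a small key column leaves only a few rows,
  // so only those rows of the payload column are ever decoded.
  std::vector<int32_t> survivingRows = {7, 1024, 999999};
  const auto& values = payload.load(survivingRows);
  std::cout << "decoded " << values.size() << " of " << encodedPayload.size()
            << " payload rows\n";
  return 0;
}
```

If the filter drops every row, `load()` is never called and the payload column is never decoded at all, which is where the savings described above come from.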
got it thanks!