LazyVector support for Parquet Reader
Description
Currently the Parquet reader does not produce `LazyVector`, as `isTopLevel()` is always false for Parquet column readers, and `advanceFieldReader` is not implemented (this one may not be needed). This results in a large performance gap between the DWRF/ORC and Parquet readers. It would be nice if we got `LazyVector` working for the Parquet reader.
@Yuhta Thanks Jimmy! Yes, I've been thinking of doing that. Just curious, what is the perf gap between DWRF and Parquet like in your tests, and on what workload?
I have not measured, but `LazyVector` mostly benefits when you have a non-pushdown filter (i.e. a remaining expression) on some smaller key columns, with a very high filtering rate (>99.9%), and large lazy payload columns (nested row/array/map); then we can avoid reading the majority of the payload column content. A typical example is some deterministic random sampling (e.g. `where hash(id) % 1000 = 0`).
Got it, thanks @Yuhta! We plan to add it later this year.
@Yuhta Is there any documentation on what exactly `LazyVector` is? I wanted to understand how it decreases the data read.
I saw the comment here. From the comments it looks like, if there is filter pushdown in a scan, lazy loading can reduce some loading into memory. However, I am confused about how this works in the context of scans: does it mean that we read less data from the source, or that we read the same data from the source and just "load" it (convert from ORC/Parquet to the internal Velox format) lazily based on filters?
@jaystarshot In most cases we do the same amount of IO, but we save the decoding time and memory used to convert the file format into Velox vectors.
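
To make that concrete, here is a minimal, self-contained sketch of the deferred-decoding idea. It is not the actual Velox `LazyVector`/loader API; the `LazyColumn` class and its callback are hypothetical and only illustrate why the IO stays the same while the decoding cost scales with the rows that survive the filter.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a lazily materialized column: the encoded bytes
// are already in memory (IO already done), but decoding into values happens
// only when load() is called, and only for the requested rows.
class LazyColumn {
 public:
  using Loader =
      std::function<std::vector<int64_t>(const std::vector<int32_t>& rows)>;

  explicit LazyColumn(Loader loader) : loader_(std::move(loader)) {}

  // Decode just the surviving rows on first access.
  const std::vector<int64_t>& load(const std::vector<int32_t>& rows) {
    if (!loaded_) {
      values_ = loader_(rows);
      loaded_ = true;
    }
    return values_;
  }

 private:
  Loader loader_;
  bool loaded_ = false;
  std::vector<int64_t> values_;
};

int main() {
  // Pretend these are the encoded bytes of a large payload column.
  std::vector<int64_t> encodedPayload(1'000'000, 42);

  LazyColumn payload([&](const std::vector<int32_t>& rows) {
    std::vector<int64_t> out;
    out.reserve(rows.size());
    for (auto row : rows) {
      out.push_back(encodedPayload[row]);  // "decode" only these rows
    }
    return out;
  });

  // A highly selective filter on a small key column leaves only a few rows,
  // so only those rows of the payload column are ever decoded.
  std::vector<int32_t> survivingRows = {7, 1024, 999999};
  const auto& values = payload.load(survivingRows);
  std::cout << "decoded " << values.size() << " of " << encodedPayload.size()
            << " payload rows\n";
  return 0;
}
```

If the filter drops every row, `load()` is never called and the payload column is never decoded at all, which is where the savings described above come from.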
got it thanks!