parquet-java
PARQUET-131: Vectorized reader initial implementation
This PR implements an initial version of a vectorized reader and adds various benchmarks (VectorizedReadBenchmarks) and unit tests (ParquetVectorReaderTestSuite) for that.
Two new public methods are added to the ParquetReader and ParquetRecordReader classes:
- `RowBatch nextBatch(RowBatch previous, Class<T> clazz)`: used for reading complex types/arbitrary objects; calls the converters to materialize the records.
- `RowBatch nextBatch(RowBatch previous)`: used for reading primitive types; does not call the converters at all.
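A rough sketch of the read loop these methods enable, passing the previous batch back in so the reader can reuse it instead of allocating a new one. `RowBatch` and `ToyReader` below are simplified stand-ins for illustration, not the actual classes in this PR:

```java
public class BatchReadSketch {
  // Simplified stand-in for the RowBatch added by this PR: a reusable
  // container that the reader refills on every call.
  static class RowBatch {
    int[] values = new int[4]; // column vector for one INT32 column
    int size;                  // number of valid rows in this batch
  }

  // Toy "reader" over an int array, mimicking nextBatch(RowBatch previous):
  // it refills the previous batch rather than allocating a new one.
  static class ToyReader {
    private final int[] data;
    private int pos;
    ToyReader(int[] data) { this.data = data; }

    RowBatch nextBatch(RowBatch previous) {
      if (pos >= data.length) return null; // end of input
      RowBatch batch = (previous != null) ? previous : new RowBatch();
      batch.size = Math.min(batch.values.length, data.length - pos);
      System.arraycopy(data, pos, batch.values, 0, batch.size);
      pos += batch.size;
      return batch;
    }
  }

  // Consume all batches; only the first `size` slots of each batch are valid.
  static long sumAll(int[] data) {
    ToyReader reader = new ToyReader(data);
    long sum = 0;
    RowBatch batch = null;
    while ((batch = reader.nextBatch(batch)) != null) {
      for (int i = 0; i < batch.size; i++) sum += batch.values[i];
    }
    return sum;
  }
}
```

Note that a stale batch can contain leftover values past `size`, so callers must always bound their loops by `batch.size`.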
@rdblue and @julienledem can you please review?
Ack: Thank you @zhenxiao and @dongche for your contributions.
Let's generate the test files in the tests rather than checking in binary files: parquet-hadoop/src/test/resources/map_string_string.parquet etc.
I think we should separate the vectorized reader and the record reader a little more. The RecordReader seems to have too many responsibilities right now.
Thanks for the contribution, see my comments above.
@nezihyigitbasi I'm also very interested in this one, thanks for the hard work! Can you share some numbers from the benchmark? And how much work is left in order to use the vectorized interface in ParquetInputFormat (for getting the performance improvements out of the box)?
@davies in my microbenchmarks I saw a ~2x improvement for primitive types (since for primitives the current implementation bypasses the converters), while the performance was roughly the same for complex types. As for the readiness of the implementation, we have been using it in production at Netflix (with Presto) for some time. And regarding the readiness of this PR, I am still working on addressing the comments.
@julienledem Thanks for the comments, I have made a first pass over them. Regarding reducing the responsibility of RecordReader, I moved the vector-related interface to a new interface (VectorizedRecordReader, see e3feae1) and created a new VectorizedRecordReaderImplementation class that implements it, but I still keep the implementation in RecordReaderImplementation as I didn't want VectorizedRecordReaderImplementation to replicate the existing state management logic.
If we really want to get rid of any vector logic in RecordReaderImplementation, one option is to replicate the class hierarchy of the non-vector reads for the vector reads (have a Vectorized* for every interface/class such as record reader, record reader impl, parquet reader, etc.). But, this option may perform worse for primitive types if we simply wrap RecordReaderImplementation::read() in the vectorized readers (currently for primitives we don't simply wrap the read() method, instead we read the primitives directly using the column readers). Do you have any suggestions?
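To make the trade-off concrete, here is a hypothetical sketch of the "simply wrap `read()`" option described above. All names are illustrative, not the actual classes in this PR; the point is that filling a batch by calling `read()` once per record keeps the per-record converter cost, which is exactly what the direct column-reader path avoids for primitives:

```java
public class WrapSketch {
  // Minimal stand-in for the record-at-a-time reader interface.
  interface RecordReader<T> {
    boolean hasNext();
    T read();
  }

  // Naive vectorized wrapper: fills a batch by delegating to read() once
  // per record, so every record still pays the full materialization cost.
  static <T> int fillBatch(RecordReader<T> reader, Object[] batch) {
    int n = 0;
    while (n < batch.length && reader.hasNext()) {
      batch[n++] = reader.read();
    }
    return n; // number of records placed in the batch
  }

  // Tiny in-memory reader producing 0, 1, ..., limit-1, for demonstration.
  static RecordReader<Integer> rangeReader(int limit) {
    return new RecordReader<Integer>() {
      int next = 0;
      public boolean hasNext() { return next < limit; }
      public Integer read() { return next++; }
    };
  }
}
```

A true vectorized primitive path would instead copy values straight from the column readers into a typed array, skipping the per-record `read()` call entirely.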
@nezihyigitbasi That's great, thanks!
@julienledem I have addressed most of your comments and fixed the unit tests. Can you take another look?
Thanks for all the work on this @nezihyigitbasi!
I took a close look at this yesterday and I think there are a couple of things we should change. The VectorizedReader and supporting classes, like RowBatch, mix two modes of operation that I think we should keep separate: accessing the underlying columns in vectors and deserializing and accessing converted records in batches. I don't see a good reason to mix the two together and it complicates the APIs: what happens when the vectorized and non-vectorized calls are mixed? What about when batch object method calls are mixed with column-access calls?
Separating those two would be a good thing for the API. For example, it is awkward to use RowBatch to hold a batch of objects because the type information is lost. Similarly, I think the right level for accessing vectors of column data is the ColumnReadStore rather than the high-level readers. What about adding a ColumnReadStore#nextVectorBatch() method or similar?
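As a sketch of what column-level vector access might return, the following shows a hypothetical typed column vector (raw values plus a null mask) and how it could be filled from definition levels without going through the record readers. Names and the `maxDefLevel` convention here are assumptions for illustration, not the actual Parquet API:

```java
public class ColumnVectorSketch {
  // Hypothetical typed column vector: raw values plus a null mask,
  // refilled once per batch. Not an actual Parquet class.
  static class IntColumnVector {
    final int[] values;
    final boolean[] isNull;
    int size;
    IntColumnVector(int capacity) {
      values = new int[capacity];
      isNull = new boolean[capacity];
    }
  }

  // Fill the vector from definition levels, as a converter-free column read
  // might: a definition level below maxDefLevel marks a null at that slot,
  // and non-null values are consumed in order from the data page.
  static void fill(IntColumnVector vec, int[] defLevels,
                   int[] nonNullValues, int maxDefLevel) {
    int valueIdx = 0;
    vec.size = defLevels.length;
    for (int i = 0; i < defLevels.length; i++) {
      if (defLevels[i] < maxDefLevel) {
        vec.isNull[i] = true;
      } else {
        vec.isNull[i] = false;
        vec.values[i] = nonNullValues[valueIdx++];
      }
    }
  }
}
```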
How does this handle nesting? It looks like in RecordReaderImplementation, the definition level is checked so you get a vector of nulls and values in the right places, but the repetition level info seems to be ignored. I think that means this would only work with flat columns.
I think we could easily add repetition info to the vector data. Reassembly would be possible if the total number of records in a vector were counted by the number of root records. That would make vectors more complicated: we would have to either cut off a vector at fewer than 1024 root records or use more batches as needed to fit all of the data. Still, I think we should account for nested records.
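The root-record counting above is simple if the vector carries repetition levels: a slot with repetition level 0 starts a new top-level record, so cutting a vector at N records means stopping just before the (N+1)-th zero. A minimal illustrative sketch (not code from this PR):

```java
public class RepetitionSketch {
  // A repetition level of 0 marks the start of a new top-level record,
  // so counting zeros counts root records in the vector.
  static int countRootRecords(int[] repetitionLevels) {
    int count = 0;
    for (int r : repetitionLevels) {
      if (r == 0) count++;
    }
    return count;
  }

  // Index of the slot starting record number `record` (0-based), or -1 if
  // the vector holds fewer records: this is where a vector could be cut so
  // that no record is split across two vectors.
  static int recordStart(int[] repetitionLevels, int record) {
    int seen = 0;
    for (int i = 0; i < repetitionLevels.length; i++) {
      if (repetitionLevels[i] == 0 && seen++ == record) return i;
    }
    return -1;
  }
}
```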