Vectorized Reader In Parquet
Vectorized Query Execution can bring large performance improvements to SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, vectorized execution processes a batch of rows at once. Within one batch, each column is represented as a vector of a primitive data type. SQL engines can apply predicates to these vectors very efficiently, rather than pushing a single row through all the operators before the next row can be processed. Since Parquet is an efficient columnar data representation, it would be valuable for Parquet to support vectorized APIs, so that SQL engines can read vectors directly from Parquet files and run vectorized execution over the Parquet file format.
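As a rough sketch of the idea (all names here are illustrative, not part of any proposed API): one column of a batch is held as a primitive array, and a predicate runs over the whole vector in a tight loop instead of driving each row through the full operator tree.

```java
// Illustrative only: a batch of rows represented column-wise, with a
// predicate evaluated over the whole vector at once.
public class VectorizedFilterExample {
    public static void main(String[] args) {
        final int batchSize = 1024;
        long[] price = new long[batchSize];       // one column of the row batch
        boolean[] selected = new boolean[batchSize];
        for (int i = 0; i < batchSize; i++) {
            price[i] = i;                         // fill with sample data
        }
        // Predicate "price > 100" applied to the entire vector in one loop.
        for (int i = 0; i < batchSize; i++) {
            selected[i] = price[i] > 100;
        }
        System.out.println("first selected row: " + indexOfFirst(selected));
    }

    private static int indexOfFirst(boolean[] selected) {
        for (int i = 0; i < selected.length; i++) {
            if (selected[i]) return i;
        }
        return -1;
    }
}
```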
Detail proposal: https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30
Reporter: Zhenxiao Luo / @zhenxiao
Assignee: Zhenxiao Luo / @zhenxiao
Subtasks:
- [ ] [Vectorized Reader] Support Complex Types (Map, Array, Struct) in Parquet Vectorized Reader
- [ ] [Vectorized Reader] ColumnVector length should be in terms of rows, not DataPages
- [ ] [Vectorized Reader] Make sure all encodings work in Parquet Vectorized Reader
- [ ] [Vectorized Reader] Lazy Load in Vectorized Reader
- [ ] [Vectorized Reader] Lazy Decoding in Vectorized Reader
- [ ] [Vectorized Reader] Add Testcases/Benchmarks for ParquetVectorizedReader
- [ ] [Vectorized Reader] Add attributes in ColumnVector and RowBatch
Related issues:
- Improve Parquet Vectorization (is related to)
Note: This issue was originally created as PARQUET-131. Please see the migration documentation for further details.
Brock Noland / @brockn: Hi,
Thank you very much for creating this! I sincerely appreciate you taking the time to create this proposal!
From the Hive side, I have the following feedback:
My understanding is that ColumnVector is an interface so we can provide our own impl. This will be required for Hive since we have our own ColumnVector impl and it's extremely widely used. I don't think this version of the ColumnVector interface will provide pluggability for the following reasons:
- Impls e.g. LongVector have public members. This same thing was done in Hive (not using getters and setters) but IMO for dubious reasons. No proof was provided showing that the JIT does not optimize the getters and setters out.
- Drill, Hive, etc. will be required to extend LongVector in order to make this work, but that would require massive change on the Hive side. We should provide getters and setters on the interface for the data types so that Hive can simply implement the ColumnVector interface with our existing implementation. We might also need to provide isLongVector methods so we know the type of the ColumnVector.
- I don't understand why ColumnVector has a getEncoding. Isn't an encoding a storage feature, not a column vector feature?
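A minimal sketch of the kind of pluggable interface described here (a hypothetical shape, not the actual gist proposal): accessors instead of public fields, plus a type query, so an engine such as Hive could implement the interface on top of its existing vector classes.

```java
// Hypothetical sketch of a pluggable ColumnVector: getters/setters instead
// of public fields, and a type query as suggested in the comment above.
public interface ColumnVector {
    int size();                          // number of values in the vector
    boolean isLongVector();              // type query, so callers know the type
    long getLong(int index);             // primitive getter, no boxing
    void setLong(int index, long value); // primitive setter
}
```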
Zhenxiao Luo / @zhenxiao: @brockn Thanks a lot for the comment. I will update the ColumnVector interface with getters and setters, and getTypes, and leave all primitive-type vectors to implementations. Yes, we are keeping the Encoding information in ColumnVector so that Presto can do lazy materialization. Presto does not materialize the vector until it finishes the filter. It is OK not to use this Encoding information.
Jacques Nadeau / @jacques-n: Few thoughts:
- I agree with Brock's general comments about avoiding a Parquet-canonical representation of the in-memory data structure.
- For the getters/setters, we need to support both bulk transfer and primitive transfer.
- We should avoid copies unless necessary. For example, in Drill we often choose to avoid copying the variable-length data, instead choosing to use it as is.
- The interface should also take in a column-level filter expression evaluator. Again, this should be a no-copy interface. While you may think that with vectorized reads this isn't necessary, we've found that it actually depends entirely on the selectivity of the filter and whether you are using dictionary encoding.
I also would suggest that this be a replacement for the lower layers of the Parquet reader rather than a secondary path. Otherwise, we're always going to have a partial implementation. We're very engaged in trying to think through the ideas here and are definitely going to be pushing this along.
One last thought: I'm not entirely convinced that this should be a column-at-a-time interface. I've been thinking that a batch of records at a time is more appropriate. Otherwise, there are too many internal concerns that have to be reimplemented, and fancy inter-column behaviors have to be implemented multiple times (as well as complex data support). On the flip side, I'm not sure any other engines currently have vectorized readers for complex data, but we're more than happy to push in that direction alone, and people can fall back to a higher-level non-vectorized read interface for complex data.
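A sketch combining these points (a hypothetical interface, reusing the ColumnVector shape sketched above): the reader fills a batch of records at a time and accepts a column-level filter evaluator, so filtering can happen during the scan without forcing a copy.

```java
// Hypothetical sketch: a batch-at-a-time reader that takes a column-level
// filter evaluator, so the reader can filter during the scan (and exploit
// dictionary encoding) without an extra copy of the data.
public interface VectorizedBatchReader {
    /** Evaluated per value against possibly still-encoded column data. */
    interface ColumnFilterEvaluator {
        boolean matches(ColumnVector column, int row);
    }

    /**
     * Fills the reusable batch with up to maxRows records, applying the
     * filter as values are read; returns the number of surviving rows.
     */
    int readBatch(ColumnVector[] batch, int maxRows, ColumnFilterEvaluator filter);
}
```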
Zhenxiao Luo / @zhenxiao: @brockn The gist is updated with the ColumnVector interface. We are still discussing with the Drill team whether to use primitive arrays, ByteBuffer, or byte[] for the setters and getters. @jacques-n I just updated the gist with a ByteBuffer, hoping that Drill, Hive, and Presto could all use this kind of generalized ByteBuffer. I will spend time reading Drill's code to see what other magic is there. However, some relevant articles seem to show that ByteBuffer is not as efficient/fast as primitive arrays:
- http://www.evanjones.ca/software/java-bytebuffers.html
- https://groups.google.com/forum/#!topic/mechanical-sympathy/9I18sXm4bvY
- http://imranrashid.com/posts/profiling-bytebuffers/
I still think primitive arrays could be the most efficient way. Anyway, let's continue discussing it.
Dong Chen / @dongc: Hi @zhenxiao, @brockn, @jacques-n, I am working on HIVE-8128 and found this issue. Thank you for creating and discussing this. From the Hive perspective, I hope the feedback below helps.
- The current code implementation of vectorization in Hive for reference.
- In HIVE-4160, it mainly uses the data structures VectorizedRowBatch and ColumnVector to feed the vectorized SQL engine.
- VectorizedRowBatch has an array of ColumnVector to hold the data of each column, and an int size to indicate the number of rows in this batch.
- ColumnVector has some booleans like noNulls and isRepeating, which help the engine skip some data. Its subclasses representing concrete types (e.g. Long) hold an array of primitive data.
- To generate the VectorizedRowBatch, the reader (of an ORC file) was given a new method nextBatch(), which delegates each column to its type-suitable vectorized reader to load the data. This is similar to the VectorReader in Zhenxiao's design.
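For reference, a much-simplified sketch of the structures described above; the field names follow Hive's org.apache.hadoop.hive.ql.exec.vector classes, but this is an abbreviation, not the real code.

```java
// Much-simplified sketch of Hive's vectorization structures; the real
// classes live in org.apache.hadoop.hive.ql.exec.vector.
class ColumnVector {
    boolean noNulls;       // true if the vector contains no null values
    boolean isRepeating;   // true if every row holds the same value
    boolean[] isNull;      // per-row null flags, used when noNulls is false
}

class LongColumnVector extends ColumnVector {
    long[] vector;         // primitive data for a long-typed column
}

class VectorizedRowBatch {
    ColumnVector[] cols;   // one vector per column
    int size;              // number of rows in this batch
}
```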
- A few thoughts:
- I agree with Jacques's comment about building a batch of records at a time. Maybe a class ParquetRowBatch could be added to hold the columns. ColumnVector could have the boolean indicators for null or repeating values, since they are computed and set while extracting and building data from the storage layer. These values in the vector provide useful information to SQL engines.
- How about giving a length to the VectorReader? The SQL engine may want to specify the number of rows fetched in a batch.
- A rough idea (sketched below): add a readBatch() method in InternalParquetRecordReader<T>. When vector mode is on, the reader will invoke this method to get a ParquetRowBatch. SQL engines like Hive and Drill will convert this batch to the type they need. Primitive arrays in the vectors of the batch might make the conversion efficient. The conversion procedure reads values from the ParquetRowBatch and sets them on an XxxRowBatch object, which is sent to the SQL engine.
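A sketch of that rough idea (readBatch, ParquetRowBatch, and the vector-mode switch are hypothetical names taken from the comment above; ColumnVector is as sketched earlier in the thread):

```java
import java.io.IOException;

// Hypothetical sketch: a batch-returning method alongside the existing
// row-at-a-time API, with the batch length chosen by the SQL engine
// rather than fixed by Parquet.
class ParquetRowBatch {
    ColumnVector[] columns;  // vectors built while extracting from storage
    int rowCount;            // rows actually loaded, may be less than maxRows
}

interface BatchedRecordReader {
    // Invoked instead of the row-at-a-time path when vector mode is on; the
    // engine then converts the batch to its own XxxRowBatch representation.
    ParquetRowBatch readBatch(int maxRows) throws IOException;
}
```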
I will keep going on this work to make things more detailed, and I will join the discussion.
Brock Noland / @brockn:
@dongc, I believe in Hive ColumnVector.isRepeating is used when it's a partition column. Is that your understanding?
Dong Chen / @dongc: Hi @brockn
> in Hive ColumnVector.isRepeating is used when it's a partition column

I think a partition column is one case, but it can also be used for normal columns, if the values in a column are all the same.
In Hive, when a VectorExpression in an Operator consumes the ColumnVector, it checks isRepeating first. If true, it skips the array loop and just fetches the first element in the vector for the computation.
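In code, the fast path looks roughly like this (a sketch of the pattern using the simplified LongColumnVector above, not Hive's actual VectorExpression code):

```java
class RepeatingFastPath {
    // Sketch of the isRepeating fast path: when every row in the vector
    // holds the same value, compute from the first element instead of
    // looping over the whole array.
    static long sumColumn(LongColumnVector col, int size) {
        if (col.isRepeating) {
            return col.vector[0] * (long) size;  // one value stands for all rows
        }
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += col.vector[i];
        }
        return sum;
    }
}
```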
Dong Chen / @dongc: Hi,
After digging into the code more, I have some thoughts based on the design proposal.
In order to describe these thoughts clearly, I uploaded a doc Parquet-Vectorized-APIs.pdf.
The general ideas are:
- Parquet's internal readers read one row at a time now. I think we don't have to add a new series of readers for vectorization. Maybe we could use the existing readers and just add methods like readBatch(T next, int size).
- ColumnReader.Binding is responsible for binding the low-level ValuesReader to the customized record Converter that materializes records. We can add new concrete Binding classes in Parquet and new customized Converter classes in SQL engines like Hive and Drill. Then the loaded raw primitive data can be materialized into records in the representation the SQL engine expects. This solution decouples Parquet's iterative raw-data reading from the SQL engine's vectorized record materialization: Parquet does not have to organize the primitive data by itself, it just loads the data iteratively for vectorization, and SQL engines can organize the data as they like. (A sketch follows.)
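A sketch of that decoupling (hypothetical names; Parquet's real classes are ColumnReader, ValuesReader, and the Converter hierarchy): Parquet keeps loading values iteratively, while an engine-supplied converter accumulates them into the engine's own vector representation.

```java
// Hypothetical sketch of the decoupling described above: Parquet pushes raw
// decoded values into an engine-supplied converter, and the converter
// organizes them into the engine's own vector representation.
interface VectorConverter {
    void addLong(long value);  // called once per decoded value
    void endBatch(int rows);   // batch boundary: hand the vector to the engine
}

// Example engine-side implementation backed by a primitive array, which is
// roughly what a Hive-style long vector expects.
class EngineLongVectorConverter implements VectorConverter {
    private final long[] vector;
    private int next;

    EngineLongVectorConverter(int capacity) {
        vector = new long[capacity];
    }

    public void addLong(long value) {
        vector[next++] = value;
    }

    public void endBatch(int rows) {
        next = 0;  // a real implementation would ship `vector` to the engine
    }
}
```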
Zhenxiao Luo / @zhenxiao: Briefly describe how Presto works with Parquet
Brock Noland / @brockn: Hello @dongc,
Today the informal elements of the "parquet vectorization" team in the US met. This included myself, Zhenxiao, Daniel, and Eva for PrestoDB, Parth and Jason for Drill, and @spena and myself for Hive. I of course thought to invite you, but the rest of the team wanted an on-site meeting, and I know it's very late in China...
Questions
Q: Why does the Presto read API specify ColumnVector? Does it read one column at a time?
A: Presto has code which reads all columns in a loop, so it doesn't need the batch API.
Q: The original API specified the encoding; does the reader use the encoding to materialize?
A: ColumnVector will not expose Encoding and won't materialize values until a getter or initialize is called.
Q: Does PrestoDB use ByteBuffers or primitive arrays (long[], etc.)?
A: They use primitive arrays, like Hive. Drill uses native Buffers.
Q: If the API is not going to materialize and gives back a raw Buffer, is there any strategy for converting that to a long array without copying?
A: We'll pass in an allocator which allocates the appropriate Buffer type. Presto and Hive will allocate instances of, for example, LongBuffer, which gives us access to the primitive array.
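A sketch of that allocator idea (the BufferAllocator name is hypothetical): the engine supplies the allocator, so Presto and Hive can hand back heap buffers whose backing primitive array is directly accessible, while Drill can return its native buffers.

```java
import java.nio.LongBuffer;

// Hypothetical sketch of the allocator idea: the reader asks the allocator
// for buffers instead of choosing the buffer type itself.
interface BufferAllocator {
    LongBuffer allocateLongs(int capacity);
}

// Heap-based allocator as Presto/Hive might provide: LongBuffer.allocate
// returns a heap buffer, so array() exposes the long[] without copying.
class HeapLongAllocator implements BufferAllocator {
    public LongBuffer allocateLongs(int capacity) {
        return LongBuffer.allocate(capacity);
    }
}

class AllocatorDemo {
    public static void main(String[] args) {
        LongBuffer buf = new HeapLongAllocator().allocateLongs(1024);
        long[] raw = buf.array();  // direct access to the primitive array
        raw[0] = 42L;
        System.out.println(buf.get(0));  // prints 42
    }
}
```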
Next Steps
- Update interface to remove Encoding as getters will materialize
- Add allocator interface
- Netflix will hack together POC (Drill and Hive might do POC on top of this POC)
- GSOC byte buffer patch is a requirement, thus we should merge soon.
- Finish implementation of Parquet Vector* classes (part of POC)
- Finish Drill, Presto and Hive implementation
@danielcweeks - in the meeting it was said that merging the GSOC buffer patch (PR 49?) depended on doing some Parquet releases, such as mr 1.6/1.7 and format 2.0. I chatted with @rdblue and he wasn't sure why that would be.
Brock Noland / @brockn: @danielcweeks actually looks like PR 6. Any thoughts on why you feel releases are required?
Dong Chen / @dongc: Thanks @brockn, PrestoDB team, and Drill team for the progress and plan! Adding the allocator interface is a good idea. Looking forward to the POC. And hope I could help on Hive part then.
Hyunsik Choi: Hi guys,
I'm a Tajo (tajo.apache.org) guy. We would also like to participate in this work. Is the POC the current state of progress? We'll share our progress here too.
Zhenxiao Luo / @zhenxiao: @jaltekruse @parthchandra Do you have the allocator interface in one of your PRs? May I get a reference to the PR? Thanks.
Parth Chandra / @parthchandra: The pull request is here : https://github.com/apache/incubator-parquet-mr/pull/50
Nezih Yigitbasi / @nezihyigitbasi: Hi all, some time ago I sent a message to the Parquet dev mailing list about our efforts regarding vector support. I want to share it here as well in case some of you missed it. Even though it's still an early work in progress, any feedback is welcome: https://github.com/zhenxiao/incubator-parquet-mr/pull/1
Thanks
Dong Chen / @dongc: Hi @nezihyigitbasi, the work looks good! I built a Hive POC (HIVE-8128) based on it, and it worked. I left some feedback in the PR: https://github.com/zhenxiao/incubator-parquet-mr/pull/1
P.S. The code in this PR seems to have already been merged. Sorry if I left the message in the wrong place.
Nezih Yigitbasi / @nezihyigitbasi: Hi @dongc, thanks for the feedback and glad that it worked for your Hive POC.
Zhenxiao Luo / @zhenxiao: rebased vectorized parquet code against current master: https://github.com/zhenxiao/incubator-parquet-mr/tree/vector
Nezih Yigitbasi / @nezihyigitbasi: Created a PR for the initial implementation of the vectorized reader: https://github.com/apache/parquet-mr/pull/257
Nezih Yigitbasi / @nezihyigitbasi: @dongc It's still WIP and there is some work to get it merged.
Zoltan Ivanfi / @zivanfi: Hi,
I see that recently there has been no activity on this JIRA. I wonder whether that is due to a lack of time or interest, or whether a technical barrier stopped progress. Also, how do you perceive the state of the uncommitted pull requests? Are they worth revisiting, or should the feature be rebuilt from the ground up?
Thanks,
Zoltan
Flavio Pompermaier: Any news on this?