[HUDI-3742] Enable parquet enableVectorizedReader for spark inc query to improve performance
Tips
- Thank you very much for contributing to Apache Hudi.
- Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
- Modify AnnotationLocation checkstyle rule in checkstyle.xml
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
- Added integration tests for end-to-end.
- Added HoodieClientWriteTest to verify the change.
- Manually verified the change by running a job locally.
Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
LGTM
@xushiyan @alexeykudinkin @nsivabalan Please help me review this PR, thanks. For incremental reads of parquet files, we can achieve a ~2x speedup.
IIRC the vectorized reader was disabled for a reason. somehow the filtering was depends on the regular parquet reader.
@garyli1019 thanks for your comments. Yes, I know the reason. So in this PR, I generate the code for the filter directly, read the data with the vectorized reader, and apply the filter right after reading.
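The idea described in that comment can be sketched roughly as follows. This is an illustrative sketch, not Hudi's actual code: `Row`, `incrementalFilter`, and `readAndFilter` are hypothetical stand-ins for the generated filter predicate and a batch produced by the vectorized reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PostReadFilter {
    // Minimal stand-in for a row produced by the vectorized reader.
    static class Row {
        final String commitTime;
        final long id;
        Row(String commitTime, long id) {
            this.commitTime = commitTime;
            this.id = id;
        }
    }

    // "Generated" incremental predicate: keep rows with commitTime in (begin, end].
    static Predicate<Row> incrementalFilter(String begin, String end) {
        return r -> r.commitTime.compareTo(begin) > 0
                 && r.commitTime.compareTo(end) <= 0;
    }

    // The filter runs AFTER the batch has been read, instead of relying on
    // the regular (non-vectorized) parquet reader's row-level filtering.
    static List<Row> readAndFilter(List<Row> batch, Predicate<Row> pred) {
        List<Row> kept = new ArrayList<>();
        for (Row r : batch) {
            if (pred.test(r)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> batch = List.of(
            new Row("20220101", 1L),
            new Row("20220301", 2L),
            new Row("20220501", 3L));
        List<Row> result = readAndFilter(batch, incrementalFilter("20220201", "20220401"));
        System.out.println(result.size()); // prints 1: only id=2 is in the window
    }
}
```

The design point is simply that filtering is decoupled from the reader, so the faster vectorized path can be used without losing the incremental commit-time semantics.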
@xiarixiaoyao can you post any test results for this effort.
@xiarixiaoyao Main contention here is to introduce a config and keep existing behavior. wdyt?
Sorry for the late reply, let me do it now
@vinothchandar @alexeykudinkin @garyli1019 Addressed all comments and added a config to keep the original logic.

Benchmark setup:
- original table: 10 million rows, 10 columns
- insert 1 million rows
- incremental query starting from commit time 0

Benchmark result:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_201-b09 on Windows 10 10.0
Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
perf mor incr read:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------
vectorized disable              1730          1796         94        5.8        173.0       1.0X
vectorized enable                739           769         26       13.5         73.9       2.3X
```
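For context, a best/avg-time measurement loop in the spirit of the table above could look like the following sketch. This is not Hudi's actual benchmark harness; `measure` is a hypothetical helper, and real benchmarks would also need warm-up iterations and stdev tracking.

```java
public class MicroBench {
    // Runs the body `iters` times; returns {bestMs, avgMs}.
    static long[] measure(int iters, Runnable body) {
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < iters; i++) {
            long t0 = System.nanoTime();
            body.run();
            long ms = (System.nanoTime() - t0) / 1_000_000;
            best = Math.min(best, ms);
            total += ms;
        }
        return new long[] { best, total / iters };
    }

    public static void main(String[] args) {
        // Dummy workload standing in for one incremental-query run.
        long[] r = measure(5, () -> {
            long s = 0;
            for (int i = 0; i < 1_000_000; i++) s += i;
        });
        System.out.println("best=" + r[0] + "ms avg=" + r[1] + "ms");
    }
}
```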
Do we need to add the benchmark code to this PR? If needed, I will add it.
@xiarixiaoyao can you please rebase this one?
Also please fill out the description to explain the context.
@alexeykudinkin sorry, I missed your message. I will update the code and add the benchmark result for it.
@xiarixiaoyao : gentle reminder.
@hudi-bot run azure
@hudi-bot run azure
@alexeykudinkin @nsivabalan Rebased the code and added a parameter to control this behavior, which is enabled by default.
@hudi-bot run azure
CI report:
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure`: re-runs the last Azure build
@xiarixiaoyao @alexeykudinkin why do we need to disable enableVectorizedReader in MOR incremental queries?
@garyli1019 has explained the problem: IIRC the vectorized reader was disabled for a reason; somehow the filtering depended on the regular parquet reader.
Personally, I think we should enable vectorization by default.
+1. Users do not know when to enable vectorization, so providing a config to control this is meaningless.
@xiarixiaoyao: Can you address the comments in the PR? @garyli1019: Any other concerns about having vectorization for incremental queries on MOR (with the default turned off)?
@bvaradar Only a subset of columns is returned by the vectorized reader, but merging with the log files might need columns not in the SELECT statement, or columns Hudi itself requires, such as the precombine field or the primary key. How do we handle this case?
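One conceivable way to handle this, sketched here purely as an assumption rather than what the PR does: widen the requested read schema with the mandatory merge columns before enabling vectorization. `widenReadSchema` is a hypothetical helper; `_hoodie_record_key` is Hudi's record-key meta column, and `ts` stands in for an assumed precombine field.

```java
import java.util.ArrayList;
import java.util.List;

public class MandatoryColumns {
    // Appends any mandatory column missing from the requested projection,
    // preserving the order of the user's SELECT columns.
    static List<String> widenReadSchema(List<String> requested, List<String> mandatory) {
        List<String> out = new ArrayList<>(requested);
        for (String col : mandatory) {
            if (!out.contains(col)) {
                out.add(col);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = widenReadSchema(
            List.of("id", "value"),                    // user's SELECT columns
            List.of("_hoodie_record_key", "ts"));      // key + precombine for the merge
        System.out.println(schema); // [id, value, _hoodie_record_key, ts]
    }
}
```

The trade-off is that the vectorized reader then decodes extra columns the user never asked for, which eats into the speedup; the projection would need to be re-pruned after the merge.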
Good point. It doesn't look like the place where we turn on Spark vectorized reading has knowledge of the read schema, so we can't safely enable vectorization for certain MOR cases. @xiarixiaoyao: Thoughts?
Closing this PR as we no longer need it.