[HUDI-3742] Enable parquet enableVectorizedReader for spark inc query to improve performance
Tips
- Thank you very much for contributing to Apache Hudi.
- Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.
What is the purpose of the pull request
(For example: This pull request adds quick-start document.)
Brief change log
(for example:)
- Modify AnnotationLocation checkstyle rule in checkstyle.xml
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
- Added integration tests for end-to-end.
- Added HoodieClientWriteTest to verify the change.
- Manually verified the change by running a job locally.
Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
LGTM
@xushiyan @alexeykudinkin @nsivabalan Please help me review this PR, thanks. For incremental reads of parquet files, we can achieve a ~2x speedup.
IIRC the vectorized reader was disabled for a reason. somehow the filtering was depends on the regular parquet reader.
@garyli1019 thanks for your comments. Yes, I know the reason. So in this PR, I generate the code for the filter directly, read the data with the vectorized reader, and apply the filter right after reading.
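The idea described in that comment can be sketched roughly as follows. This is an illustrative sketch, not Hudi's actual code: `Row`, `incrementalFilter`, and `readAndFilter` are hypothetical stand-ins for the generated filter predicate and a batch produced by the vectorized reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PostReadFilter {
    // Minimal stand-in for a row produced by the vectorized reader.
    static class Row {
        final String commitTime;
        final long id;
        Row(String commitTime, long id) {
            this.commitTime = commitTime;
            this.id = id;
        }
    }

    // "Generated" incremental predicate: keep rows with commitTime in (begin, end].
    static Predicate<Row> incrementalFilter(String begin, String end) {
        return r -> r.commitTime.compareTo(begin) > 0
                 && r.commitTime.compareTo(end) <= 0;
    }

    // The filter runs AFTER the batch has been read, instead of relying on
    // the regular (non-vectorized) parquet reader's row-level filtering.
    static List<Row> readAndFilter(List<Row> batch, Predicate<Row> pred) {
        List<Row> kept = new ArrayList<>();
        for (Row r : batch) {
            if (pred.test(r)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> batch = List.of(
            new Row("20220101", 1L),
            new Row("20220301", 2L),
            new Row("20220501", 3L));
        List<Row> result = readAndFilter(batch, incrementalFilter("20220201", "20220401"));
        System.out.println(result.size()); // prints 1: only id=2 is in the window
    }
}
```

The design point is simply that filtering is decoupled from the reader, so the faster vectorized path can be used without losing the incremental commit-time semantics.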
@xiarixiaoyao can you post any test results for this effort.
@xiarixiaoyao Main contention here is to introduce a config and keep existing behavior. wdyt?
Sorry for the late reply, let me do it now
@vinothchandar @alexeykudinkin @garyli1019 Addressed all comments and added a config to keep the original logic.

Benchmark setup:
- original table: 10 million rows, 10 columns
- insert 1 million rows
- incremental query starting from commit time 0

Benchmark result:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_201-b09 on Windows 10 10.0
Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
perf mor incr read:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------
vectorized disable              1730          1796         94        5.8        173.0       1.0X
vectorized enable                739           769         26       13.5         73.9       2.3X
```
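For context, a best/avg-time measurement loop in the spirit of the table above could look like the following sketch. This is not Hudi's actual benchmark harness; `measure` is a hypothetical helper, and real benchmarks would also need warm-up iterations and stdev tracking.

```java
public class MicroBench {
    // Runs the body `iters` times; returns {bestMs, avgMs}.
    static long[] measure(int iters, Runnable body) {
        long best = Long.MAX_VALUE;
        long total = 0;
        for (int i = 0; i < iters; i++) {
            long t0 = System.nanoTime();
            body.run();
            long ms = (System.nanoTime() - t0) / 1_000_000;
            best = Math.min(best, ms);
            total += ms;
        }
        return new long[] { best, total / iters };
    }

    public static void main(String[] args) {
        // Dummy workload standing in for one incremental-query run.
        long[] r = measure(5, () -> {
            long s = 0;
            for (int i = 0; i < 1_000_000; i++) s += i;
        });
        System.out.println("best=" + r[0] + "ms avg=" + r[1] + "ms");
    }
}
```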
Do we need to add the benchmark code to this PR? If needed, I will add it.
@xiarixiaoyao can you please rebase this one?
Also please fill out the description to explain the context.
@alexeykudinkin sorry, I missed your message. I will update the code and add the benchmark result for it.
@xiarixiaoyao : gentle reminder.
@hudi-bot run azure
@hudi-bot run azure
@alexeykudinkin @nsivabalan Rebased the code and added a parameter to control this behavior, which is enabled by default.
@hudi-bot run azure
CI report:
Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure`: re-runs the last Azure build
@xiarixiaoyao @alexeykudinkin why do we need to disable enableVectorizedReader in MOR incremental queries?
@garyli1019 has explained the problem: IIRC the vectorized reader was disabled for a reason; somehow the filtering depended on the regular parquet reader.
Personally, I think we should enable vectorization by default.
+1. Users do not know when to enable vectorization, so providing a config to control this is meaningless.
@xiarixiaoyao: Can you address the comments in the PR? @garyli1019: Any other concerns about having vectorization for incremental queries on MOR (with the default turned off)?
@bvaradar Only a subset of columns is returned by the vectorized reader, but merging with the log files might need columns not in the SELECT statement, or columns Hudi itself requires, such as the precombine field or the primary key. How do we handle this case?
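One conceivable way to handle this, sketched here purely as an assumption rather than what the PR does: widen the requested read schema with the mandatory merge columns before enabling vectorization. `widenReadSchema` is a hypothetical helper; `_hoodie_record_key` is Hudi's record-key meta column, and `ts` stands in for an assumed precombine field.

```java
import java.util.ArrayList;
import java.util.List;

public class MandatoryColumns {
    // Appends any mandatory column missing from the requested projection,
    // preserving the order of the user's SELECT columns.
    static List<String> widenReadSchema(List<String> requested, List<String> mandatory) {
        List<String> out = new ArrayList<>(requested);
        for (String col : mandatory) {
            if (!out.contains(col)) {
                out.add(col);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = widenReadSchema(
            List.of("id", "value"),                    // user's SELECT columns
            List.of("_hoodie_record_key", "ts"));      // key + precombine for the merge
        System.out.println(schema); // [id, value, _hoodie_record_key, ts]
    }
}
```

The trade-off is that the vectorized reader then decodes extra columns the user never asked for, which eats into the speedup; the projection would need to be re-pruned after the merge.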
Good point. It doesn't look like the place where we turn on Spark vectorized reading has knowledge of the read schema, so we can't safely enable vectorization for certain MOR cases. @xiarixiaoyao: Thoughts?
Closing this PR as we no longer need it.