
[HUDI-3384][HUDI-3385] Spark specific file reader/writer.

Open minihippo opened this issue 2 years ago • 15 comments

What is the purpose of the pull request

RFC-46: Spark-specific file reader/writer based on InternalRow

Brief change log

  • Add a Spark file reader for parquet/ORC/HFile
  • Add a Spark file writer for parquet/ORC/HFile
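
For orientation, here is a minimal, hedged sketch of what such engine-specific reader/writer interfaces could look like. The names are illustrative assumptions, not the PR's actual classes:

```java
import org.apache.spark.sql.catalyst.InternalRow;

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

// Illustrative only: a reader that yields Spark's InternalRow directly,
// skipping the Avro intermediate representation.
interface SparkFileReader extends Closeable {
  // Iterate over the records of one base file (parquet/ORC/HFile).
  Iterator<InternalRow> getRecordIterator() throws IOException;
}

// Illustrative only: a writer that accepts InternalRow without an
// InternalRow-to-Avro conversion step.
interface SparkFileWriter extends Closeable {
  // Write a single keyed record to the underlying base file.
  void writeRow(String recordKey, InternalRow row) throws IOException;
}
```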

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • [ ] Has a corresponding JIRA in PR title & commit

  • [ ] Commit message is descriptive of the change

  • [ ] CI is green

  • [ ] Necessary doc changes done or have another open PR

  • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

minihippo avatar May 19 '22 02:05 minihippo

A huge patch, may need some careful review :)

danny0405 avatar May 19 '22 12:05 danny0405

@hudi-bot run azure

wzx140 avatar Jun 14 '22 01:06 wzx140

@hudi-bot run azure

minihippo avatar Jun 14 '22 09:06 minihippo

@hudi-bot run azure

minihippo avatar Jun 14 '22 13:06 minihippo

@hudi-bot run azure

wzx140 avatar Jun 20 '22 08:06 wzx140

Is this PR going to be merged into 0.12? I feel this is too big; a refactoring of this magnitude would have a broad impact.

XuQianJin-Stars avatar Jul 14 '22 15:07 XuQianJin-Stars

@xushiyan @vinothchandar Please take a moment to look at this PR. I have revised it according to the feedback.

wzx140 avatar Jul 21 '22 07:07 wzx140

How much do we gain from this patch? I feel nervous about this change; let's pay more attention to stability first.

danny0405 avatar Jul 21 '22 09:07 danny0405

How much do we gain from this patch? I feel nervous about this change; let's pay more attention to stability first.

@danny0405 The current benchmark results for RFC-46:

  • No overhead is introduced for Avro log read/write with the new merger
  • Compared with Avro records when reading/writing parquet logs:
    • For reads, Spark records improve e2e duration by ~33%, computed as (before e2e duration − after e2e duration) / before e2e duration (see the worked example below).
    • For writes, Spark records improve by 7%–9%. In the stage that transforms the input into Hoodie records, the improvement is ~25%, but the shuffle data becomes bigger than before, which costs more in the following shuffle-read stage.

All results are based on Spark batch write/read.
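
As a worked example of the read-improvement formula above (the timings here are illustrative, not the benchmark's raw numbers):

```
improvement = (before e2e duration - after e2e duration) / before e2e duration
            = (120 s - 80 s) / 120 s
            ≈ 33%
```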

minihippo avatar Jul 21 '22 16:07 minihippo

I'm not very persuaded by the improvement numbers: 33% for reads and 9% for writes. If the numbers are real and reproducible, I would suggest lowering the priority of this patch, for example to after release 1.0.0.

I had expected about a 5x–10x performance improvement, BTW.

danny0405 avatar Jul 25 '22 06:07 danny0405

@danny0405 a few considerations we need to keep in mind here:

  1. RFC-46 is a stepping stone for transitioning from our current "modus operandi" with an intermediate representation (Avro) to a state where we rely entirely on engine-specific containers (Dataset/RDD for Spark, for example) as well as engine-specific data representations (InternalRow for Spark, for example). This change is a very critical first step in that direction of decoupling Hudi from Avro (a rough sketch of this record-level decoupling appears at the end of this comment).
  2. Given how dynamic our code-base is, we can't park this change for long. Even now, after 2 months of development, it's going to be a humongous effort to rebase it again onto the latest changes, given how much has landed in these 2 months.

While I understand that we all expect radical improvements, we need to keep in mind that those will come when we reach the final state.

P.S. Also, BTW, we won't see 5x improvements; it's going to be more like up to 2x in the best case, simply because Hudi is already pretty tight in terms of performance across the board.
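
To make point 1 above concrete, here is a hedged sketch of the record-level decoupling being described: one abstract record type with engine-specific subclasses, so Hudi's core can stay engine-neutral while each engine keeps its native data representation. Class and method names are illustrative assumptions, not the PR's exact API:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.StructType;

// Illustrative only: the core operates on an abstract record type...
abstract class EngineRecord<T> {
  protected final T data;
  protected EngineRecord(T data) { this.data = data; }
  // ...and each engine resolves fields against its own representation.
  abstract Object getColumnValue(String column);
}

// Avro-backed record: the current intermediate representation.
final class AvroEngineRecord extends EngineRecord<GenericRecord> {
  AvroEngineRecord(GenericRecord data) { super(data); }
  @Override Object getColumnValue(String column) { return data.get(column); }
}

// Spark-backed record: InternalRow, with no Avro conversion on the hot path.
final class SparkEngineRecord extends EngineRecord<InternalRow> {
  private final StructType schema;
  SparkEngineRecord(InternalRow data, StructType schema) {
    super(data);
    this.schema = schema;
  }
  @Override Object getColumnValue(String column) {
    int ordinal = schema.fieldIndex(column);
    return data.get(ordinal, schema.fields()[ordinal].dataType());
  }
}
```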

alexeykudinkin avatar Jul 25 '22 17:07 alexeykudinkin

This change is a very critical first step in that direction of decoupling Hudi from Avro

I agree we should decouple Hudi from Avro, but that does not mean we should fall back on engine-specific data structures, which are very hard to maintain for an engine-neutral project; see how hard it is for Hudi to integrate with a new engine now :). I kind of expect Hudi to have its own readers/writers/data structures, which is the right direction we should elaborate on.

Given how dynamic our code-base is

I know the development is hard because it is a core change that is very influential and cannot easily be landed in piecemeal changes. But that should not be decisive proof that we should push forward in this direction.

And another concern I always have in my mind is that Hudi badly needs a stable release! We cannot make huge changes to the core readers/writers at this moment, before we do enough testing/practice, and we should not rush in the code just to save rebasing effort.

danny0405 avatar Jul 26 '22 06:07 danny0405

@danny0405

I agree we should decouple Hudi from Avro, but that does not mean we should fall back on engine-specific data structures, which are very hard to maintain for an engine-neutral project; see how hard it is for Hudi to integrate with a new engine now :). I kind of expect Hudi to have its own readers/writers/data structures, which is the right direction we should elaborate on.

I don't think we are aligned on this one: Hudi is, and will stay, an engine-neutral project. However, for top-of-the-line performance on any engine (to stay competitive with other formats) we have to use engine-specific representations (think Row, RowData, ArrayWritable, Arrow, etc). There's just no other way: any intermediate representation will be a tax on performance, and the general direction is that we want to provide the best possible performance for any supported workload, be it a read or a write.

And another concern I always have in my mind is that Hudi badly needs a stable release! We cannot make huge changes to the core readers/writers at this moment, before we do enough testing/practice, and we should not rush in the code just to save rebasing effort.

Totally agree with you there, and it's one of the reasons why we decided that it's a good idea to take a more measured approach here and avoid pushing really hard (and compromising on quality testing) to meet 0.12 deadline.

alexeykudinkin avatar Jul 26 '22 20:07 alexeykudinkin

I don't think we are aligned on this one: Hudi is, and will stay, an engine-neutral project. However, for top-of-the-line performance on any engine (to stay competitive with other formats) we have to use engine-specific representations (think Row, RowData, ArrayWritable, Arrow, etc). There's just no other way

I do have some different thoughts. Performance is the first priority if it is critical, say if Hudi performs badly on some benchmark. But for the long run, I do think that as a storage layer we should have our own data structures/readers/writers/schema, like every storage engine does. Look at how easy it is for a new engine to integrate with Iceberg and how hard it is for Hudi; ease of integration is important for the ecosystem, especially for a format.

danny0405 avatar Aug 02 '22 06:08 danny0405

How much do we gain from this patch? I feel nervous about this change; let's pay more attention to stability first.

@danny0405 The current benchmark results for RFC-46:

  • No overhead is introduced for Avro log read/write with the new merger

  • Compared with Avro records when reading/writing parquet logs:

    • For reads, Spark records improve e2e duration by ~33%, computed as (before e2e duration − after e2e duration) / before e2e duration.
    • For writes, Spark records improve by 7%–9%. In the stage that transforms the input into Hoodie records, the improvement is ~25%, but the shuffle data becomes bigger than before, which costs more in the following shuffle-read stage.

All results are based on Spark batch write/read.

After adding a map type to the test dataset, we find that the e2e duration of writing improves by 27%.

  • With further analysis, internalRow2parquet improves by 40%, which matches the 50% CPU cost of avro2parquet observed in the flame graph.
  • The relationship between the performance improvement and the number of complex columns is not linear: even if we add more complex columns, the gain does not grow proportionally.

To reproduce the benchmark results, I added the simple benchmark to the test part.

minihippo avatar Aug 30 '22 15:08 minihippo

CI report:

  • d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN
  • 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN
  • e8888f85e9b3eefd14a098a1bc5b277fd5989ef8 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Sep 19 '22 21:09 hudi-bot

@alexeykudinkin I've already rebased on master and added the mergerStrategy config with a UUID. You can do the final review.
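
For readers following along, here is a minimal, hedged sketch of what wiring such a merger strategy through Spark write options could look like. The config keys and merger class below are assumptions for illustration (the PR's exact keys may differ), and the strategy value is a placeholder UUID, not the real one:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class MergerConfigExample {
  public static void write(Dataset<Row> df, String basePath) {
    df.write()
      .format("hudi")
      .option("hoodie.table.name", "example_table")
      // Assumed config keys (illustrative only):
      .option("hoodie.datasource.write.record.merger.impls",
              "org.apache.hudi.HoodieSparkRecordMerger")
      .option("hoodie.datasource.write.record.merger.strategy",
              "00000000-0000-0000-0000-000000000000") // placeholder UUID
      .mode(SaveMode.Append)
      .save(basePath);
  }
}
```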

wzx140 avatar Sep 20 '22 02:09 wzx140