[Improve][Connector-V2][File] Speed up more than 2x for writing orc file.
Tested Job For MySQL2Hive
- MySQL source: 24 million rows
- env config (what we need is faster speed with fewer resources):

```
env {
  spark.app.name = "SeaTunnel Spark Job"
  spark.dynamicAllocation.enabled = false
  spark.executor.instances = 1
  spark.executor.cores = 1
  spark.executor.memory = "2g"
  spark.driver.memory = "1g"
  spark.dynamicAllocation.minExecutors = 1
  spark.executor.memoryOverhead = 1g
  spark.executor.heartbeatInterval = 60s
}
```
Before optimization: the job ran for 15 min.
We can see the sink writer is much slower than the source reader...
After optimization: the job runs in 3 min.
This is because we now write rows in batches instead of one row at a time, and flush the temp file as soon as it is fully written.
We can see the sink writer now not only keeps up with the source data, but also avoids executor task failures from out-of-memory when the job's resource config is tight (for the tested job we set only 2g for the executor and 1g for the driver).
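The batch-then-flush idea described above can be sketched as follows. This is a minimal illustration, not the actual SeaTunnel connector code: the class and method names are hypothetical, and the real connector would hand each batch to an ORC writer (e.g. via `addRowBatch`) instead of just counting flushes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batch-rows writing: buffer rows and perform one
// expensive write call per batch instead of one call per row.
public class BatchingSink {
    static final int BATCH_SIZE = 1024;

    private final List<String> buffer = new ArrayList<>(BATCH_SIZE);
    private int flushCount = 0;

    // Called once per source row.
    public void write(String row) {
        buffer.add(row);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // One I/O call per BATCH_SIZE rows; in the real connector this would be
    // something like orcWriter.addRowBatch(batch).
    private void flush() {
        flushCount++;
        buffer.clear();
    }

    // Flush the remainder and close the temp file as soon as writing is done.
    public int close() {
        if (!buffer.isEmpty()) {
            flush();
        }
        return flushCount;
    }

    public static void main(String[] args) {
        BatchingSink sink = new BatchingSink();
        for (int i = 0; i < 2500; i++) {
            sink.write("row-" + i);
        }
        // 2500 rows with a batch size of 1024 -> 3 flushes instead of 2500 writes
        System.out.println(sink.close());
    }
}
```

With 24 million rows, cutting the number of writer calls by a factor of the batch size is where the 15 min to 3 min speedup comes from.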
Purpose of this pull request
Check list
- [ ] Code changes are covered with tests, or tests are not needed because:
- [ ] If your PR adds any new Jar binary packages, please add a License Notice according to the New License Guide
- [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/incubator-seatunnel/tree/dev/docs
- [ ] If you are contributing the connector code, please check that the following files are updated:
- Update the change log in the connector document. For more details you can refer to connector-v2
- Update plugin-mapping.properties and add new connector information in it
- Update the pom file of seatunnel-dist
Please fix code style by this doc:
https://github.com/apache/incubator-seatunnel/blob/dev/docs/en/contribution/setup.md
OK, I have recommitted.
Please merge the latest code so CI can pass.
Sorry, I have some doubts about this. With non-partitioned data this solution works, but when writing partitioned data to ORC files, different rows may need to be written to different files, so the writer object changes and a single row batch no longer makes sense.
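One way the partition concern above could be handled (a sketch under assumptions, not the PR's actual implementation) is to keep a separate buffer per partition key, so batching still applies per writer even when rows fan out to different files. All names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: when rows go to different partition files, keep one
// buffer per partition key so each writer still receives batched rows.
public class PartitionedBatcher {
    static final int BATCH_SIZE = 1024;

    private final Map<String, List<String>> buffers = new HashMap<>();
    private final Map<String, Integer> flushes = new HashMap<>();

    public void write(String partition, String row) {
        List<String> buf = buffers.computeIfAbsent(partition, k -> new ArrayList<>());
        buf.add(row);
        if (buf.size() >= BATCH_SIZE) {
            flush(partition, buf);
        }
    }

    // Real code would call the ORC writer bound to this partition's file.
    private void flush(String partition, List<String> buf) {
        flushes.merge(partition, 1, Integer::sum);
        buf.clear();
    }

    // Flush every partition's remaining rows before closing the files.
    public Map<String, Integer> close() {
        buffers.forEach((p, buf) -> {
            if (!buf.isEmpty()) {
                flush(p, buf);
            }
        });
        return flushes;
    }
}
```

The trade-off is memory: each open partition holds up to one batch in memory, so a high-cardinality partition column would need a cap on concurrently buffered partitions.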