[Improve][Connector-V2][File] Speed up more than 2x for writing orc file.
Tested Job For MySQL2Hive
- MySQL source: 24 million rows
- env config (what we need is faster speed with fewer resources):

```
env {
  spark.app.name = "SeaTunnel Spark Job"
  spark.dynamicAllocation.enabled = false
  spark.executor.instances = 1
  spark.executor.cores = 1
  spark.executor.memory = "2g"
  spark.driver.memory = "1g"
  spark.dynamicAllocation.minExecutors = 1
  spark.executor.memoryOverhead = 1g
  spark.executor.heartbeatInterval = 60s
}
```
Before optimization: the job ran for 15 min.
We can see the sink writer is much slower than the source reader...
After optimization: the job runs in 3 min.
This is because we now write rows in batches instead of one row at a time, and flush the temp file as soon as it is fully written.
We can see the sink writer now not only keeps up with the source data, but also avoids executor task failures from out-of-memory when the job's resource config is tight (for the tested job we set only 2g for the executor and 1g for the driver).
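The batch-then-flush idea described above can be sketched as follows. This is a minimal illustration, not the actual SeaTunnel connector code: the class and method names are hypothetical, and the real connector would hand each batch to an ORC writer (e.g. via `addRowBatch`) instead of just counting flushes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batch-rows writing: buffer rows and perform one
// expensive write call per batch instead of one call per row.
public class BatchingSink {
    static final int BATCH_SIZE = 1024;

    private final List<String> buffer = new ArrayList<>(BATCH_SIZE);
    private int flushCount = 0;

    // Called once per source row.
    public void write(String row) {
        buffer.add(row);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // One I/O call per BATCH_SIZE rows; in the real connector this would be
    // something like orcWriter.addRowBatch(batch).
    private void flush() {
        flushCount++;
        buffer.clear();
    }

    // Flush the remainder and close the temp file as soon as writing is done.
    public int close() {
        if (!buffer.isEmpty()) {
            flush();
        }
        return flushCount;
    }

    public static void main(String[] args) {
        BatchingSink sink = new BatchingSink();
        for (int i = 0; i < 2500; i++) {
            sink.write("row-" + i);
        }
        // 2500 rows with a batch size of 1024 -> 3 flushes instead of 2500 writes
        System.out.println(sink.close());
    }
}
```

With 24 million rows, cutting the number of writer calls by a factor of the batch size is where the 15 min to 3 min speedup comes from.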
Purpose of this pull request
Check list
- [ ] Code changes are covered with tests, or tests are not needed because:
- [ ] If your PR adds any new Jar binary packages, please add a License Notice according to the New License Guide
- [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/incubator-seatunnel/tree/dev/docs
- [ ] If you are contributing the connector code, please check that the following files are updated:
- Update the change log in the connector document. For more details you can refer to connector-v2
- Update plugin-mapping.properties and add new connector information in it
- Update the pom file of seatunnel-dist
Please fix code style by this doc:
https://github.com/apache/incubator-seatunnel/blob/dev/docs/en/contribution/setup.md
OK, I have recommitted.
Please merge the latest code so CI can pass.
Sorry, I have some doubts about this. With non-partitioned data this solution works, but when writing partitioned data to ORC files, different rows may need to be written to different files, so the writer object changes and a single row batch no longer makes sense.
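One way the partition concern above could be handled (a sketch under assumptions, not the PR's actual implementation) is to keep a separate buffer per partition key, so batching still applies per writer even when rows fan out to different files. All names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: when rows go to different partition files, keep one
// buffer per partition key so each writer still receives batched rows.
public class PartitionedBatcher {
    static final int BATCH_SIZE = 1024;

    private final Map<String, List<String>> buffers = new HashMap<>();
    private final Map<String, Integer> flushes = new HashMap<>();

    public void write(String partition, String row) {
        List<String> buf = buffers.computeIfAbsent(partition, k -> new ArrayList<>());
        buf.add(row);
        if (buf.size() >= BATCH_SIZE) {
            flush(partition, buf);
        }
    }

    // Real code would call the ORC writer bound to this partition's file.
    private void flush(String partition, List<String> buf) {
        flushes.merge(partition, 1, Integer::sum);
        buf.clear();
    }

    // Flush every partition's remaining rows before closing the files.
    public Map<String, Integer> close() {
        buffers.forEach((p, buf) -> {
            if (!buf.isEmpty()) {
                flush(p, buf);
            }
        });
        return flushes;
    }
}
```

The trade-off is memory: each open partition holds up to one batch in memory, so a high-cardinality partition column would need a cap on concurrently buffered partitions.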