[Improve][Connector-V2][File] Speed up more than 2x for writing orc file.

Open kyehe opened this issue 2 years ago • 2 comments

Tested Job For MySQL2Hive

  • MySQL source: 2.4kw rows (about 24 million)
  • env config (the goal is faster speed with fewer resources):
      env {
        spark.app.name = "SeaTunnel Spark Job"
        spark.dynamicAllocation.enabled = false
        spark.executor.instances = 1
        spark.executor.cores = 1
        spark.executor.memory = "2g"
        spark.driver.memory = "1g"
        spark.dynamicAllocation.minExecutors = 1
        spark.executor.memoryOverhead = 1g
        spark.executor.heartbeatInterval = 60s
      }

Before optimization: the job runs for 15 min

image

image

We can see that the sink writer is much slower than the source reader...

image

After optimization: the job runs for 3 min

image

image

This is because we now insert rows in batches rather than one row at a time, and flush the temp file once it has been fully written.

We can see that the sink writer not only consumes the source data faster, but also avoids executor task failures caused by out-of-memory errors when the job's resource configuration is tight (for the tested job we set just 2g for the executor and 1g for the driver).

image
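
To make the batch-rows idea concrete, here is a minimal sketch of buffering rows and handing them to the underlying writer in batches instead of one call per row. It is not the code from this PR: `CountingSink`, `BatchingWriter`, and the batch size of 1024 are hypothetical names and values chosen for illustration, and the sink only counts calls instead of writing an ORC file.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingWriterSketch {

    /** Hypothetical stand-in for the real file writer; it only counts physical write calls. */
    static class CountingSink {
        int writeCalls = 0;
        long rowsWritten = 0;

        void writeBatch(List<String> rows) {
            writeCalls++;
            rowsWritten += rows.size();
        }
    }

    /** Buffers rows in memory and passes them to the sink once the batch is full. */
    static class BatchingWriter {
        private final CountingSink sink;
        private final int batchSize;
        private final List<String> buffer;

        BatchingWriter(CountingSink sink, int batchSize) {
            this.sink = sink;
            this.batchSize = batchSize;
            this.buffer = new ArrayList<>(batchSize);
        }

        void write(String row) {
            buffer.add(row);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Writes whatever is buffered; must also be called once at the end of the job. */
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.writeBatch(buffer);
            buffer.clear(); // bounded memory: at most batchSize rows are held at a time
        }
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        BatchingWriter writer = new BatchingWriter(sink, 1024); // assumed batch size
        for (int i = 0; i < 2500; i++) {
            writer.write("row-" + i);
        }
        writer.flush(); // flush the final partial batch
        System.out.println(sink.writeCalls + " write calls for " + sink.rowsWritten + " rows");
    }
}
```

Because the buffer is cleared after every flush, memory use stays bounded by the batch size, which fits the observation above that the job no longer runs out of memory with a small executor.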

Purpose of this pull request

Check list

  • [ ] Code changed are covered with tests, or it does not need tests for reason:
  • [ ] If any new Jar binary package adding in your PR, please add License Notice according New License Guide
  • [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/incubator-seatunnel/tree/dev/docs
  • [ ] If you are contributing the connector code, please check that the following files are updated:
    1. Update change log that in connector document. For more details you can refer to connector-v2
    2. Update plugin-mapping.properties and add new connector information in it
    3. Update the pom file of seatunnel-dist

kyehe · Dec 15 '22 14:12

Please fix the code style according to this doc:

https://github.com/apache/incubator-seatunnel/blob/dev/docs/en/contribution/setup.md

TaoZex · Dec 15 '22 17:12

OK, I have recommitted.

kyehe · Dec 16 '22 07:12

Please merge the latest code so that CI can pass.

TaoZex · Jan 08 '23 10:01

Sorry, I have some doubts about this. For non-partitioned data this solution works, but when writing partitioned data to ORC files, different rows may need to be written to different files, so the writer object changes and the row batch no longer makes sense.

TyrantLucifer · Jan 08 '23 11:01
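
The partition concern raised in the last comment can be illustrated with a small sketch. It shows one possible way to keep batching when consecutive rows target different files: hold a separate buffer per target file path. This is an assumption for illustration only, not something proposed in the thread or implemented in the PR; `PartitionedBatcher`, `FlushLog`, and the `dt=...` paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionedBatchSketch {

    /** Hypothetical stand-in for per-file writers; records each flushed batch as "path:rowCount". */
    static final class FlushLog {
        final List<String> entries = new ArrayList<>();

        void record(String filePath, int rowCount) {
            entries.add(filePath + ":" + rowCount);
        }
    }

    /** Keeps one row buffer per target file, so interleaved partitions do not break batching. */
    static final class PartitionedBatcher {
        private final Map<String, List<String>> buffers = new LinkedHashMap<>();
        private final FlushLog log;
        private final int batchSize;

        PartitionedBatcher(FlushLog log, int batchSize) {
            this.log = log;
            this.batchSize = batchSize;
        }

        void write(String filePath, String row) {
            List<String> buffer = buffers.computeIfAbsent(filePath, k -> new ArrayList<>());
            buffer.add(row);
            if (buffer.size() >= batchSize) {
                flush(filePath, buffer);
            }
        }

        private void flush(String filePath, List<String> buffer) {
            if (buffer.isEmpty()) {
                return;
            }
            log.record(filePath, buffer.size());
            buffer.clear();
        }

        /** Flushes the remaining partial batch of every partition. */
        void close() {
            buffers.forEach(this::flush);
        }
    }

    public static void main(String[] args) {
        FlushLog log = new FlushLog();
        PartitionedBatcher batcher = new PartitionedBatcher(log, 2); // tiny batch size for the demo
        batcher.write("dt=01", "a");
        batcher.write("dt=02", "b");
        batcher.write("dt=01", "c");
        batcher.write("dt=02", "d");
        batcher.write("dt=01", "e");
        batcher.close();
        System.out.println(log.entries);
    }
}
```

The trade-off is memory: with many active partitions, up to one batch per partition is held at once, so the batch size would likely need to shrink as the partition count grows.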