[INLONG-7072][Manager][Sort] Resource adaptive adjustment for Hudi

Open featzhang opened this issue 3 years ago • 3 comments

Prepare a Pull Request

  • Fixes #7072

Motivation

Hudi Flink jobs are often given unreasonable resource allocations: over-allocation wastes resources, and under-allocation causes back pressure or OOM errors.

When allocating resources, first determine the concurrency (parallelism) of the source side so that no data backlog builds up upstream during reading. As a general reference, a table partitioned by day with about 15 billion records per day is configured with a source concurrency of about 50; scale this proportionally for other data volumes.

After determining the concurrency on the source side, configure the write concurrency at a source-to-write ratio of 1:1.5 or 1:2.

If the write operator runs out of memory (OOM) while the job is running, increase the write concurrency and the TaskManager (TM) memory as appropriate.

If back pressure occurs, adjust the concurrency according to the consumption difference between source and write. In the example below, the difference is about 50W (500,000) records, meaning 500,000 records that the write operator has not kept up with. From the number of records written successfully and the elapsed running time, you can calculate how much additional write concurrency is needed to cover that 500,000-record difference.

[Two images: screenshots illustrating the record-count difference between the source and write operators]
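The back-pressure calculation described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `extra_write_parallelism`, its parameters, and the sample numbers are all hypothetical, and it assumes that write throughput scales linearly with the number of write tasks and that the backlog built up over the same running time in which the records were written.

```python
import math


def extra_write_parallelism(backlog_records, written_records,
                            elapsed_seconds, write_parallelism):
    """Estimate how many extra write tasks are needed to absorb a backlog.

    Hypothetical helper: assumes linear scaling per write task and that the
    backlog accumulated during the same elapsed_seconds as written_records.
    """
    if written_records <= 0 or elapsed_seconds <= 0 or write_parallelism <= 0:
        raise ValueError("written_records, elapsed_seconds and "
                         "write_parallelism must be positive")
    # Records per second that one write task has actually sustained.
    per_task_rate = written_records / (elapsed_seconds * write_parallelism)
    # Records per second by which write is falling behind the source.
    backlog_rate = backlog_records / elapsed_seconds
    return math.ceil(backlog_rate / per_task_rate)


# Hypothetical numbers: 500,000 records behind, 4,500,000 records written
# in 600 seconds by 75 write tasks.
extra = extra_write_parallelism(500_000, 4_500_000, 600, 75)
print(extra)  # 9
```

With these sample numbers each write task sustains 100 records per second, so closing a gap of roughly 833 records per second would take about 9 additional write tasks. Real jobs rarely scale perfectly linearly, so the result is a starting point for tuning rather than a final value.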

Modifications

  1. Estimate the parallelism of the source node from the user's estimated daily data volume, assuming each core processes 1,000 records per second.
  2. Configure the write concurrency at a source-to-write ratio of 1:1.5 or 1:2.
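The two steps above could look roughly like the sketch below. The function names, the `SECONDS_PER_DAY` constant, and the sample daily volume are assumptions made for illustration and are not taken from the InLong code base; the sketch also assumes traffic is spread evenly across the day, which a real job with peaks would need to correct for.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_source_parallelism(daily_records, records_per_second_per_core=1_000):
    """Step 1: source parallelism from the user's estimated daily volume.

    Hypothetical helper; assumes an even arrival rate over the whole day.
    """
    if daily_records < 0 or records_per_second_per_core <= 0:
        raise ValueError("invalid volume or per-core rate")
    records_per_second = daily_records / SECONDS_PER_DAY
    # Always keep at least one source task, even for very small volumes.
    return max(1, math.ceil(records_per_second / records_per_second_per_core))


def estimate_write_parallelism(source_parallelism, ratio=1.5):
    """Step 2: write parallelism at a source-to-write ratio of 1:ratio."""
    return math.ceil(source_parallelism * ratio)


# Hypothetical input: 1 billion records per day.
source = estimate_source_parallelism(1_000_000_000)
print(source)                                   # 12
print(estimate_write_parallelism(source))       # 18
print(estimate_write_parallelism(source, 2.0))  # 24
```

Rounding up at each step errs on the side of slightly more resources, which matches the stated goal of avoiding back pressure and OOM rather than minimizing cost.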

Verifying this change

(Please pick either of the following options)

  • [x] This change is a trivial rework/code cleanup without any test coverage.

  • [ ] This change is already covered by existing tests, such as: (please describe tests)

  • [ ] This change added tests and can be verified as follows:

    (example:)

    • Added integration tests for end-to-end deployment with large payloads (10MB)
    • Extended integration test for recovery after broker failure

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a follow-up issue for adding the documentation

featzhang • Dec 27 '22 15:12

This PR is stale because it has been open for 60 days with no activity.

github-actions[bot] • Apr 04 '23 01:04

This PR is stale because it has been open for 60 days with no activity.

github-actions[bot] • Oct 24 '23 01:10

This PR is stale because it has been open for 60 days with no activity.

github-actions[bot] • Sep 18 '24 01:09