HDDS-9889. Configure adaptation for datanode limits in ContainerBalancer
Dynamic adaptation (introduced in HDDS-5526) for datanodes.involved.max.percentage.per.iteration in ContainerBalancer doesn't work well in some cases.
Sometimes the number of under-utilized nodes is not sufficient to satisfy the limit on the maximum percentage of datanodes participating in a balancing iteration (datanodes.involved.max.percentage.per.iteration). As a result, the collections of source and target datanodes are reset and balancing is skipped (see comment).
The issue is easily reproduced on clusters with few nodes (< 10), for example 4 or 5. To fix this case we have to set the datanodes.involved.max.percentage.per.iteration value to 100.
What changes were proposed in this pull request?
Rework some test cases to use clusters with a small number of datanodes.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9889
How was this patch tested?
hdds.scm.container.balancer.TestContainerBalancerTask is reworked:
- Extracted two classes:
  - hdds.scm.container.balancer.MockedSCM for setting up a testable hdds.scm.server.StorageContainerManager
  - hdds.scm.container.balancer.TestableCluster for creating a test cluster with a required number of datanodes
- Added TestContainerBalancerDatanodeNodeLimit with 3 tests extracted from TestContainerBalancerTask, run against clusters with different node counts (a rough usage sketch follows below).
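A rough illustration of how a parameterized test over different cluster sizes could be wired with these helpers. The constructors and assertions below are assumptions made for this sketch, not the exact APIs introduced in this PR:

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class ContainerBalancerSmallClusterSketch {

  @ParameterizedTest
  @ValueSource(ints = {4, 5, 6, 7, 10})
  void balancerHandlesSmallClusters(int datanodeCount) {
    // Assumed constructors: the real MockedSCM/TestableCluster signatures may differ.
    TestableCluster cluster = new TestableCluster(datanodeCount);
    MockedSCM mockedScm = new MockedSCM(cluster);
    // ... configure ContainerBalancerConfiguration, run one ContainerBalancerTask
    // iteration against mockedScm, and assert that container moves are scheduled.
  }
}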
@JacksonYao287, @sumitagrawl, please review the changes
@Montura please merge latest master into your branch; the compile error (not caused by this PR) is fixed in 582a5cec45449caf13f341d9ad59201fcca2902c
@adoroszlai, UPD: done!
UPD: Today I rebased this PR on the master branch (to get the latest changes)
Thanks @Montura. Only need to update from master if there is a conflict, or failing checks need code from master. Also, please use merge, not rebase.
@siddhantsangwan @sumitagrawl can you please review?
@siddhantsangwan @sumitagrawl could you please review?
UPD: Today I merged master branch in this PR to resolve conflicts
@siddhantsangwan @sumitagrawl could you please review?
@Montura please keep in mind that end of year is usually a holiday season in many places
@Montura Sorry about the code conflicts. This PR does not allow edits from maintainers, is that intentional? If it did, I'd try to keep it updated after merging PRs touching the same files.
@siddhantsangwan @sumitagrawl please take a look at the patch, to provide high-level feedback until conflicts are resolved
Tomorrow I'll resolve the conflicts; it wasn't intentional to forbid PR editing for maintainers. It's my first PR here, I'll do better next time.
No worries.
@Montura Thanks for working on this. I'm trying to understand the problem. If you wrote a test that fails without your fix, please point me to it.
Sure, I'm merging current master now; when I finish, I'll point you to the test.
Sometimes the number of under-utilized nodes is not sufficient to satisfy the limit on the maximum percentage of datanodes participating in a balancing iteration (datanodes.involved.max.percentage.per.iteration). As a result, the collections of source and target datanodes are reset and balancing is skipped.
I didn't really get this. Can you please elaborate? It'd be helpful to have a small example where you describe this problem.
Let's imagine that you have a cluster with a total DN count anywhere in [4, 9] (4, 5, 6, 7, 8 or 9).
Then the maximum number of DNs that can be involved in balancing for such a cluster will be 1, because the default value of maxDatanodesRatioToInvolvePerIteration is 0.2 (20%). So the next two methods will skip balancing when the DN count is less than 10.
// ContainerBalancerTask#adaptWhenNearingIterationLimits
int maxDatanodesToInvolve = (int) (config.getMaxDatanodesRatioToInvolvePerIteration() * totalNodesInCluster);
if (countDatanodesInvolvedPerIteration + 1 == maxDatanodesToInvolve) {
  // Restricts potential target datanodes to nodes that have already been selected
}

// ContainerBalancerTask#adaptOnReachingIterationLimits
int maxDatanodesToInvolve = (int) (config.getMaxDatanodesRatioToInvolvePerIteration() * totalNodesInCluster);
if (countDatanodesInvolvedPerIteration == maxDatanodesToInvolve) {
  // Restricts potential source and target datanodes to nodes that have already been selected
}
// 4 * 0.2 = (0.8 -> cast_to_int) 0;
// 5 * 0.2 = (1.0 -> cast_to_int) 1;
// 6 * 0.2 = (1.2 -> cast_to_int) 1;
// 7 * 0.2 = (1.4 -> cast_to_int) 1;
// 8 * 0.2 = (1.6 -> cast_to_int) 1;
// 9 * 0.2 = (1.8 -> cast_to_int) 1;
// 10 * 0.2 = (2.0 -> cast_to_int) 2;
Per the Java spec's primitive narrowing conversion, the floating-point value is rounded to an integer value V, rounding toward zero using IEEE 754 round-toward-zero mode (§4.2.3).
The Java programming language uses round toward zero when converting a floating value to an integer (§5.1.3), which acts, in this case, as though the number were truncated, discarding the mantissa bits. Rounding toward zero chooses as its result the format's value closest to and no greater in magnitude than the infinitely precise result.
So we end up with under-utilized nodes that never take part in balancing at all. Every cluster with a DN count > 3 and < 10 will start balancing and do nothing because of the action in the ContainerBalancerTask#adaptWhenNearingIterationLimits method.
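If it helps, here is a tiny standalone snippet (plain Java, not Ozone code) that reproduces the truncation above:

public class NarrowingDemo {
  public static void main(String[] args) {
    double ratio = 0.2; // default datanodes.involved.max.percentage.per.iteration = 20%
    for (int totalNodes = 4; totalNodes <= 10; totalNodes++) {
      // double -> int narrowing rounds toward zero, so 4..9 nodes yield a limit of 0 or 1
      int maxDatanodesToInvolve = (int) (ratio * totalNodes);
      System.out.println(totalNodes + " nodes -> maxDatanodesToInvolve = " + maxDatanodesToInvolve);
    }
  }
}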
Ah, I understand. Yes, I've seen this happen in some small clusters. The recommendation is to increase the value of datanodes.involved.max.percentage.per.iteration accordingly. For example, it can be set to 100 for clusters of 15 Datanodes or less so that all Datanodes may be involved in balancing. Do you have any reason to not do this and make a code change instead?
It doesn't make sense to have a configuration datanodes.involved.max.percentage.per.iteration which imposes a limit, and then have another configuration adapt.balance.when.reach.the.limit which effectively disables the former limit. Why not just change datanodes.involved.max.percentage.per.iteration?
Ok, that makes sense.
Let me rewrite the tests to verify the desired behavior by increasing the value of datanodes.involved.max.percentage.per.iteration, and I'll revert the changes to the properties in hdds.scm.container.balancer.ContainerBalancerConfiguration.
What do you think?
UPD: I've updated the PR to use the datanodes.involved.max.percentage.per.iteration property in the containerBalancerShouldObeyMaxDatanodesToInvolveLimit test.
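For context, here is a minimal sketch of the kind of configuration tweak that test relies on, assuming ContainerBalancerConfiguration exposes a percentage setter roughly like the one below (the actual setter name and test wiring in the PR may differ):

// Assumed API: the setter name is illustrative, not confirmed from the PR.
ContainerBalancerConfiguration balancerConfig =
    ozoneConfiguration.getObject(ContainerBalancerConfiguration.class);
// Let every datanode participate, so a 4-9 node cluster is no longer capped at 0 or 1 datanodes.
balancerConfig.setMaxDatanodesPercentageToInvolvePerIteration(100);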
@siddhantsangwan, what about the current changes? Do you agree with them?
@siddhantsangwan, sorry about the duplicated request: what about the current changes? Do you agree with them?
Ok, I'll remove all the refactorings in ContainerBalancer, ContainerBalancerTask and TestContainerBalancerTask, and I will have to add some duplicated code in the new classes MockedSCM and TestableCluster. In the next PR I'll refactor TestContainerBalancerTask to remove the duplicates and reuse MockedSCM instances. Is that ok?
@Montura Is it possible to just add the new test using the existing testing framework in the current PR, while other improvements/refactoring can be done in future jiras?
It can be done in a very ugly way, because I need to run N testable clusters with different numbers of DNs, and each cluster encapsulates a ContainerBalancer and a ContainerBalancerTask too. For now only the ContainerBalancerTask field is shared between all the tests, and a cluster with a hardcoded 10 DNs is initialized by default before each test. Maybe I can create a new TestContainerBalancerTaskDatanodeNodeLimit file, move some test code from TestContainerBalancerTask, and use the new framework?
In the next PR I'll merge these two tests with the new test framework. Actually there is no new test framework: I just move the mocking code to MockedSCM and the code related to cluster initialization to TestableCluster. That is it.
@siddhantsangwan, please take a look
@adoroszlai, sure, I'll change it
@Montura Thanks for updating. Will try to review asap!
@siddhantsangwan, sorry for the reminder, do you have any updates?
@siddhantsangwan, sorry for the reminder once more, do you have any updates?
@siddhantsangwan, sorry for the reminder once more, could you approve these changes?