Add ensemble relocation command which adheres to placement policy
Motivation
Currently, we can't recover the ensemble which isn't adhering to placement policy directly.
The recover command or autorecovery process is one of the alternative approaches. Unfortunately, however, we can't.
For example, suppose
- the bookie cluster has two racks(
/region/rack1,/region/rack2) - target ledger has the ensemble (
[bookie1(/region/rack1), bookie2(/region/rack1)]) - the replicator uses RackawareEnsemblePlacementPolicy
When bookie1 or bookie2 goes down, then run autorecovery process.
If the bookie cluster has remained bookies which place to /region/rack1, the ledger recovers to the /region/rack1 bookie again.
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/ReplicationWorker.java#L225-L246
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeperAdmin.java#L1055-L1103
https://github.com/apache/bookkeeper/blob/release-4.14.4/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L474-L481
Therefore, we can't recover the ensemble which isn't adhering to placement policy directly.
The recover command is similar to this behavior.
I want to add a relocation feature to adhere placement policy.
Changes
Add relocation feature to shell command. So, we can use it synchronously instead of executing asynchronously like the autorecovery process.
In this specification, optimize the new ensemble arrangement to minify the number of bookies replaced. Therefore, we should implement the new method that returns the new ensemble by policies. First, I introduce this feature to RackawareEnsemblePlacementPolicy in this PR.
I've created the issue about the failing OWASP Dependency Check.
https://github.com/apache/bookkeeper/issues/3134
@eolivelli Addressed your first comments. PTAL. (Also rebase to current master to follow https://github.com/apache/bookkeeper/pull/3140 and https://github.com/apache/bookkeeper/pull/3149 .)
rerun failure checks
@eolivelli Addressed your comments. PTAL.
@eolivelli ping
I have rather generic comments.
If the bookie cluster has remained bookies which place to /region/rack1, the ledger recovers to the /region/rack1 bookie again.
As I understand, there is enforceMinNumRacksPerWriteQuorum option to prevent that (and enforceMinNumFaultDomainsForWrite, enforceStrictZoneawarePlacement for other placement policies)
As I understand, Auditor's placementPolicyCheck just detects the problem. Maybe it makes sense to make Auditor (optionally) put the ledgers with bad placement for re-replication and make AutoRecovery handle that? In this case the CLI command is already existing triggeraudit.
Another note is that the test only covers rack-aware policy. What happens in case of the region-aware or zone-aware policies?
Also: https://github.com/apache/bookkeeper/pull/3359
Also: https://github.com/apache/bookkeeper/pull/3359
Thank you for sharing. I'll check it first.
@dlg99 https://github.com/apache/bookkeeper/pull/2931#issuecomment-1190559164
Sorry for the late reply. I've checked https://github.com/apache/bookkeeper/pull/3359#pullrequestreview-1061219674 . The approach (using autorecovery) sounds better. If it was merged and still has the above issue, I'll rebase to follow https://github.com/apache/bookkeeper/pull/3359 interfaces.
As I understand, there is enforceMinNumRacksPerWriteQuorum option to prevent that (and enforceMinNumFaultDomainsForWrite, enforceStrictZoneawarePlacement for other placement policies)
In my understanding, if we use enforceMinNumRacksPerWriteQuorum , we can't recover quorum temporally for redundancy when a rack goes down. I want to avoid it.
As I understand, Auditor's placementPolicyCheck just detects the problem. Maybe it makes sense to make Auditor (optionally) put the ledgers with bad placement for re-replication and make AutoRecovery handle that? In this case the CLI command is already existing triggeraudit.
I think so too. I thought the next step is asynchronous relocation, like the auto-recovery feature (mentioned in here). However, it was implemented in https://github.com/apache/bookkeeper/pull/3359 .
Another note is that the test only covers rack-aware policy. What happens in the case of the region-aware or zone-aware policies?
When using non-Rackaware, it throws an Exception. I'll add the test. https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/EnsemblePlacementPolicy.java#L460-L467
I want to use the EnsemblePlacementPolicy#replaceToAdherePlacementPolicy in #3359, it's more graceful, and I will enhance it in some cases.
@horizonzy
I notice that the implements didn’t handle the bookie which already shutdown. The shutdown bookie will be as the defaultRack, can we replace it?
Currently, maybe, we can't do it. This feature expects all of the bookies were already recovered to a normal state. However, your point is correct.
I want to use the EnsemblePlacementPolicy#replaceToAdherePlacementPolicy in https://github.com/apache/bookkeeper/pull/3359, it's more graceful, and I will enhance it in some cases.
It's okay. Just confirm, do you mean "bring this feature to https://github.com/apache/bookkeeper/pull/3359 and improve it myself"? Or improve it in this PR?
Improve it both #3359 and this pr. I think the improvement is common.
I notice that the implements didn’t handle the bookie which already shutdown. The shutdown bookie will be as the defaultRack, can we replace it?
Currently, maybe, we can't do it. This feature expects all of the bookies were already recovered to a normal state. However, your point is correct.
I've checked the current code and found handing. If the enforceMinNumRacksPerWriteQuorum is true and the default rack name is /default-region/default-rack then handle in below.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L1089-L1090
If the default rack name is /default-rack, then we can't add it.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L138
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/net/NetworkTopologyImpl.java#L416-L420
However, it is not expected of me. I'll modify to add to excludeNodes regardless of enforceMinNumRacksPerWriteQuorum value.
Moreover, EnsemblePlacementPolicy#isEnsembleAdheringToPlacementPolicy has a similar process like below.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L1043
I notice that the implements didn’t handle the bookie which already shutdown. The shutdown bookie will be as the defaultRack, can we replace it?
If the default rack name is /default-rack, then we can't add it. https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/net/NetworkTopologyImpl.java#L416-L420
I means the defaultRack is not literal, I means the defaultRack is /default-region/default-rack. 😂
However, it is not expected of me. I'll modify to add to excludeNodes regardless of enforceMinNumRacksPerWriteQuorum value.
Yes, If we already knows which bookie shutdown, I can add it to excludeNodes to ignore it.
In my understanding, currently, defaultRack of Rackaware is /default-rack. So we should consider about it if use it.
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L138
https://github.com/equanz/bookkeeper/blob/62970b3122fa1f8266763f0d3f6115a066fcabe9/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RackawareEnsemblePlacementPolicyImpl.java#L234-L248
However, it is not expected of me. I'll modify to add to excludeNodes regardless of enforceMinNumRacksPerWriteQuorum value.
Yes, If we already knows which bookie shutdown, I can add it to excludeNodes to ignore it.
Could you tell me more about shutdown? I think the resolver doesn't return /default-region/default-rack just call TopologyAwareEnsemblePlacementPolicy#handleBookiesThatJoined.
In my understanding, currently, defaultRack of Rackaware is
/default-rack. So we should consider about it if use it.
Yes, you are right. If the test case, we changed the default rack to /defualt-region/default-rack.
repp.withDefaultRack(NetworkTopology.DEFAULT_REGION_AND_RACK);
If we didn't change it, the default rack is /default-rack.
fix old workflow,please see #3455 for detail
Part of the code are ported to https://github.com/apache/bookkeeper/pull/3359 and it was merged. So, I'll close this PR.