bookkeeper icon indicating copy to clipboard operation
bookkeeper copied to clipboard

BP-54: Repaired the ledger fragment which ensemble not adhere placement policy.

Open horizonzy opened this issue 2 years ago • 4 comments

Motivation

There is a user case about data availability.

  1. They have two racks, they have a rack aware policy that ensures it writes across two racks.
  2. They had some data on a topic with long retention
  3. They ran a disaster recovery(DR) test, during this test, they shut down one rack.
  4. During the period of the DR test, auto-recovery ran. Because the DR test only has one rack active, and because the default of auto-recovery is to do rack aware with the best effort, it recovered up to an expected number of replicas.
  5. They stopped the DR test and all was well, but now that ledger was only on one rack
  6. They ran another DR test, this time basically moving data to the another zone, but now data is missing because it is all only on one rack

We should supply a feature to support this case.

Auditor placement policy check logic

At the moment, we already support config auditorPeriodicPlacementPolicyCheckInterval to check the ledger's segment ensemble is adhering to the placement policy. If the value of auditorPeriodicPlacementPolicyCheckInterval > 0, Auditor will check it by scheduled task. Default value is 0, which means not to check the placement policy.

This feature is supported by [[BP-34](https://bookkeeper.apache.org/bps/BP-34-cluster-metadata-checker/)](https://bookkeeper.apache.org/bps/BP-34-cluster-metadata-checker/)

Drawbacks

In BP-34 Implementation, it detect which ledger fragment's ensemble is not adhering placement policy, only record it to LoggerState, not to repaired the data to adhere placement policy.

Proposal

Based on the above issues, we introduce a new config repairedPlacementPolicyNotAdheringBookieEnabled to handle this case.

In Auditor, if user config auditorPeriodicPlacementPolicyCheckInterval > 0, the scheduled task will check ledger fragment's ensemble is adhering placement policy. If not adhere and config repairedPlacementPolicyNotAdheringBookieEnabled is true, the Auditor will mark the ledger underreplicated.

In ReplicationWorker, it will get the undererplicated ledger, then will check the ledger data integrity then try to move data to alive bookie at now. If config repairedPlacementPolicyNotAdheringBookieEnabled is true, it will check the ledger fragment ensemble is adhering placement policy. The ledger fragment maybe loss data and not adhere placement policy at the same time,

we will ignore repaired adhering placement policy problem in this time, just replicate the data to active bookie and update ensemble info, cause the data integrity is more important. If the ensemble is still not adhering placement policy, the Auditor will mark this ledger again, then ReplicationWorker will repaired adhering placement policy problem.

If the ledger fragment only not adhere placement policy, ReplicationWorker will select other rack bookie to take place of old bookie which in the same rack with other bookies. If there is no more rack bookie, it won't repaired, record no more bookie to LoggerState.

How to replace not adhere fragment ensemble

At the moment, it already supports finding the ledger fragments which need to be replicated. In the ReplicationWorker, it uses a ledger checker to find the ledger fragment which needs to be replicated. It will read data from bookies according to the ensemble. If some bookies shut down or some bookies miss entry, we think the bookie needs to be replaced, and move the data to another bookie. In this case, we already know which bookie has a problem, and find another bookie to replace it. We define this case as DATA_LOSS.

In this feature, the ReplicationWorker will check if the ledger fragment ensemble is adhering to the placement policy. If it does not adhere, we also think the ledger fragment needs to be replicated. But in this case, we didn’t know which bookie needed to be replaced, so we needed to calculate which bookie should be replaced. We use RRTopologyAwareCoverageEnsemble to help us to calculate, it support quorums for every index, so we can use RRTopologyAwareCoverageEnsemble.apply to judge the new bookie is adere the placement policy before adding it to the ensemble.

There is a example to show how to find new bookie to adhere to the placement policy.

There are 9 bookies.

/rack1: bookie1, bookie2, bookie3.

/rack2: bookie4, bookie5, bookie6

/rack3: bookie7, bookie8, bookie9

E:5, WQ:2,AQ:2

CurrentEnsemble: [bookie1, bookie4, bookie7, bookie2, bookie3]

minNumRacksPerWriteQuorum: 2

The replacement step.

1. Select the first bookie(bookie1) from currenEnsemble as the base bookie

In RRTopologyAwareCoverageEnsemble, it didn't has any bookies, we add it as the first bookie in In RRTopologyAwareCoverageEnsemble.

Result.

RRTopologyAwareCoverageEnsemble: [bookie1]

quorums: quorums[0]:[/rack1], quorums[4]:[/rack1]

2. Select next bookie(bookie4), RRTopologyAwareCoverageEnsemble.apply(bookie4) then add it.

In RRTopologyAwareCoverageEnsemble, it will check bookie4(/rack2) is adhere min per write quorum, the RRTopologyAwareCoverageEnsembleFor now exists one bookie(bookie1), so the bookie4 is the second bookie, we should check the quorums[0]. The quorums[0] only has /rack1, bookie4(current want to add) is /rack2, so the number of different rack is greater than minNumRacksPerWriteQuorum, it adhere the placement policy, we add it bookie4 as the second bookie in RRTopologyAwareCoverageEnsemble.

Result.

RRTopologyAwareCoverageEnsemble: [bookie1, bookie4]

quorums:quorums[0]:[/rack1,/rack2], quorums[1]:[/rack2], quorums[4]:[/rack1]

3. Select next bookie(bookie7), RRTopologyAwareCoverageEnsemble.apply(bookie7) then add it.

In RRTopologyAwareCoverageEnsemble, it will check bookie7(/rack3) is adhere min per write quorum, the RRTopologyAwareCoverageEnsembleFor now exists two bookies(bookie1, bookie4), so the bookie7 is the third bookie, we should check the quorums[1]. The quorums[1] only has /rack2, bookie7(current want to add) is /rack3, so the number of different rack is greater than minNumRacksPerWriteQuorum, it adhere the placement policy, we add it bookie7 as the third bookie in RRTopologyAwareCoverageEnsemble.

Result.

RRTopologyAwareCoverageEnsemble: [bookie1, bookie4, bookie7]

quorums: quorums[0]: [/rack1,/rack2], quorums[1]:[/rack2,/rack3], quorums[2]: [/rack3], quorums[4]:[/rack1]

4. Select next bookie(bookie2), RRTopologyAwareCoverageEnsemble.apply(bookie2) then add it.

In RRTopologyAwareCoverageEnsemble, it will check bookie2(/rack1) is adhere min per write quorum, the RRTopologyAwareCoverageEnsembleFor now exists three bookies(bookie1, bookie4), so the bookie2 is the fourth bookie, we should check the quorums[2]. The quorums[2] only has /rack3, bookie2(current want to add) is /rack1, so the number of different rack is greater than minNumRacksPerWriteQuorum, it adhere the placement policy, we add it bookie2 as the fourth bookie in RRTopologyAwareCoverageEnsemble.

Result.

RRTopologyAwareCoverageEnsemble: [bookie1, bookie4, bookie7, bookie2]

quorums: quorums[0]: [/rack1,/rack2], quorums[1]:[/rack2,/rack3], quorums[2]: [/rack3, /rack1], quorums[3]: [/rack1],

quorums[4]:[/rack1]

5. Select last bookie(bookie3), RRTopologyAwareCoverageEnsemble.apply(bookie3) then add it.

In RRTopologyAwareCoverageEnsemble, it will check bookie3(/rack1) is adhere min per write quorum, the RRTopologyAwareCoverageEnsembleFor now exists four bookies(bookie1, bookie4), so the bookie3 is the fifth bookie, we should check the quorums[3]. The quorums[3] only has /rack1, bookie3(current want to add) is also /rack1, so the number of different rack is not greater than minNumRacksPerWriteQuorum, it not adhere the placement policy, we can't add it to RRTopologyAwareCoverageEnsemble.

Then we shuold select other bookie, for the new fifth bookie is adhere placement policy, we should ensure quorums[3] and quorums[4] both has two different rack. We notice that quorums[3] already exists /rack1, quorums[4] already exists /rack1, so we shouldn't pick the new bookie from /rack1, and now the RRTopologyAwareCoverageEnsemble already exists (bookie1,bookie4,bookie7,bookie2) , ww also exclude them. Excludes /rack1 and (bookie1,bookie4,bookie7,bookie2), we can pick (bookie5,bookie6,bookie8,bookie9) as the fifth bookie, the pick is random, we can pick anyone from selectable bookies. Here we random select bookie8 as the fifth bookie.

Result.

RRTopologyAwareCoverageEnsemble: [bookie1, bookie4, bookie7, bookie2, booki8]

quorums: quorums[0]: [/rack1,/rack2], quorums[1]:[/rack2,/rack3], quorums[2]: [/rack3, /rack1], quorums[3]: [/rack1,/rack3],

quorums[4]:[/rack1,/rack3]

6. Finnal, return bookie1,bookie4,bookie7,bookie2,bookie8 as the ensemble

Optimize for replacement policy

If we select CurrentEnsemble first bookie as the base bookie, then to calculate, it maybe replace more bookie in some case.

Example

/rack1: bookie1, bookie2, bookie3

/rack2: bookie4, bookie5, bookie6

E:4, WQ:2,AQ:2

minNumRacksPerWriteQuorum: 2

CurrentEnsemble: [bookie1, bookie2, bookie4, bookie3] (/rack1, /rack1, /rack2, /rack1)

If we select bookie1 as the base bookie, the result is [bookie1, bookie5, bookie2, bookie4] (/rack1, /rack2, /rack1, /rack2), replace three bookies.

In fact, we only replace the first bookie by bookie5, the result is [bookie5, bookie2, bookie4, bookie3] (/rack2, /rack1, /rack2, /rack2), replace one bookie.

So we will calculate the result based on every bookie. finnaly, return the min replacement result.

How to handle the ledger fragment is DATA_LOSS and DATA_NOT_ADHERING_PLACEMENT at the same time

If the ledger fragment is DATA_LOSS and DATA_NOT_ADHERING_PLACEMENT at the same time. We fix

DATA_LOSS case in this time, ignore DATA_NOT_ADHERING_PLACEMENT case.

After we fix DATA_LOSScase, the ledger fragment may be already ahere to the placement policy, if it still not adhere, we will

fix DATA_NOT_ADHERING_PLACEMENT at next time.

How to trigger this feature

If we want to repaired the ledger which ensemble is not adhering placement policy, we should config two param.

auditorPeriodicPlacementPolicyCheckInterval=3600
repairedPlacementPolicyNotAdheringBookieEnabled=true

In Auditor auditorPeriodicPlacementPolicyCheckInterval control the placement policy check detect interval, repairedPlacementPolicyNotAdheringBookieEnabled control is mark ledger Id to under replication managed when found a ledger ensemble not adhere placement policy.

In ReplicationWorker repairedPlacementPolicyNotAdheringBookieEnabled control is to repaired the ledger which ensemble not adhere placement policy.

Attention

  1. we need ensure the config repairedPlacementPolicyNotAdheringBookieEnabled=true in Auditor and ReplicationWorker at the same time.
  2. we also need the placement policy is same between Auditor and ReplicationWorker, cause both all need use placement policy to help to process.

Changes

  1. Support a new config repairedPlacementPolicyNotAdheringBookieEnabled to control is repaired ensemble not adhere placement policy problem.
  2. In Auditor placement policy check process, mark ledger if the ledger ensemble not adhere placement policy whenrepairedPlacementPolicyNotAdheringBookieEnabled is true.
  3. In ReplicationWorker rereplicate, repaired the ledger fragment to adhere placement policy.
  4. Add this feature in the docs.

New api in EnsemblePlacementPolicy

public interface EnsemblePlacementPolicy {
    /**
     * Returns placement result. If the currentEnsemble is not adhering placement policy, returns new ensemble that
     * adheres placement policy. It should be implemented to minify the number of bookies replaced.
     *
     * @param ensembleSize
     *            ensemble size
     * @param writeQuorumSize
 *                writeQuorumSize of the ensemble
     * @param ackQuorumSize
     *            ackQuorumSize of the ensemble
     * @param excludeBookies
     *            bookies that should not be considered as targets
     * @param currentEnsemble
     *            current ensemble
     * @return a placement result
     */
    default PlacementResult<List<BookieId>> replaceToAdherePlacementPolicy(
            int ensembleSize,
            int writeQuorumSize,
            int ackQuorumSize,
            Set<BookieId> excludeBookies,
            List<BookieId> currentEnsemble) {
        throw new UnsupportedOperationException();
    }
}

Be CareFul

  1. Now the feature only support RackawareEnsemblePlacementPolicy.
  2. If this feature open, there maybe lots of ledger will be mark underreplicated. The replicationWorker will replicate lots of ledger, it will increase read request and write request in bookie server. You should set a suitable rereplicationEntryBatchSize to avoid bookie server pressure.

Compatibility, Deprecation, and Migration Plan

The repairedPlacementPolicyNotAdheringBookieEnabled default is false, if user upgrade the new release, it won't change any behavior compared to before.

Test Plan

We will add tests for the following module.

  1. Auditor, test the ledger is marked underreplicated when the ledger fragment policy is not adhering placement policy.
  2. ReplicationWorker, test the not adhering placement policy fragment is repaired to adhere placement policy.

horizonzy avatar Jun 29 '22 13:06 horizonzy

The proposal PR: https://github.com/apache/bookkeeper/pull/3359

horizonzy avatar Jun 29 '22 13:06 horizonzy

@eolivelli ping

horizonzy avatar Jul 29 '22 13:07 horizonzy

@eolivelli updated the BP, could you take a look again, thanks

horizonzy avatar Aug 01 '22 12:08 horizonzy

@eolivelli Would you please help review this PR again? thanks. We hope this feature can be included in 4.16.0

hangc0276 avatar Aug 03 '22 02:08 hangc0276

closed by #3359

shoothzj avatar May 07 '24 00:05 shoothzj