DAOS-15931 rebuild: fix data corruption caused by partial parity rebuild epoch (#14512)

Open gnailzenh opened this issue 1 year ago • 2 comments

Rebuild code change:

  1. __migrate_fetch_update_parity(): fix a bug when setting the partial replica rebuild epoch for parity shard rebuild.
  2. __migrate_fetch_update_bulk() should carry the DIOF_FOR_MIGRATION flag.
  3. migrate_fetch_update_parity(): fix the parameters passed when calling __migrate_fetch_update_parity().

EC aggregation change:

  1. In ds_obj_ec_rep_handler() and ds_obj_ec_agg_handler(), vos_update_begin() should carry VOS_OF_REBUILD to avoid a -DER_VOS_PARTIAL_UPDATE failure.
  2. Give EC aggregation more chances to abort once rebuild has started, to narrow the conflict window.
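
The second EC aggregation item can be pictured with a minimal sketch, assuming the idea is to re-check for a running rebuild before each unit of aggregation work rather than only once at the start. Everything here is hypothetical and for illustration only: `agg_run`, `rebuild_started_fn`, `agg_stripe_fn`, `AGG_ABORTED`, and the `demo_*` helpers are not DAOS symbols, and the actual change in this PR may place the checks differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: "has rebuild started?". Not a DAOS API. */
typedef bool (*rebuild_started_fn)(void *arg);

/* Hypothetical per-stripe aggregation callback; returns 0 on success. */
typedef int (*agg_stripe_fn)(size_t stripe_idx, void *arg);

/* Placeholder return code for an aborted run; not a DAOS error code. */
#define AGG_ABORTED 1

/*
 * Process up to nr_stripes stripes, checking for rebuild before every
 * stripe so the run can stop early. The number of stripes completed is
 * stored in *done.
 */
static int
agg_run(size_t nr_stripes, rebuild_started_fn rebuild_started, void *rb_arg,
        agg_stripe_fn do_stripe, void *st_arg, size_t *done)
{
    size_t i;

    *done = 0;
    for (i = 0; i < nr_stripes; i++) {
        int rc;

        if (rebuild_started(rb_arg))
            return AGG_ABORTED; /* yield to rebuild as early as possible */

        rc = do_stripe(i, st_arg);
        if (rc != 0)
            return rc;
        (*done)++;
    }
    return 0;
}

/* Demo helpers: rebuild "starts" once `processed` reaches `rebuild_at`. */
struct demo_state {
    size_t processed;
    size_t rebuild_at;
};

static bool
demo_rebuild_started(void *arg)
{
    struct demo_state *s = arg;

    return s->processed >= s->rebuild_at;
}

static int
demo_do_stripe(size_t idx, void *arg)
{
    struct demo_state *s = arg;

    (void)idx;
    s->processed++;
    return 0;
}
```

Checking inside the loop bounds the time aggregation keeps running after rebuild begins to roughly one stripe's worth of work, which is one way to read "save conflict window" in the item above.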

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

gnailzenh • Jun 09 '24 16:06

Ticket title is 'data corruptions found if reintegration triggered while pool rebuild is running'
Status is 'Awaiting backport'
Labels: 'google-cloud-daos,scrubbed_2.8'
Job should run at elevated priority (1)
Errors are Title of PR is too long
https://daosio.atlassian.net/browse/DAOS-15931

github-actions[bot] • Jun 09 '24 16:06

Looks like duplicate of #14529, which already landed?

daltonbohning • Jun 17 '24 15:06

Test stage Python Bandit check completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-14535/2/execution/node/146/log

daosbuild3 • Oct 24 '25 10:10