DAOS-17387 pool: fix concurrent flag updates and rebuild checks

Open wangshilong opened this issue 9 months ago • 5 comments

  • Fix unsafe concurrent updates of ds_pool flags across multiple Xstreams

    • Restrict flag modification to system Xstream only
    • Add synchronization for cross-Xstream flag operations
  • Clarify rebuild disablement mechanisms:

    1. Offline flag updates (via ddb feature command):

      • Uses persistent sp_disable_rebuild flag
      • Now protected from accidental clearing during auto-rebuild

    2. Runtime self-heal configuration:

      • Uses ephemeral sp_self_heal property
      • Maintains separate state tracking
  • Prevent flag conflict in ds_pool_tgt_prop_update()

    • Stop overriding ddb-set sp_disable_rebuild when auto-rebuild enabled
    • Decouple persistent flag (sp_disable_rebuild) from runtime state (sp_self_heal)

This resolves race conditions in pool flag management and preserves administrator-set disablement states through rebuild lifecycle transitions.
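The decoupling described above can be illustrated with a minimal sketch. It is not the code in this PR: `struct pool_sketch`, `SELF_HEAL_AUTO_REBUILD_SKETCH`, both functions, and the convention that Xstream 0 is the system Xstream are hypothetical stand-ins. Only the field names `sp_disable_rebuild` and `sp_self_heal` come from the description. The idea is that a runtime property update writes only the ephemeral self-heal state, and the rebuild check consults both sources separately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for ds_pool; not the real DAOS struct. */
struct pool_sketch {
	bool         sp_disable_rebuild; /* persistent flag, set offline via ddb */
	unsigned int sp_self_heal;       /* ephemeral runtime property bits */
};

/* Hypothetical bit meaning "auto-rebuild enabled" in sp_self_heal. */
#define SELF_HEAL_AUTO_REBUILD_SKETCH (1u << 1)

/* Assumption for this sketch: Xstream 0 is the system Xstream. */
#define SYS_XS_ID_SKETCH 0

/*
 * Apply a runtime self-heal update. Only the system Xstream may modify
 * pool flags, and sp_disable_rebuild is deliberately left untouched so
 * that enabling auto-rebuild cannot clear a ddb-set disablement.
 */
static int
pool_prop_update_sketch(struct pool_sketch *pool, unsigned int self_heal, int xs_id)
{
	assert(pool != NULL);
	if (xs_id != SYS_XS_ID_SKETCH)
		return -1; /* reject updates from non-system Xstreams */
	pool->sp_self_heal = self_heal;
	return 0;
}

/* Rebuild is off if either the persistent flag or the runtime state says so. */
static bool
pool_rebuild_disabled_sketch(const struct pool_sketch *pool)
{
	assert(pool != NULL);
	return pool->sp_disable_rebuild ||
	       !(pool->sp_self_heal & SELF_HEAL_AUTO_REBUILD_SKETCH);
}
```

In this sketch, a pool whose `sp_disable_rebuild` was set offline stays disabled even after auto-rebuild is enabled at runtime, which is the behaviour the description aims to preserve.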

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

wangshilong avatar Apr 09 '25 10:04 wangshilong

Ticket title is 'Race Conditions and State Conflicts in Concurrent Pool Flag Updates and Rebuild Checks'
Status is 'In Review'
https://daosio.atlassian.net/browse/DAOS-17387

github-actions[bot] avatar Apr 09 '25 10:04 github-actions[bot]

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16222/3/testReport/

daosbuild1 avatar Apr 12 '25 03:04 daosbuild1

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16222/9/testReport/

daosbuild3 avatar May 27 '25 03:05 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16222/9/testReport/

daosbuild3 avatar May 27 '25 09:05 daosbuild3

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16222/10/testReport/

daosbuild3 avatar Jun 11 '25 07:06 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/11/execution/node/1545/log

daosbuild3 avatar Jul 08 '25 23:07 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/11/execution/node/1500/log

daosbuild3 avatar Jul 09 '25 02:07 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/12/execution/node/500/log

daosbuild3 avatar Jul 10 '25 02:07 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/13/execution/node/1571/log

daosbuild3 avatar Jul 11 '25 02:07 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/13/execution/node/1571/log

The only failure looks like a CI environment issue:

hdr-222: -- Logs begin at Thu 2025-07-10 22:00:17 UTC, end at Thu 2025-07-10 22:08:44 UTC. --
Jul 10 22:07:56 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:07:56 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:07:56 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Jul 10 22:08:14 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:08:14 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:08:14 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Jul 10 22:08:32 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:08:32 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready

wangshilong avatar Jul 12 '25 05:07 wangshilong

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16222/17/display/redirect

daosbuild3 avatar Jul 31 '25 21:07 daosbuild3

Ping for a second review!

wangshilong avatar Aug 11 '25 06:08 wangshilong