DAOS-17387 pool: fix concurrent flag updates and rebuild checks
Fix unsafe concurrent updates of ds_pool flags across multiple Xstreams
- Restrict flag modification to system Xstream only
- Add synchronization for cross-Xstream flag operations
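As a rough illustration of why the restriction helps, here is a minimal sketch that assumes the pool flags are packed as adjacent bitfields. Every name in it (`pool_sketch`, `pool_flag_set`, `SYS_XS_ID`) is hypothetical and is not taken from the DAOS tree. Adjacent bitfields share one storage unit, so setting one flag is a read-modify-write of the whole unit, and two threads doing that concurrently can lose each other's update. Allowing writes from a single designated thread only avoids this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the actual DAOS definitions. */
#define SYS_XS_ID 0 /* assumed id of the system xstream */

struct pool_sketch {
	/* Adjacent bitfields share one storage unit, so writing one of
	 * them rewrites the unit that also holds the other. */
	unsigned int disable_rebuild : 1;
	unsigned int stopping : 1;
};

enum pool_flag { FLAG_DISABLE_REBUILD, FLAG_STOPPING };

/* Returns 0 on success, -1 when called from a non-system xstream. */
static int
pool_flag_set(struct pool_sketch *pool, int caller_xs, enum pool_flag flag, bool value)
{
	if (caller_xs != SYS_XS_ID)
		return -1; /* caller is expected to forward the request instead */

	switch (flag) {
	case FLAG_DISABLE_REBUILD:
		pool->disable_rebuild = value;
		break;
	case FLAG_STOPPING:
		pool->stopping = value;
		break;
	}
	return 0;
}
```

In this sketch a caller on any other xstream gets an error back and would have to hand the update over to the system xstream. How that handoff is done in the actual patch is not shown here.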
Clarify rebuild disablement mechanisms:
- Offline flag updates (via the ddb feature command):
  - Uses persistent sp_disable_rebuild flag
  - Now protected from accidental clearing during auto-rebuild
- Runtime self-heal configuration:
  - Uses ephemeral sp_self_heal property
  - Maintains separate state tracking
Prevent flag conflict in ds_pool_tgt_prop_update()
- Stop overriding ddb-set sp_disable_rebuild when auto-rebuild enabled
- Decouple persistent flag (sp_disable_rebuild) from runtime state (sp_self_heal)
This resolves race conditions in pool flag management and preserves administrator-set disablement states through rebuild lifecycle transitions.
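The decoupling in ds_pool_tgt_prop_update() can be pictured with the sketch below. It is an assumption-laden illustration, not the patch itself: `pool_state`, `prop_update`, `rebuild_allowed`, and the `SELF_HEAL_AUTO_REBUILD` bit value are all hypothetical, and the way the two states are combined in the rebuild check is a guess at the intent. The point is only that a property update touches the runtime state and leaves the persistent flag alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value, chosen for illustration only. */
#define SELF_HEAL_AUTO_REBUILD (1U << 1)

struct pool_state {
	bool     disable_rebuild; /* persistent flag, set offline (e.g. via ddb) */
	uint64_t self_heal;       /* runtime self-heal property */
};

/* A property update changes runtime state only; the persistent
 * flag is deliberately not derived from the new property value. */
static void
prop_update(struct pool_state *pool, uint64_t new_self_heal)
{
	pool->self_heal = new_self_heal;
}

/* Assumed rebuild check: the persistent flag always wins, otherwise
 * the runtime property decides. */
static bool
rebuild_allowed(const struct pool_state *pool)
{
	if (pool->disable_rebuild)
		return false;
	return (pool->self_heal & SELF_HEAL_AUTO_REBUILD) != 0;
}
```

With this shape, enabling auto-rebuild at runtime cannot silently re-enable rebuild on a pool that an administrator disabled offline.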
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Race Conditions and State Conflicts in Concurrent Pool Flag Updates and Rebuild Checks' Status is 'In Review' https://daosio.atlassian.net/browse/DAOS-17387
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-16222/3/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16222/9/testReport/
Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16222/9/testReport/
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16222/10/testReport/
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/11/execution/node/1545/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/11/execution/node/1500/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/12/execution/node/500/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16222/13/execution/node/1571/log
The only failure looks like a CI environment issue:
hdr-222:
-- Logs begin at Thu 2025-07-10 22:00:17 UTC, end at Thu 2025-07-10 22:08:44 UTC. --
Jul 10 22:07:56 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:07:56 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:07:56 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Jul 10 22:08:14 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:08:14 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:08:14 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Jul 10 22:08:32 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Jul 10 22:08:32 hdr-222.daos.hpc.amslabs.hpecorp.net kernel: IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16222/17/display/redirect
Pinging for a second review!