DAOS-17738 client: reset DTX base UUID after fork
Reset the DTX base UUID after fork so that the parent and the forked child do not generate the same DTX ID.
It also changes the vos_dtx logic to avoid an assertion when the client reuses a DTX ID.
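For reviewers unfamiliar with the fork problem, here is a minimal sketch of the general technique. It is not the code in this patch: every `demo_*` name is hypothetical, and the "DTX ID" is simplified to a random 16-byte base plus a sequence number. A child handler registered with `pthread_atfork()` regenerates the base, so the child stops handing out IDs derived from the state it inherited from the parent.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a DTX ID: per-process base plus sequence number. */
struct demo_dtx_id {
	uint8_t  base[16];
	uint64_t seq;
};

static uint8_t  demo_base[16];
static uint64_t demo_seq;

/* Regenerate the base from /dev/urandom; fall back to the pid on failure. */
static void
demo_base_reset(void)
{
	ssize_t got = -1;
	int     fd = open("/dev/urandom", O_RDONLY);

	if (fd >= 0) {
		got = read(fd, demo_base, sizeof(demo_base));
		close(fd);
	}
	if (got != (ssize_t)sizeof(demo_base)) {
		pid_t pid = getpid();

		memset(demo_base, 0, sizeof(demo_base));
		memcpy(demo_base, &pid, sizeof(pid));
	}
	demo_seq = 0;
}

/* Initialize the base and make sure every forked child gets a fresh one. */
void
demo_init(void)
{
	demo_base_reset();
	pthread_atfork(NULL, NULL, demo_base_reset);
}

/* Not thread-safe; a real client would need a lock or atomics here. */
static void
demo_next_id(struct demo_dtx_id *id)
{
	assert(id != NULL);
	memcpy(id->base, demo_base, sizeof(id->base));
	id->seq = demo_seq++;
}

/* Fork once: returns 1 if parent and child IDs differ, 0 if equal, -1 on error. */
int
demo_fork_ids_differ(void)
{
	struct demo_dtx_id mine, theirs;
	int                fds[2];
	pid_t              pid;
	ssize_t            n;

	if (pipe(fds) != 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: the atfork handler has already reset the base. */
		demo_next_id(&mine);
		n = write(fds[1], &mine, sizeof(mine));
		_exit(n == (ssize_t)sizeof(mine) ? 0 : 1);
	}
	close(fds[1]);
	demo_next_id(&mine);
	n = read(fds[0], &theirs, sizeof(theirs));
	close(fds[0]);
	waitpid(pid, NULL, 0);
	if (n != (ssize_t)sizeof(theirs))
		return -1;
	return memcmp(mine.base, theirs.base, sizeof(mine.base)) != 0 ||
	       mine.seq != theirs.seq;
}
```

Without the `pthread_atfork()` registration in `demo_init()`, both processes would start from the same base and sequence number and produce identical IDs, which is the collision this change is meant to prevent.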
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'daos rebuild cluster has some asserted engines with dtx_cmt_ent_update() Assertion 'dce_new->dce_reindex''
Status is 'In Review'
Labels: 'ALCF,alcf_track,hpe_cluster'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17738
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16539/5/execution/node/1425/log
The failed ior runs look specific to this PR rather than to known issues:
https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/view/change-requests/job/PR-16539/5/artifact/Functional%20Hardware%20Medium%20MD%20on%20SSD/ior/small.py/daos_logs.hdr-134/test_ior_small_daos_client.log/view/
07/02-05:13:47.83 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_shard.c:854 dc_rw_cb() a40b49fd/7c51be16 939000593309949439.0.31.1 rpc 0x7fc604006d30 opc 0 to rank 3 tag 9: DER_TX_ID_REUSED(-2040): 'TX ID may be reused'
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_shard.c:854 dc_rw_cb() a40b49fd/7c51be16 939000593309949439.0.19.1 rpc 0x7fc5ec00b5f0 opc 0 to rank 3 tag 8: DER_TX_ID_REUSED(-2040): 'TX ID may be reused'
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_shard.c:854 dc_rw_cb() a40b49fd/7c51be16 939000593309949439.0.19.1 rpc 0x7fc5d40e7040 opc 0 to rank 3 tag 8: DER_TX_ID_REUSED(-2040): 'TX ID may be reused'
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_obj.c:4883 obj_comp_cb() TX ID maybe reused for unknown reason, task 0x7fc5ec00db90, opc 0, flags 100000, retry_cnt 1
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] dfuse WARN src/client/dfuse/ops/write.c:17 dfuse_cb_write_complete(0x7fc5f0003650) Returning: 5 (Input/output error)
Hmm, I'm holding off on landing this patch for now, since Fanyong still has concerns.
Yes, I agree. I will wait for FY to reply to my last comment.
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16539/11/execution/node/543/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16539/11/execution/node/498/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16539/14/testReport/
NLT has the same error as reported in https://daosio.atlassian.net/browse/DAOS-17416