DAOS-17738 client: reset DTX base UUID after fork

Open Nasf-Fan opened this issue 6 months ago • 1 comments

Reset the DTX base UUID after fork so that the parent and child processes do not generate the same DTX ID.

It also changes the vos_dtx logic to avoid an assertion when a client reuses a DTX ID.
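
For readers unfamiliar with the problem, here is a minimal sketch of why a fork can lead to duplicate IDs and one way to avoid it. It is not the DAOS implementation: `struct demo_id`, `demo_base_reset()`, `demo_id_next()` and `demo_child_base_differs()` are hypothetical names, and the sketch detects a fork by comparing a cached PID against `getpid()`, which may differ from what this patch does. It assumes a Linux-like system with `/dev/urandom`.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical ID layout for illustration only; not the DAOS DTX ID. */
struct demo_id {
    unsigned char base[16]; /* random per-process base */
    uint64_t      seq;      /* per-process counter */
};

static unsigned char demo_base[16];
static uint64_t      demo_seq;
static pid_t         demo_owner_pid; /* 0 until the first reset */

/* Draw a fresh random base and restart the counter for this process. */
static int demo_base_reset(void)
{
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, demo_base, sizeof(demo_base));
    close(fd);
    if (n != (ssize_t)sizeof(demo_base))
        return -1;
    demo_seq = 0;
    demo_owner_pid = getpid();
    return 0;
}

/*
 * Produce the next ID. A forked child inherits demo_base and demo_seq,
 * so without the PID check it would repeat the parent's IDs.
 * Not thread-safe; a real client would need locking or atomics.
 */
static int demo_id_next(struct demo_id *out)
{
    if (demo_owner_pid != getpid() && demo_base_reset() != 0)
        return -1;
    memcpy(out->base, demo_base, sizeof(out->base));
    out->seq = demo_seq++;
    return 0;
}

/* Returns 1 if a forked child gets a different base, 0 if not, -1 on error. */
static int demo_child_base_differs(void)
{
    struct demo_id parent_id, child_id;

    if (demo_id_next(&parent_id) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: exit 0 when the base differs, 1 when it matches, 2 on error. */
        if (demo_id_next(&child_id) != 0)
            _exit(2);
        _exit(memcmp(parent_id.base, child_id.base, 16) != 0 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Calling `demo_child_base_differs()` from a test program should return 1. Removing the PID check in `demo_id_next()` makes the child reuse the parent's base and counter, which is the kind of collision this PR is meant to prevent. Registering a child handler with `pthread_atfork()` is another common way to trigger the reset.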

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

Nasf-Fan avatar Jun 25 '25 04:06 Nasf-Fan

Ticket title is 'daos rebuild cluster has some asserted engines with dtx_cmt_ent_update() Assertion 'dce_new->dce_reindex''
Status is 'In Review'
Labels: 'ALCF,alcf_track,hpe_cluster'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17738

github-actions[bot] avatar Jun 25 '25 04:06 github-actions[bot]

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16539/5/execution/node/1425/log

daosbuild3 avatar Jul 02 '25 06:07 daosbuild3

The failed ior runs look specific to this PR and are not known issues:

https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/view/change-requests/job/PR-16539/5/artifact/Functional%20Hardware%20Medium%20MD%20on%20SSD/ior/small.py/daos_logs.hdr-134/test_ior_small_daos_client.log/view/

07/02-05:13:47.83 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_shard.c:854 dc_rw_cb() a40b49fd/7c51be16 939000593309949439.0.31.1 rpc 0x7fc604006d30 opc 0 to rank 3 tag 9: DER_TX_ID_REUSED(-2040): 'TX ID may be reused'
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_shard.c:854 dc_rw_cb() a40b49fd/7c51be16 939000593309949439.0.19.1 rpc 0x7fc5ec00b5f0 opc 0 to rank 3 tag 8: DER_TX_ID_REUSED(-2040): 'TX ID may be reused'
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_shard.c:854 dc_rw_cb() a40b49fd/7c51be16 939000593309949439.0.19.1 rpc 0x7fc5d40e7040 opc 0 to rank 3 tag 8: DER_TX_ID_REUSED(-2040): 'TX ID may be reused'
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] object ERR src/object/cli_obj.c:4883 obj_comp_cb() TX ID maybe reused for unknown reason, task 0x7fc5ec00db90, opc 0, flags 100000, retry_cnt 1
07/02-05:13:47.84 hdr-134 DAOS[250477/250478/0] dfuse WARN src/client/dfuse/ops/write.c:17 dfuse_cb_write_complete(0x7fc5f0003650) Returning: 5 (Input/output error)

mchaarawi avatar Jul 03 '25 16:07 mchaarawi

Hmm, I'm holding off on landing this patch for now, since Fanyong still has concerns.

gnailzenh avatar Jul 10 '25 15:07 gnailzenh

> Hmm, I'm holding off on landing this patch for now, since Fanyong still has concerns.

Yes, I agree. I will wait for FY to reply to my last comment.

mchaarawi avatar Jul 10 '25 15:07 mchaarawi

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16539/11/execution/node/543/log

daosbuild3 avatar Jul 21 '25 03:07 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16539/11/execution/node/498/log

daosbuild3 avatar Jul 21 '25 06:07 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16539/14/testReport/

daosbuild3 avatar Jul 23 '25 15:07 daosbuild3

NLT has the same error as reported here: https://daosio.atlassian.net/browse/DAOS-17416

mchaarawi avatar Jul 24 '25 16:07 mchaarawi