DAOS-17737 dtx: handle race between DTX refresh and DTX abort

Open Nasf-Fan opened this issue 6 months ago • 3 comments

If the current transaction is aborted by a racing DTX abort while dtx_refresh() yields, return a non-zero value to the sponsor to trigger a client-side RPC retry. That leaves the related transaction's status cleaner.

Add more checks after dtx_refresh() to avoid re-initializing an aborted DTX.

The patch also cleans up the usage of vos_dtx_validation() to handle the various DTX abort cases (including a possible resend after the abort).
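The check-after-yield pattern described above can be sketched roughly as follows. This is only an illustration, not the patch itself: every type, function, and return code here (`sketch_dtx`, `sketch_refresh`, `sketch_refresh_and_reinit`, `SKETCH_RETRY`) is a hypothetical stand-in rather than a DAOS API, and the racing abort is simulated with a flag instead of a real ULT yield.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; NOT the real DAOS error codes. */
enum sketch_rc { SKETCH_OK = 0, SKETCH_RETRY = -1 };

/* Hypothetical, heavily simplified stand-in for a DTX handle. */
struct sketch_dtx {
	bool  aborted;      /* set when a concurrent abort hits this transaction */
	void *entry;        /* stands in for the active DTX entry; NULL once aborted */
	int   reinit_count; /* how many times the handle was re-initialized */
};

/* Models a DTX abort issued by another actor. */
static void sketch_abort(struct sketch_dtx *dtx)
{
	dtx->aborted = true;
	dtx->entry = NULL;
}

/*
 * Models a refresh that yields. `racing_abort` simulates an abort
 * that lands while the caller is suspended.
 */
static void sketch_refresh(struct sketch_dtx *dtx, bool racing_abort)
{
	if (racing_abort)
		sketch_abort(dtx);
}

/*
 * Refresh, then re-check the handle before re-initializing it.
 * If the transaction was aborted during the yield, return a non-zero
 * value so the caller can ask the client to retry, instead of
 * re-initializing a handle whose entry is already gone.
 */
static int sketch_refresh_and_reinit(struct sketch_dtx *dtx, bool racing_abort)
{
	assert(dtx != NULL);

	sketch_refresh(dtx, racing_abort);

	/* State may have changed across the yield, so check it again. */
	if (dtx->aborted || dtx->entry == NULL)
		return SKETCH_RETRY;

	dtx->reinit_count++;
	return SKETCH_OK;
}
```

In this sketch, a caller that receives `SKETCH_RETRY` skips re-initialization entirely. That is the situation in which an assertion like the `D_ASSERT(dth->dth_ent != NULL)` named in the ticket title would otherwise be reachable.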

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

Nasf-Fan avatar Jun 24 '25 15:06 Nasf-Fan

Ticket title is '"D_ASSERT(dth->dth_ent != NULL);" failure in dtx_handle_reinit()' Status is 'In Progress' Labels: 'ALCF,hpe_cluster' https://daosio.atlassian.net/browse/DAOS-17737

github-actions[bot] avatar Jun 24 '25 15:06 github-actions[bot]

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16535/4/testReport/

daosbuild3 avatar Jun 26 '25 16:06 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16535/4/testReport/

daosbuild3 avatar Jun 26 '25 17:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16535/6/execution/node/885/log

daosbuild3 avatar Jul 08 '25 02:07 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16535/6/execution/node/930/log

daosbuild3 avatar Jul 08 '25 03:07 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16535/6/execution/node/840/log

daosbuild3 avatar Jul 08 '25 05:07 daosbuild3

@NiuYawei @wangshilong, please help review this patch again, since you have reviewed it before. Thanks!

Nasf-Fan avatar Jul 21 '25 01:07 Nasf-Fan