DAOS-11679 dtx: misc patch for dtx related issues

Open Nasf-Fan opened this issue 3 years ago • 10 comments

The patch fixes the following DTX-related issues:

  1. Use the address of the tree handle for dtx_rpc_post(). If the value of the tree handle is passed directly as dtx_rpc_post()'s input parameter, it may be DAOS_HDL_INVAL when it is read before the DTX helper ULT has been scheduled. That causes a memory leak.

  2. Abort the remote DTX entries before the local one. A DTX abort may be triggered by dtx_leader_end because of an RPC timeout on some remote DTX participant(s). In that case, the client-side RPC sponsor may also hit the RPC timeout and resend the related RPC to the leader. To keep the DTX abort and the forwarding of the resent RPC from running in parallel, the local DTX is aborted only after the remote aborts are done. Until then, the server-side logic that handles the resent RPC finds the locally pinned DTX entry and notifies the client to resend the RPC later.

  3. Yield the CPU more frequently during DTX RPC dispatch. A DTX may contain a lot of participants. When such a DTX is committed or aborted, the leader needs to send RPCs to all related participants, which takes a lot of CPU cycles. In that case, the related DTX operation ULT yields to avoid blocking others for too long. The patch reduces the yield interval from once per 64 RPCs to once per 32 RPCs.

  4. Dynamically load DTX participant information (dtx_memberships). For a transaction with very large participant information, handling the related DTX (such as for DTX resync or for cleaning up stale DTX entries) may take some time, especially when the system is very busy. If all the related MBS information is pre-loaded into DRAM before it is actually handled, it holds a lot of DRAM for a long time, which may cause server OOM. The patch changes the DTX MBS load policy to load a large MBS when it is used instead of pre-loading it.

  5. Use the absolute shard index to reassemble the CPD RPC. The absolute shard index was accidentally changed to the local index within a single redundancy group when handling EC parity rotation. That causes unexpected CPD RPC reassembly on the leader. In addition, the reassembly logic on the leader has a defect that may cause out-of-bounds DRAM access. The patch fixes that as well.
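
To illustrate item 1, here is a minimal, simplified model of why passing the handle by value can leak. It is not the DAOS code: `HDL_INVAL`, `struct tree_hdl`, `helper_run`, `post_by_value`, `post_by_addr`, and `simulate` are all hypothetical names, and the "helper ULT" is modeled as a plain function call that happens after the argument value has already been captured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDL_INVAL 0 /* hypothetical stand-in for DAOS_HDL_INVAL */

struct tree_hdl {
	uint64_t cookie;
};

static int live_trees; /* trees created but not yet destroyed */

/* Models the helper ULT: the tree only exists once the helper has run. */
static void helper_run(struct tree_hdl *hdl)
{
	hdl->cookie = 42;
	live_trees++;
}

/* Destroys the tree only if the handle it is given is valid. */
static void tree_destroy(struct tree_hdl hdl)
{
	if (hdl.cookie != HDL_INVAL) {
		assert(live_trees > 0);
		live_trees--;
	}
}

/* By value: cleanup sees whatever the handle held when it was captured. */
static void post_by_value(struct tree_hdl hdl)
{
	tree_destroy(hdl);
}

/* By address: cleanup reads the handle's current value. */
static void post_by_addr(const struct tree_hdl *hdl)
{
	tree_destroy(*hdl);
}

/* Returns the number of trees leaked by one simulated request. */
static int simulate(bool pass_address)
{
	struct tree_hdl hdl = { HDL_INVAL };
	struct tree_hdl snapshot = hdl; /* value captured before the helper runs */

	live_trees = 0;
	helper_run(&hdl); /* helper is scheduled after the capture */
	if (pass_address)
		post_by_addr(&hdl);
	else
		post_by_value(snapshot);
	return live_trees;
}
```

In this model, `simulate(false)` leaves one tree alive because the stale snapshot still looks invalid, while `simulate(true)` cleans it up.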
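
The yield policy in item 3 can be sketched as a dispatch loop that gives up the CPU after every N RPCs. This is only an illustration under stated assumptions: `send_rpc`, `yield_cpu`, `dispatch_all`, and `struct dispatch_stats` are hypothetical stand-ins that merely count events, whereas a real server would send RPCs and yield the current ULT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for real RPC sends and ULT yields. */
struct dispatch_stats {
	size_t sent;   /* RPCs issued */
	size_t yields; /* times the dispatcher gave up the CPU */
};

static void send_rpc(struct dispatch_stats *st)
{
	st->sent++;
}

static void yield_cpu(struct dispatch_stats *st)
{
	st->yields++;
}

/* Send one RPC per participant, yielding after every `interval` RPCs. */
static struct dispatch_stats dispatch_all(size_t participants, size_t interval)
{
	struct dispatch_stats st = { 0, 0 };
	size_t i;

	assert(interval > 0);
	for (i = 0; i < participants; i++) {
		send_rpc(&st);
		if ((i + 1) % interval == 0)
			yield_cpu(&st);
	}
	return st;
}
```

For 100 participants, an interval of 64 yields once, while an interval of 32 yields three times (after RPCs 32, 64, and 96), so other ULTs get the CPU sooner and more often.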

Signed-off-by: Fan Yong [email protected]

Nasf-Fan avatar Sep 29 '22 14:09 Nasf-Fan

Bug-tracker data: Ticket title is 'server segfault in mdtest easy delete phase of io-500 on Aurora' Status is 'In Progress' Labels: 'tds,triaged' https://daosio.atlassian.net/browse/DAOS-11679

github-actions[bot] avatar Sep 29 '22 14:09 github-actions[bot]

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/5/execution/node/1100/log

daosbuild1 avatar Oct 01 '22 03:10 daosbuild1

Hit DAOS-11178, to be retested.

Nasf-Fan avatar Oct 01 '22 13:10 Nasf-Fan

Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/6/execution/node/143/log

daosbuild1 avatar Oct 01 '22 13:10 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/6/execution/node/1194/log

daosbuild1 avatar Oct 02 '22 23:10 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/8/execution/node/1116/log

daosbuild1 avatar Oct 07 '22 16:10 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/8/execution/node/1137/log

daosbuild1 avatar Oct 07 '22 18:10 daosbuild1

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/9/execution/node/866/log

daosbuild1 avatar Oct 08 '22 06:10 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/9/execution/node/1094/log

daosbuild1 avatar Oct 09 '22 12:10 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10441/9/execution/node/1183/log

daosbuild1 avatar Oct 09 '22 21:10 daosbuild1

@wangdi1 @liuxuezhao, would you please help review the patch? Thanks!

Nasf-Fan avatar Oct 21 '22 02:10 Nasf-Fan

Ping @daos-stack/daos-gatekeeper , thanks!

Nasf-Fan avatar Oct 26 '22 03:10 Nasf-Fan