DAOS-17861 cart: avoid sending RPC reply repeatedly - b26
When handling a collective RPC, a failure may occur before the RPC handler is invoked for local processing on the node, in which case crt_hg_reply_send() may be triggered. Later in the same path, crt_rpc_handler_common() calls crt_hg_reply_error_send(), which replies to the same RPC a second time. The second reply has been observed to fail with NA_BUSY, causing the callback for the first reply to be blocked or lost, so the reference on the RPC cannot be released. Such a leaked RPC may trigger an assertion in UCX environments when the related CaRT context is destroyed.
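The double-reply problem described above can be illustrated with a minimal sketch of a "reply at most once" guard. This is not the code from the patch: `struct demo_rpc`, `demo_reply_send_once()`, `demo_reply_complete()` and `demo_handle_failed_collective()` are hypothetical names, and the real CaRT structures and reply paths differ. It only assumes that a per-RPC flag records whether a reply has already been submitted, so that a later error reply becomes a no-op and the reply reference is taken and dropped exactly once.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an RPC descriptor; not the real CaRT type. */
struct demo_rpc {
	int  refcount;   /* references currently held on the RPC */
	bool reply_sent; /* set once any reply has been submitted */
	int  replies;    /* number of replies actually submitted */
	int  reply_rc;   /* status carried by the submitted reply */
};

/* Submit a reply at most once. Returns true if this call sent it. */
static bool
demo_reply_send_once(struct demo_rpc *rpc, int rc)
{
	if (rpc->reply_sent)
		return false; /* an earlier path already replied; skip */

	rpc->reply_sent = true;
	rpc->reply_rc   = rc;
	rpc->refcount++; /* reference held until the reply completes */
	rpc->replies++;
	return true;
}

/* Reply completion callback: drop the reference taken at send time. */
static void
demo_reply_complete(struct demo_rpc *rpc)
{
	assert(rpc->refcount > 0);
	rpc->refcount--;
}

/* Simulate an early failure followed by the common handler's error reply. */
static void
demo_handle_failed_collective(struct demo_rpc *rpc, int rc)
{
	demo_reply_send_once(rpc, rc); /* early failure path replies first */
	demo_reply_send_once(rpc, rc); /* common handler: now a no-op */
	demo_reply_complete(rpc);      /* single completion, single release */
}
```

With the guard in place, only one reply is submitted per RPC, so the completion callback runs once and the reference count returns to its starting value.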
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Segmentation fault against UCX provider during CR test' Status is 'In Progress' Labels: '2.8pp,scrubbed_2.6.5,scrubbed_2.8' https://daosio.atlassian.net/browse/DAOS-17861
Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17230/2/execution/node/1497/log
test_pool_destroy_with_io failed because of DAOS-18327, which is unrelated to this patch; it will be retested.
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17230/3/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17230/4/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17230/7/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17230/8/testReport/