
DAOS-16251 pool: DEBUG patch, IV pool map buf investigation

Open kccain opened this issue 1 year ago • 3 comments

Adds debug logging to the IV code to help examine pool map buffer corruption scenarios:

  • Guards against a possibly uninitialized d_sg_list_t in crt_hdlr_iv_sync_aux() and call_pre_sync_cb(), which could in theory affect the pool map buffer contents delivered through IV communication, and adds associated logging.
  • crt_ivsync_issue_rpc() now explicitly logs whether a bulk or an inline corpc will be used, to correspond with the crt_hdlr_iv_sync_aux() and call_pre_sync_cb() logging.
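
As a rough illustration of the first bullet, the sketch below shows why zero-initializing a scatter-gather list matters: a list that is declared but never filled then has a detectable "empty" state instead of garbage counts and pointers. The demo_* types and functions are hypothetical stand-ins written for this example; they are not the real d_sg_list_t definition or the code changed by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins, NOT the real DAOS d_iov_t / d_sg_list_t. */
typedef struct {
    void   *iov_buf;      /* buffer address */
    size_t  iov_buf_len;  /* buffer capacity in bytes */
    size_t  iov_len;      /* valid data length in bytes */
} demo_iov_t;

typedef struct {
    uint32_t    sg_nr;      /* number of entries in sg_iovs */
    uint32_t    sg_nr_out;  /* number of entries actually used */
    demo_iov_t *sg_iovs;    /* array of buffers */
} demo_sg_list_t;

/* Returns 1 when the list carries no buffers, so a caller can skip it. */
static int demo_sgl_is_empty(const demo_sg_list_t *sgl)
{
    assert(sgl != NULL);
    return sgl->sg_nr == 0 || sgl->sg_iovs == NULL;
}

/* Logs the list shape and returns the total number of valid bytes. */
static size_t demo_sgl_log(const char *where, const demo_sg_list_t *sgl)
{
    size_t   total = 0;
    uint32_t i;

    if (demo_sgl_is_empty(sgl)) {
        fprintf(stderr, "%s: empty sgl, nothing to copy\n", where);
        return 0;
    }
    for (i = 0; i < sgl->sg_nr; i++)
        total += sgl->sg_iovs[i].iov_len;
    fprintf(stderr, "%s: sg_nr=%u total_len=%zu\n",
            where, (unsigned)sgl->sg_nr, total);
    return total;
}
```

A caller would declare the list as `demo_sg_list_t sgl = {0};` before handing it to a callback. Without the initializer, a code path that never fills the list would leave sg_nr and sg_iovs indeterminate, and a later copy could read from or write to arbitrary memory.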

In case it is needed during the investigation, this change also contains a cherry-pick of PR 14702: DAOS-16164 pool: Update target status to UPIN for no_data_sync mode

Finally, it includes a manual cherry-pick of PR 14971, aaoganez/rpc-bulk-deadlines:

  • Switch RPC headers to carry a deadline instead of a timeout
  • Add checks at the start and end of each bulk transfer to ensure the deadline has not expired.
  • Add deadline expiration checks in all places where rpc_priv timeout is initialized

Allow-unstable-test: true
faults-enabled: false

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

kccain, Aug 14 '24 19:08

Ticket title is 'DAOS 2.4.2-4: Errored DAOS engine 0 exited unexpectedly on daos_user'
Status is 'In Progress'
Labels: 'ALCF,pre_acceptance_issues,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-16251

github-actions[bot], Aug 14 '24 19:08

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14929/2/testReport/

daosbuild1, Aug 16 '24 10:08

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14929/6/execution/node/1270/log

daosbuild1, Aug 27 '24 11:08