daos icon indicating copy to clipboard operation
daos copied to clipboard

DAOS-623 rebuild: uniform identifier in logs part 1

Open kccain opened this issue 1 year ago • 3 comments

To the extent possible in the rebuild code execution flow, when rebuild emits log messages, include a uniform rebuild operation identifier in those messages. This covers activities across all pool storage engines (including the pool service leader), system and per-target threads/xstreams, and dynamically spawned user-level threads.

The motivation is to enable some amount of automated searching through logfiles for all (or specific) rebuilds that occurred during execution, and speed up DAOS engineer analysis/interpretation of the logs.

The baseline format (defined in the DF_RB macro) is: "rb=" DF_UUID "/%u/" DF_U64 "/%u/%u/%s" and corresponds to: <pool_uuid>/<rebuild_ver>/<ps_term>/<rebuild_gen>/<ps_rank>/<opc_str>

A verbose format (defined in the DF_RBF macro) adds the following (for <engine_rank>:<tgt_idx>) r:t=%u:%d"

Various DP_RB_* and DP_RBF_* macros are defined to specify the arguments to go with the DF_RB and DF_RBF formats, given some common rebuild implementation structures such as: struct rebuild_global_pool_tracker struct rebuild_tgt_pool_tracker struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input) struct migrate_query_arg

This initial patch covers the pool service leader execution in functions (and those that they invoke) such as: rebuild_ults() rebuild_task_ult() rebuild_leader_start() rebuild_leader_status_check() rebuild_leader_status_notify().

And this patch covers "scan side" execution in all pool storage engines (including the leader), in functions such as: rebuild_tgt_scan_handler() rebuild_tgt_status_check_ult() ds_migrate_query_status() migrate_check_one() dss_rebuild_check_one() rebuild_scan_leader() rebuild_scanner() rebuild_objects_send_ult() rebuild_scan_done()

Features: rebuild Allow-unstable-test: true Skip-list: test_ec_degrade,DAOS-15843

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [x] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [x] Commit messages follows the guidelines outlined here.
  • [x] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

kccain avatar May 15 '24 23:05 kccain

Ticket title is 'rebuild enhancement: uniform identifier in log messages' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-16111

github-actions[bot] avatar May 15 '24 23:05 github-actions[bot]

@liuxuezhao @gnailzenh here is a draft patch for what I discussed today. Can you take a look and provide some early comments on the approach?

kccain avatar May 15 '24 23:05 kccain

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/2/execution/node/1202/log

daosbuild1 avatar May 16 '24 09:05 daosbuild1

Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/145/log

daosbuild1 avatar May 21 '24 16:05 daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/348/log

daosbuild1 avatar May 21 '24 16:05 daosbuild1

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/374/log

daosbuild1 avatar May 21 '24 16:05 daosbuild1

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/349/log

daosbuild1 avatar May 21 '24 16:05 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/369/log

daosbuild1 avatar May 21 '24 16:05 daosbuild1

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/7/display/redirect

daosbuild1 avatar May 21 '24 17:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/7/display/redirect

daosbuild1 avatar May 21 '24 17:05 daosbuild1

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/8/display/redirect

daosbuild1 avatar May 21 '24 22:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/8/display/redirect

daosbuild1 avatar May 21 '24 22:05 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/651/log

daosbuild1 avatar May 29 '24 04:05 daosbuild1

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/667/log

daosbuild1 avatar May 29 '24 05:05 daosbuild1

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/505/log

daosbuild1 avatar May 29 '24 05:05 daosbuild1

all per-PR tests passed. Some full_regression (weekly regresison tests) failed and are known issues (listed at the end)

But this one from Functional HW Large testing that includes an engine assertion seems new.] I don't see a direct link between the changes in this PR and the aggregation engine assertion and stacktrace. @liuxuezhao do you think this patch could have caused it? Or possibly it is a rare issue not yet seen in the testing?

  • erasurecode/offline_rebuild_aggregation.py failure

ERROR: daos_engine:0 05/29-00:48:44.26 wolf-113 DAOS[185489/8/640] vos EMRG src/vos/vos_aggregate.c:640 recx2ext() Assertion 'recx->rx_nr > 0' failed

For now I've prepared a test-only PR 14474 just the same master commit base as this patch, to see if it reproduces independently.

Functional HW Large testing with known existing failures:

Functional HW Large MD on SSD testing with known existing failures:

  • erasurecode/online_rebuild.py cases 2 and 4 failures are also an instance of DAOS-15863

Functional HW Medium MD on SSD testing with known existing failures:

  • deployment/server_rank_failure.py cases 1 and 2 failures are instances of DAOS-15809

kccain avatar May 29 '24 22:05 kccain

"@liuxuezhao do you think this patch could have caused it? Or possibly it is a rare issue not yet seen in the testing?" I think that assertion in vos agg code path should be unrelated with this PR, I created https://daosio.atlassian.net/browse/DAOS-15941 for it.

liuxuezhao avatar May 30 '24 03:05 liuxuezhao

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/13/execution/node/1153/log

daosbuild1 avatar May 30 '24 19:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/13/execution/node/1153/log

looks like some CI infrastructure issue "CommandFailure("ERROR: Copying yaml configuration file to wolf-231vm[1,8]: Timeout detected running 'clush -S -v -w wolf-231vm[1,8] /usr/bin/mkdir -p /etc/daos' with a 15s timeout",)"

Kicking off the testing again from the Test stage ...

kccain avatar May 30 '24 21:05 kccain

Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/15/execution/node/146/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/15/execution/node/323/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/15/display/redirect

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/15/display/redirect

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/145/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/303/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/365/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/300/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/355/log

daosbuild1 avatar May 31 '24 04:05 daosbuild1

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/16/display/redirect

daosbuild1 avatar May 31 '24 05:05 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/16/display/redirect

daosbuild1 avatar May 31 '24 05:05 daosbuild1