DAOS-623 rebuild: uniform identifier in logs part 1
To the extent possible in the rebuild code execution flow, when rebuild emits log messages, include a uniform rebuild operation identifier in those messages. This covers activities across all pool storage engines (including the pool service leader), system and per-target threads/xstreams, and dynamically spawned user-level threads.
The motivation is to enable some amount of automated searching through logfiles for all (or specific) rebuilds that occurred during execution, and speed up DAOS engineer analysis/interpretation of the logs.
The baseline format (defined in the DF_RB macro) is: "rb=" DF_UUID "/%u/" DF_U64 "/%u/%u/%s" and corresponds to: <pool_uuid>/<rebuild_ver>/<ps_term>/<rebuild_gen>/<ps_rank>/<opc_str>
A verbose format (defined in the DF_RBF macro) appends " r:t=%u:%d", corresponding to <engine_rank>:<tgt_idx>.
Various DP_RB_* and DP_RBF_* macros are defined to supply the arguments that go with the DF_RB and DF_RBF formats, given common rebuild implementation structures such as:
- struct rebuild_global_pool_tracker
- struct rebuild_tgt_pool_tracker
- struct rebuild_scan_in (REBUILD_OBJECTS_SCAN RPC input)
- struct migrate_query_arg
This initial patch covers the pool service leader execution in functions (and those they invoke) such as:
- rebuild_ults()
- rebuild_task_ult()
- rebuild_leader_start()
- rebuild_leader_status_check()
- rebuild_leader_status_notify()
The patch also covers "scan side" execution in all pool storage engines (including the leader), in functions such as:
- rebuild_tgt_scan_handler()
- rebuild_tgt_status_check_ult()
- ds_migrate_query_status()
- migrate_check_one()
- dss_rebuild_check_one()
- rebuild_scan_leader()
- rebuild_scanner()
- rebuild_objects_send_ult()
- rebuild_scan_done()
Features: rebuild
Allow-unstable-test: true
Skip-list: test_ec_degrade,DAOS-15843
Before requesting gatekeeper:
- [ ] Two review approvals and any prior change requests have been resolved.
- [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
- [x] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
- [x] Commit messages follow the guidelines outlined here.
- [x] Any tests skipped by the ticket being addressed have been run and passed in the PR.
Gatekeeper:
- [ ] You are the appropriate gatekeeper to be landing the patch.
- [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
- [ ] Githooks were used. If not, request that user install them and check copyright dates.
- [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
- [ ] All builds have passed. Check non-required builds for any new compiler warnings.
- [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
- [ ] If applicable, the PR has addressed any potential version compatibility issues.
- [ ] Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
- [ ] Extra checks if forced landing is requested
- [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
- [ ] No new NLT or valgrind warnings. Check the classic view.
- [ ] Quick-build or Quick-functional is not used.
- [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.
Ticket title is 'rebuild enhancement: uniform identifier in log messages' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-16111
@liuxuezhao @gnailzenh here is a draft patch for what I discussed today. Can you take a look and provide some early comments on the approach?
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/2/execution/node/1202/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/145/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/348/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/374/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/349/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/7/execution/node/369/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/7/display/redirect
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/7/display/redirect
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/8/display/redirect
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/8/display/redirect
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/651/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/667/log
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/10/execution/node/505/log
All per-PR tests passed. Some full_regression (weekly regression) tests failed and are known issues (listed at the end).
But this one from Functional HW Large testing, which includes an engine assertion, seems new. I don't see a direct link between the changes in this PR and the aggregation engine assertion and stack trace. @liuxuezhao do you think this patch could have caused it? Or is it possibly a rare issue not yet seen in testing?
- erasurecode/offline_rebuild_aggregation.py failure
ERROR: daos_engine:0 05/29-00:48:44.26 wolf-113 DAOS[185489/8/640] vos EMRG src/vos/vos_aggregate.c:640 recx2ext() Assertion 'recx->rx_nr > 0' failed
For now I've prepared a test-only PR 14474 with the same master commit base as this patch, to see if the failure reproduces independently.
Functional HW Large testing with known existing failures:
- ior/hard_rebuild.py failure is an instance of DAOS-15863
Functional HW Large MD on SSD testing with known existing failures:
- erasurecode/online_rebuild.py cases 2 and 4 failures are also an instance of DAOS-15863
Functional HW Medium MD on SSD testing with known existing failures:
- deployment/server_rank_failure.py cases 1 and 2 failures are instances of DAOS-15809
"@liuxuezhao do you think this patch could have caused it? Or possibly it is a rare issue not yet seen in the testing?" I think that assertion in the vos aggregation code path should be unrelated to this PR; I created https://daosio.atlassian.net/browse/DAOS-15941 for it.
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/13/execution/node/1153/log
This looks like a CI infrastructure issue: "CommandFailure("ERROR: Copying yaml configuration file to wolf-231vm[1,8]: Timeout detected running 'clush -S -v -w wolf-231vm[1,8] /usr/bin/mkdir -p /etc/daos' with a 15s timeout",)"
Kicking off the testing again from the Test stage ...
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/15/execution/node/146/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/15/execution/node/323/log
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/15/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/15/display/redirect
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/145/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/303/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/365/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/300/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14383/16/execution/node/355/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/16/display/redirect
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14383/16/display/redirect