
DAOS-16111 rebuild: uniform identifier in logs part 2 (migrate)

Open kccain opened this issue 1 year ago • 11 comments

When rebuild emits log messages, include a uniform rebuild operation identifier. This change adjusts existing logging for rebuild migrate activities. A previous patch added the same operation identifier in log messages by the PS leader and storage engine scan activities.

Reminder, the baseline format (defined in DF_RB) for the uniform identifier is: "rb=" DF_UUID "/%u/%u/%s" that corresponds to: <pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string>
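
To make the baseline format concrete, here is a minimal stand-alone sketch of what a formatted identifier looks like. SKETCH_DF_RB and sketch_format_rb() are hypothetical names used only for illustration, and a plain "%s" is assumed in place of DF_UUID so that the sketch builds outside the DAOS tree; it is not the definition used in the patch.

```c
#include <stdio.h>

/* Hypothetical stand-in for DF_RB. The real macro uses DF_UUID for the pool
 * UUID; a plain "%s" is assumed here so the sketch builds outside DAOS. */
#define SKETCH_DF_RB "rb=%s/%u/%u/%s"

/* Format <pool_uuid>/<rebuild_ver>/<rebuild_gen>/<opcode_string> into buf.
 * Returns the snprintf() result (the length that would have been written). */
static int
sketch_format_rb(char *buf, size_t len, const char *pool_uuid,
		 unsigned int ver, unsigned int gen, const char *opc)
{
	return snprintf(buf, len, SKETCH_DF_RB, pool_uuid, ver, gen, opc);
}
```

With made-up sample values (pool UUID prefix "a1b2c3d4", version 17, generation 2, opcode string "Rebuild"), the buffer would hold rb=a1b2c3d4/17/2/Rebuild.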

This change adds DP_RB_OMI, DP_RB_MPT, and DP_RB_MRO macros that accept struct obj_migrate_in *omi, struct migrate_pool_tls *mpt, and struct migrate_one *mro, respectively, and provide the values needed for the DF_RB logging format. The patch applies them throughout the existing logging performed in migrate activities.
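
The general pattern of a format macro paired with an argument macro can be sketched as follows. This is only an illustration under stated assumptions: struct sketch_mpt and its field names, SKETCH_DF_RB, SKETCH_DP_RB_MPT, and sketch_log_migrate() are all hypothetical and do not reflect the actual layout of struct migrate_pool_tls or the macro bodies in the patch.

```c
#include <stdio.h>

/* Hypothetical stand-in for DF_RB ("%s" assumed in place of DF_UUID). */
#define SKETCH_DF_RB "rb=%s/%u/%u/%s"

/* Hypothetical, simplified stand-in for struct migrate_pool_tls; the field
 * names below are assumptions for illustration only. */
struct sketch_mpt {
	const char  *pool_uuid;
	unsigned int rebuild_ver;
	unsigned int rebuild_gen;
	const char  *opc_str;
};

/* Expands to the four arguments that SKETCH_DF_RB consumes, in order. */
#define SKETCH_DP_RB_MPT(mpt) \
	(mpt)->pool_uuid, (mpt)->rebuild_ver, (mpt)->rebuild_gen, (mpt)->opc_str

/* Write one log line, prefixed with the uniform identifier, into buf. */
static int
sketch_log_migrate(char *buf, size_t len, const struct sketch_mpt *mpt,
		   const char *msg)
{
	return snprintf(buf, len, SKETCH_DF_RB " %s", SKETCH_DP_RB_MPT(mpt), msg);
}
```

The benefit of this shape is that every call site spells the identifier the same way, so a single search string matches all messages for one rebuild operation.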

This change also adds logic to rebuild_leader_status_check() in the form of a new function, warn_for_slow_engine_updates(). It allows a PS leader engine to emit warnings when an engine has not reported its rebuild progress (via IV) for a long time, making it easier for an engineer to identify which engine(s) may be causing a stuck rebuild. The warning messages are throttled to avoid flooding the log file.
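
One way such a throttled warning could work is sketched below. This is not the code from the patch: the struct, the function name sketch_warn_slow_engines(), both thresholds, and the per-engine interval-based throttle are assumptions chosen for illustration.

```c
#include <stdio.h>

#define SKETCH_SLOW_SECS     120 /* assumed: silence that counts as "slow" */
#define SKETCH_WARN_INTERVAL  30 /* assumed: minimum gap between warnings  */

/* Hypothetical per-engine bookkeeping kept by the PS leader. */
struct sketch_engine {
	unsigned int rank;
	double       last_update; /* time of last progress report (seconds) */
	double       last_warn;   /* time of last warning, or < 0 if never  */
};

/* Warn about engines whose last report is at least SKETCH_SLOW_SECS old,
 * at most once per SKETCH_WARN_INTERVAL per engine.
 * Returns the number of warnings emitted by this call. */
static int
sketch_warn_slow_engines(struct sketch_engine *eng, int n, double now)
{
	int i, warned = 0;

	for (i = 0; i < n; i++) {
		double idle = now - eng[i].last_update;

		if (idle < SKETCH_SLOW_SECS)
			continue; /* engine reported recently enough */
		if (eng[i].last_warn >= 0 &&
		    now - eng[i].last_warn < SKETCH_WARN_INTERVAL)
			continue; /* throttled: warned too recently */
		fprintf(stderr, "rank %u: no rebuild progress update for %.0f s\n",
			eng[i].rank, idle);
		eng[i].last_warn = now;
		warned++;
	}
	return warned;
}
```

Calling this from a periodic status check keeps the cost at one pass over the engine table, while the per-engine timestamp bounds how often any single slow engine can add a line to the log.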

Features: rebuild
Allow-unstable-test: true

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

kccain avatar Jun 11 '24 14:06 kccain

Ticket title is 'rebuild enhancement: uniform identifier in log messages' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-16111

github-actions[bot] avatar Jun 11 '24 14:06 github-actions[bot]

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14547/3/testReport/

daosbuild1 avatar Jun 11 '24 16:06 daosbuild1

@liuxuezhao I have this follow-on to PR https://github.com/daos-stack/daos/pull/14383 (which is awaiting landing to master branch).

While I try to get this PR through CI (including with Features: rebuild), if you have some time can you start to review this change? Thanks.

kccain avatar Jun 11 '24 20:06 kccain

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14547/10/testReport/

daosbuild1 avatar Jun 26 '24 16:06 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/12/execution/node/1184/log

(kccain edit) rebuild/cascading_failures.py : instance of known issue https://daosio.atlassian.net/browse/DAOS-15994

daosbuild1 avatar Jun 27 '24 03:06 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/12/execution/node/1421/log

(kccain edit) nvme/pool_extend.py: engine assert in free(), likely due to my patch not checking the index into the new array in struct rebuild_global_pool_tracker.

daosbuild1 avatar Jun 27 '24 17:06 daosbuild1

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/12/execution/node/1442/log

(kccain edit) nvme/pool_extend.py: engine assert in free(), likely due to my patch not checking the index into the new array in struct rebuild_global_pool_tracker.

ior/hard_rebuild.py is a known issue documented in https://daosio.atlassian.net/browse/DAOS-15863?focusedCommentId=130026

daosbuild1 avatar Jun 27 '24 18:06 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/12/execution/node/1607/log

(kccain edit) deployment/target_failure.py failure is a known issue https://daosio.atlassian.net/browse/DAOS-16109

daosbuild1 avatar Jun 30 '24 17:06 daosbuild1

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/12/execution/node/1628/log

(kccain edit) deployment/server_rank_failure.py known issue, this failure documented in https://daosio.atlassian.net/browse/DAOS-15809?focusedCommentId=130033

deployment/target_failure.py failure is a known issue https://daosio.atlassian.net/browse/DAOS-16109

daosbuild1 avatar Jun 30 '24 18:06 daosbuild1

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/12/execution/node/1946/log

(kccain edit) all 28 variants failed to run due to cart_ctl utility failing, known issue https://daosio.atlassian.net/browse/DAOS-16008

daosbuild1 avatar Jun 30 '24 18:06 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/14/execution/node/1184/log

(kccain edit) control/ms_failover.py failure is a known issue https://daosio.atlassian.net/browse/DAOS-16103

daosbuild1 avatar Jul 01 '24 20:07 daosbuild1

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14547/14/testReport/

(kccain edit) test_daos_rebuild_ec - REBUILD45 failure is a known issue https://daosio.atlassian.net/browse/DAOS-16035

daosbuild1 avatar Jul 06 '24 03:07 daosbuild1

Testing has gone well in Jenkins build 12 (Features: rebuild) and build 14 (pr test_nvme_pool_extend).

Ready for review.

The patch has a conflict with the latest master (src/object/srv_obj_migrate.c), which I will address at the same time as resolving reviewer comments.

kccain avatar Jul 07 '24 21:07 kccain

@liuxuezhao and @wangshilong will you have a chance to review this code change? It is a follow-on to https://github.com/daos-stack/daos/pull/14383, and I would like both to be included in an upcoming release. Thanks!

kccain avatar Aug 19 '24 20:08 kccain

@liuxuezhao and @wangshilong will you have a chance to review this code change? It is a follow-on to #14383, and I would like both to be included in an upcoming release. Thanks!

I really think that patches like this can help developers debug difficult problems, including those at scale, by making it easier to search for logging related to a specific rebuild operation.

I guess the next opportunity to get the part 1 patch (already landed, https://github.com/daos-stack/daos/pull/14383) and this one into a release would be the DAOS 2.8 community release? Unless we're possibly too late even for that?

kccain avatar Oct 30 '24 16:10 kccain

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14547/15/testReport/

daosbuild1 avatar Nov 13 '24 17:11 daosbuild1

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14547/16/testReport/

daosbuild1 avatar Nov 13 '24 18:11 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/17/execution/node/1432/log

daosbuild1 avatar Nov 14 '24 12:11 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14547/17/execution/node/1432/log

erasurecode/multiple_target_failure.py test failure is an instance of known issue https://daosio.atlassian.net/browse/DAOS-16766

kccain avatar Nov 14 '24 17:11 kccain

ping reviewers

kccain avatar Nov 22 '24 12:11 kccain