daos icon indicating copy to clipboard operation
daos copied to clipboard

DAOS-15937 test: Automate metadata duplicate rpc detection time consuming

Open dinghwah opened this issue 1 year ago • 12 comments

Automate metadata duplicate rpc detection time consuming

Skip-unit-tests: true Skip-fault-injection-test: true Skip-func-hw-test-large-md-on-ssd: false Test-tag: test_metadata_dup_rpc Doc-only: false Required-githooks: true

Signed-off-by: Ding Ho [email protected]

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follows the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

dinghwah avatar May 29 '24 19:05 dinghwah

Errors are Unable to load ticket data https://daosio.atlassian.net/browse/DAOS-15937

github-actions[bot] avatar May 29 '24 19:05 github-actions[bot]

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/2/testReport/

daosbuild1 avatar May 31 '24 03:05 daosbuild1

Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/143/log

daosbuild1 avatar Jun 10 '24 16:06 daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/267/log

daosbuild1 avatar Jun 10 '24 16:06 daosbuild1

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/328/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/329/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/143/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/317/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/381/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/353/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/285/log

daosbuild1 avatar Jun 10 '24 17:06 daosbuild1

Please review the new updates.

dinghwah avatar Jun 27 '24 17:06 dinghwah

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/12/testReport/

daosbuild1 avatar Aug 07 '24 09:08 daosbuild1

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/14/testReport/

daosbuild1 avatar Aug 23 '24 09:08 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/14/testReport/

daosbuild1 avatar Aug 23 '24 09:08 daosbuild1

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/14/testReport/

The job.log has 200,000 lines of output, 100,000 pairs of items like this, that are output during the timing loops. We may want to look at making the output quiet. I think I had experimented with this at some point, but not sure if I can find the prototype code I used since it has been a while.

2024-08-23 08:32:00,394 test_utils_conta L0624 INFO | Opening container 138475AD-324F-4A86-8E25-D2D213FAABEB
2024-08-23 08:32:00,399 test_utils_conta L0645 INFO | Closing container 138475AD-324F-4A86-8E25-D2D213FAABEB

Failure reason #Dup rpc detection time 13.781859397888184 > baseline_time 5.202355599403381 * threshold_factor 2.5 but the actual time comparison should be 5.x versus 19..x seconds, which is the "stabilized" time after getting through the initial svc_ops_entry_age:60 seconds period

From the job.log, the first 10 test-loops with svc_ops_enabled:1

2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 0,  8.415847
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 1,  13.758781
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 2,  18.333476
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 3,  19.055028
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 4,  19.116618
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 5,  19.093971
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 6,  19.083379
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 7,  19.085228
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 8,  19.120708
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 9,  19.120152

I have an experimental PR https://github.com/daos-stack/daos/pull/14997 to try to establish a baseline timing expectation for pools with svc_ops_enabled:1 and svc_ops_entry_age:60, on a variety of HW cluster configurations, by running the existing C code daos_test -c --subtests="34" (co_op_dup_timing).

That test is run daily, but is done on Functional Hardware Medium Verbs Provider, and Functional Hardware Medium Verbs Provider MD on SSD configurations (whereas this PR executes on Functional Hardware Medium, and Functional Hardware Medium MD on SSD).

Experimental PR 14997 results:

  • Functional HW Medium: after running svc_ops_entry_age:60 warmup loop of container open/close, a time seen for for svc_ops_enabled;1 1000 loops of container open/close was 2.969 seconds
2024-08-28 18:31:12,452 process          L0416 DEBUG| [stdout] [1,0]<stdout>:1K open/close pairs in    2.969 sec
  • Functional HW Medium MD on SSD: after running svc_ops_entry_age:60 warmup loop of container open/close, a time seen for for svc_ops_enabled;1 1000 loops of container open/close was 1.617 seconds (versus 19.x seconds seen in this PR)
2024-08-28 18:27:15,163 process          L0416 DEBUG| [stdout] [1,0]<stdout>:1K open/close pairs in    1.617 sec

kccain avatar Aug 23 '24 12:08 kccain

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/15/testReport/

daosbuild1 avatar Aug 27 '24 16:08 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/15/testReport/

daosbuild1 avatar Aug 27 '24 16:08 daosbuild1

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/16/testReport/

daosbuild1 avatar Aug 27 '24 19:08 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/16/testReport/

daosbuild1 avatar Aug 27 '24 20:08 daosbuild1

updated my comment from a few days ago, after updating the debug PR to get a proper 1000 container open/close loop timing in the daos_test -c co_op_dup_timing() with svc_ops_enabled:1 and svc_ops_entry_age:60, and preceding the 1000 loop timing with a 60 second warmup loop.

The timings measured were 2.969 seconds (Functional HW Medium) and 1.617 seconds (Functional HW Medium MD on SSD)

kccain avatar Aug 28 '24 19:08 kccain