DAOS-15937 test: Automate time-consuming metadata duplicate RPC detection test
Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-large-md-on-ssd: false
Test-tag: test_metadata_dup_rpc
Doc-only: false
Required-githooks: true
Signed-off-by: Ding Ho [email protected]
Before requesting gatekeeper:
- [ ] Two review approvals and any prior change requests have been resolved.
- [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
- [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
- [ ] Commit messages follow the guidelines outlined here.
- [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.
Gatekeeper:
- [ ] You are the appropriate gatekeeper to be landing the patch.
- [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
- [ ] Githooks were used. If not, request that user install them and check copyright dates.
- [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
- [ ] All builds have passed. Check non-required builds for any new compiler warnings.
- [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
- [ ] If applicable, the PR has addressed any potential version compatibility issues.
- [ ] Check the target branch. If it is the master branch, should the PR go to a feature branch instead? If it is a release branch, does it have merge approval in the JIRA ticket?
- [ ] Extra checks if forced landing is requested
- [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
- [ ] No new NLT or valgrind warnings. Check the classic view.
- [ ] Quick-build or Quick-functional is not used.
- [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.
Bot error: unable to load ticket data from https://daosio.atlassian.net/browse/DAOS-15937
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/2/testReport/
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/143/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/267/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/328/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/7/execution/node/329/log
Test stage Python Bandit check completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/143/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/317/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/381/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/353/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14473/8/execution/node/285/log
Please review the new updates.
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/12/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/14/testReport/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/14/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/14/testReport/
The job.log has 200,000 lines of output: 100,000 pairs of entries like the following, emitted during the timing loops. We may want to look at quieting this output. I experimented with this at some point, but I am not sure I can find the prototype code I used since it has been a while.
2024-08-23 08:32:00,394 test_utils_conta L0624 INFO | Opening container 138475AD-324F-4A86-8E25-D2D213FAABEB
2024-08-23 08:32:00,399 test_utils_conta L0645 INFO | Closing container 138475AD-324F-4A86-8E25-D2D213FAABEB
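One way to suppress those per-iteration messages is to temporarily raise the relevant logger's threshold around the timing loop. This is only a sketch: the logger name `test_utils_container` is inferred from the log-line prefix above and may not match the actual module's logger name.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_logger(name, level=logging.WARNING):
    """Temporarily raise a logger's threshold so per-iteration INFO
    messages (e.g. container open/close) are suppressed during timing
    loops, then restore the original level on exit."""
    logger = logging.getLogger(name)
    old_level = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        logger.setLevel(old_level)


# Usage: wrap the timing loop so the open/close INFO lines are not emitted.
with quiet_logger("test_utils_container"):
    pass  # the container open/close timing loop would run here
```

Raising the level rather than removing handlers keeps any WARNING/ERROR output visible while the loop runs.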
Failure reason
#Dup rpc detection time 13.781859397888184 > baseline_time 5.202355599403381 * threshold_factor 2.5
However, the actual time comparison should be 5.x versus 19.x seconds, i.e. the "stabilized" times reached after getting through the initial svc_ops_entry_age:60 second period.
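A way to avoid comparing against warmup-inflated numbers is to run the loop untimed until the svc_ops_entry_age window has elapsed, and only then take the measurement. A minimal sketch, where `run_loop` is a hypothetical stand-in for the container open/close test loop:

```python
import time

SVC_OPS_ENTRY_AGE = 60   # seconds; the pool's svc_ops_entry_age property
THRESHOLD_FACTOR = 2.5   # pass/fail factor from the failure message above


def stabilized_loop_time(run_loop, warmup_secs=SVC_OPS_ENTRY_AGE):
    """Run the open/close loop untimed until warmup_secs have elapsed,
    then time a single full loop and return its duration."""
    start = time.monotonic()
    while time.monotonic() - start < warmup_secs:
        run_loop()
    t0 = time.monotonic()
    run_loop()
    return time.monotonic() - t0


# The pass/fail check would then compare stabilized times:
#     assert dup_time <= baseline_time * THRESHOLD_FACTOR
```

With both sides measured after warmup, the comparison becomes 19.x versus 5.x * 2.5 rather than mixing warmup and steady-state timings.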
From the job.log, the first 10 test-loops with svc_ops_enabled:1
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 0, 8.415847
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 1, 13.758781
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 2, 18.333476
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 3, 19.055028
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 4, 19.116618
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 5, 19.093971
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 6, 19.083379
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 7, 19.085228
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 8, 19.120708
2024-08-23 08:34:54,578 metadata_svc_ops L0060 INFO | Test time of Metadata test-loop: 9, 19.120152
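The times above climb from 8.4 s and plateau around 19.1 s by loop 3. If the test wanted to discard the ramp-up automatically rather than rely on a fixed warmup, it could detect the plateau from successive loop times. A hypothetical helper (not part of the existing test code):

```python
def first_stable_index(times, tolerance=0.05):
    """Return the index of the first loop time whose relative change
    from the previous loop is below `tolerance`; loops from that index
    on can be treated as stabilized measurements."""
    for i in range(1, len(times)):
        if abs(times[i] - times[i - 1]) / times[i - 1] < tolerance:
            return i
    return len(times)


# Loop times quoted from the job.log excerpt above.
loop_times = [8.415847, 13.758781, 18.333476, 19.055028, 19.116618,
              19.093971, 19.083379, 19.085228, 19.120708, 19.120152]
```

On this data the change from loop 2 to loop 3 is about 3.9%, so with a 5% tolerance the helper treats loop 3 onward as stable.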
I have an experimental PR https://github.com/daos-stack/daos/pull/14997 to try to establish a baseline timing expectation for pools with svc_ops_enabled:1 and svc_ops_entry_age:60, on a variety of HW cluster configurations, by running the existing C code daos_test -c --subtests="34" (co_op_dup_timing).
That test is run daily, but is done on Functional Hardware Medium Verbs Provider, and Functional Hardware Medium Verbs Provider MD on SSD configurations (whereas this PR executes on Functional Hardware Medium, and Functional Hardware Medium MD on SSD).
Experimental PR 14997 results:
- Functional HW Medium: after running a svc_ops_entry_age:60 warmup loop of container open/close, the time seen with svc_ops_enabled:1 for 1000 loops of container open/close was 2.969 seconds
2024-08-28 18:31:12,452 process L0416 DEBUG| [stdout] [1,0]<stdout>:1K open/close pairs in 2.969 sec
- Functional HW Medium MD on SSD: after running a svc_ops_entry_age:60 warmup loop of container open/close, the time seen with svc_ops_enabled:1 for 1000 loops of container open/close was 1.617 seconds (versus the 19.x seconds seen in this PR)
2024-08-28 18:27:15,163 process L0416 DEBUG| [stdout] [1,0]<stdout>:1K open/close pairs in 1.617 sec
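Some back-of-envelope arithmetic on the numbers quoted above. The per-pair costs come directly from the 1K-pair daos_test timings; the 12x ratio additionally assumes the PR's test loop also performs 1000 open/close pairs, which the log excerpt does not confirm.

```python
# Per-pair cost from the 1K open/close timings quoted above.
PAIRS = 1000

hw_medium_sec = 2.969    # Functional HW Medium, daos_test baseline
md_on_ssd_sec = 1.617    # Functional HW Medium MD on SSD, daos_test baseline
pr_loop_sec = 19.1       # approximate stabilized loop time seen in this PR

per_pair_ms_hw = hw_medium_sec / PAIRS * 1000    # ~3.0 ms per pair
per_pair_ms_ssd = md_on_ssd_sec / PAIRS * 1000   # ~1.6 ms per pair

# If the PR's loop is also 1000 pairs (an assumption), the PR's
# stabilized loop is roughly 12x slower than the daos_test baseline:
ratio = pr_loop_sec / md_on_ssd_sec
```

That gap suggests the slowdown is not inherent to MD on SSD hardware but to something in how the PR's loop exercises the duplicate-RPC path.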
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/15/testReport/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/15/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/16/testReport/
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14473/16/testReport/
Updated my comment from a few days ago, after updating the debug PR to obtain a proper 1000 container open/close loop timing in daos_test -c co_op_dup_timing() with svc_ops_enabled:1 and svc_ops_entry_age:60, preceding the 1000-loop timing with a 60-second warmup loop.
The timings measured were 2.969 seconds (Functional HW Medium) and 1.617 seconds (Functional HW Medium MD on SSD).