DAOS-15426 telemetry: Export missing stats metrics
The Prometheus exporter is missing a few stats metrics that would make some things easier to graph:
- sum
- sample_size
- sum_of_squares
Fixes the Min/Max/Sum methods to return uint64, as this is the underlying data type. Callers should adjust as necessary.
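For context on why these three stats are worth exporting together: with sum, sample_size, and sum_of_squares available, the mean and standard deviation can be recovered on the graphing side. This is a minimal illustrative sketch (plain Python, not the actual Go exporter code or the DAOS API):

```python
# Hypothetical sketch of the three stats this PR exports: sum, sample_size,
# and sum_of_squares. The class and method names here are illustrative only.
import math


class RunningStats:
    def __init__(self):
        self.sample_size = 0     # "sample_size" metric
        self.total = 0           # "sum" metric
        self.sum_of_squares = 0  # "sum_of_squares" metric

    def add(self, value):
        """Record one sample; all three stats update incrementally."""
        self.sample_size += 1
        self.total += value
        self.sum_of_squares += value * value

    def mean(self):
        return self.total / self.sample_size

    def stddev(self):
        # Population standard deviation derived purely from the exported
        # stats -- this is what makes them convenient to graph.
        return math.sqrt(self.sum_of_squares / self.sample_size - self.mean() ** 2)
```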
Required-githooks: true
Change-Id: I5f5a6864cc400b1f146c18723a1a5dec8d0c3b2a
Signed-off-by: Michael MacDonald [email protected]
Ticket title is 'Prometheus metrics do not include sample_size for stats metrics'
Status is 'In Review'
Errors are Unknown component
https://daosio.atlassian.net/browse/DAOS-15426
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13979/2/display/redirect
OK, as usual, touching .py files in a PR increased the toil on this work, but it looks like it's all sorted out now:
Preparing to run the ./control/dmg_telemetry_basic.py test on repeat 1/1
[Test 10/31] Running the ./control/dmg_telemetry_basic.py test on repetition 1/1
JOB ID : 1b78ea66e87980e8f5d8bf8152b6948a2a4766a8
JOB LOG : /var/tmp/ftest/avocado/job-results/job-2024-03-14T16.46-1b78ea6/job.log
(1/2) ./control/dmg_telemetry_basic.py:TestWithTelemetryBasic.test_telemetry_list;run-container-hosts-pool-server_config-engines-0-storage-0-test-timeouts-2f2e: PASS (89.15 s)
(2/2) ./control/dmg_telemetry_basic.py:TestWithTelemetryBasic.test_container_telemetry;run-container-hosts-pool-server_config-engines-0-storage-0-test-timeouts-2f2e: CANCEL: Skipping until DAOS-8720 is fixed. (0.42 s)
RESULTS : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 1
That test uses TelemetryUtils.get_all_server_metrics() which returns the giant list of expected metric names that has been updated in this PR.
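To sketch the idea behind the helper (the real TelemetryUtils in the DAOS ftest code is much larger; names and suffixes below are illustrative, not the actual implementation):

```python
# Hypothetical sketch of a get_all_server_metrics()-style helper: build the
# expected metric-name list by expanding per-stat suffixes over base names,
# rather than maintaining one giant hand-written list.
STATS_SUFFIXES = ("min", "max", "mean", "stddev",
                  "sum", "samplesize", "sumsquares")  # illustrative suffixes


def expand_stats_metrics(bases, suffixes=STATS_SUFFIXES):
    """Expand each base metric name with every expected stats suffix."""
    return ["{}_{}".format(base, suffix)
            for base in bases
            for suffix in suffixes]
```

Adding a new exported stat (like sum_of_squares in this PR) then becomes a one-line change to the suffix tuple instead of edits scattered through a long list.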
@daltonbohning, @phender: You guys good with the test changes? I think the helper makes this far more maintainable.
Functional on EL 8.8 Test Results (old)
135 tests: 131 :white_check_mark: / 4 :zzz: / 0 :x: | 41 suites | 41 files | 1h 39m 2s :stopwatch:
Results for commit cc4c4b49.
Functional on EL 9 Test Results (old)
135 tests: 131 :white_check_mark: / 4 :zzz: / 0 :x: | 41 suites | 41 files | 1h 50m 54s :stopwatch:
Results for commit cc4c4b49.
why the force push?
I pushed a commit to fix a typo in the merge commit, realized quickly that the fix had another typo, and force-pushed a fix for that into the top commit. I didn't force-push any of the earlier commits in the stack, so the "changes since last review" feature still works as usual.
Functional Hardware Large Test Results (old)
64 tests: 64 :white_check_mark: / 0 :zzz: / 0 :x: | 14 suites | 14 files | 28m 49s :stopwatch:
Results for commit cc4c4b49.
Overall LGTM. In general, we have to be careful because some tests do this (arguably that's bad practice, but fixing it requires some rework). But I think we're okay here. Several telemetry tests do this, e.g.:
https://github.com/daos-stack/daos/blob/29741271b88f76a11a6a648a7a7571ddf9da73b9/src/tests/ftest/telemetry/dkey_akey_enum_punch.py#L361-L362
Ugh, that's terrible. Would be better to create some helpers to pluck out the correct names or something like that.
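A helper along those lines could select metric names by pattern instead of hardcoding them in each test. This is only a sketch of the suggestion, not actual DAOS test code:

```python
# Hypothetical "pluck out the correct names" helper: filter the full
# expected-metrics list with a glob pattern, so individual tests never
# hardcode metric names or list positions.
import fnmatch


def pick_metrics(all_names, pattern):
    """Return the metric names matching a glob pattern, e.g. '*_dkey_*'."""
    return sorted(name for name in all_names if fnmatch.fnmatch(name, pattern))
```

A test would then call something like `pick_metrics(all_server_metrics, "*enum*")` and stay correct even when the master list is reordered or extended.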
Raaargh. The Functional HW-Medium testing stages ran, but failed to upload their results to GitHub.
HW-Medium: https://github.com/daos-stack/daos/actions/runs/8283417265/job/22672448346
All avocado tests passed!
...
2024-03-15 17:17:58 +0000 - publish - INFO - Finished reading 34 files in 0.01 seconds
Traceback (most recent call last):
File "/action/publish_test_results.py", line 546, in <module>
main(settings, gha)
File "/action/publish_test_results.py", line 269, in main
Publisher(settings, gh, gha).publish(stats, results.case_results, conclusion)
File "/action/publish/publisher.py", line 209, in __init__
self._repo = gh.get_repo(self._settings.repo)
File "/usr/local/lib/python3.8/site-packages/github/MainClass.py", line 380, in get_repo
headers, data = self.__requester.requestJsonAndCheck("GET", url)
File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 494, in requestJsonAndCheck
return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 525, in __check
raise self.createException(status, responseHeaders, data)
github.GithubException.BadCredentialsException: 401 {"message": "Bad credentials", "documentation_url": "https://docs.github.com/rest"}
The HW-Medium (Verbs) tests ran and mostly passed, but something crashed and then the status upload failed: https://github.com/daos-stack/daos/actions/runs/8283417265/job/22672448607
Interrupted avocado jobs detected!
ERROR: Core stack trace files detected!
...
2024-03-15 19:20:22 +0000 - publish - INFO - Reading JUnit XML files Functional Hardware Medium Verbs Provider/**/results.xml (7 files, 222.5 KiB)
2024-03-15 19:20:22 +0000 - publish - INFO - Finished reading 7 files in 0.01 seconds
Traceback (most recent call last):
File "/action/publish_test_results.py", line 546, in <module>
main(settings, gha)
File "/action/publish_test_results.py", line 269, in main
Publisher(settings, gh, gha).publish(stats, results.case_results, conclusion)
File "/action/publish/publisher.py", line 209, in __init__
self._repo = gh.get_repo(self._settings.repo)
File "/usr/local/lib/python3.8/site-packages/github/MainClass.py", line 380, in get_repo
headers, data = self.__requester.requestJsonAndCheck("GET", url)
File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 494, in requestJsonAndCheck
return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 525, in __check
raise self.createException(status, responseHeaders, data)
github.GithubException.BadCredentialsException: 401 {"message": "Bad credentials", "documentation_url": "https://docs.github.com/rest"}
Downloaded the archive and looked at the crash... seems like the usual case of orterun not liking being cancelled:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f59b77e6638 in PMIx_Finalize ()
from /usr/mpi/gcc/openmpi-4.1.5a1/lib64/libpmix.so.2
[Current thread is 1 (Thread 0x7f59bd65f6c0 (LWP 102641))]
No other stack traces in the archive. As for the actual failure, looks like the following:
2024/03/15 07:20:00 DEBUG run_local: 2024-03-15 15:43:52,667 test L0530 INFO | START 18-./daos_test/suite.py:DaosCoreTest.test_daos_rebuild_simple;run-agent_config-transport_config-daos_tests-args-daos_test-num_clients
-pools_created-scalable_endpoint-stopped_ranks-test_name-dmg-hosts-pool-server_config-engines-0-storage-0-1-timeouts-c53b
2024/03/15 07:20:00 DEBUG run_local: 2024-03-15 15:43:53,150 test L0479 INFO | ==> Step 1: setUp(): Starting servers [elapsed since last step: 0.49s]
2024/03/15 07:20:00 DEBUG run_local: 2024-03-15 15:43:57,710 test L0479 INFO | ==> Step 2: setUp(): Starting agents [elapsed since last step: 4.56s]
2024/03/15 07:20:00 DEBUG run_local: 2024-03-15 15:43:58,216 test L0479 INFO | ==> Step 3: setUp(): Destroying any existing pools before the test [elapsed since last step: 0.51s]
2024/03/15 07:20:00 DEBUG run_local: 2024-03-15 15:43:59,319 test L0479 INFO | ==> Step 4: setUp(): Setup complete [elapsed since last step: 1.10s]
2024/03/15 07:20:00 DEBUG run_local: 2024-03-15 16:15:23,117 test L0479 INFO | ==> Step 5: tearDown(): Called due to exceeding the 1890s test timeout [elapsed since last step: 1883.80s]
No exact matches in JIRA, but a few similar issues. In any case, I'm reasonably confident that this failure in the rebuild tests is not due to my adding a few stats metrics to the Prometheus exporter.
@daos-stack/daos-gatekeeper: Please force-land this when possible.