
DAOS-15426 telemetry: Export missing stats metrics

Open · mjmac opened this issue 1 year ago · 11 comments

The Prometheus exporter is missing a few stats metrics that would make some things easier to graph:

  • sum
  • sample_size
  • sum_of_squares

Fixes the Min/Max/Sum methods to return uint64, as this is the underlying data type. Callers should adjust as necessary.
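For graphing, these three values are enough to derive the mean and standard deviation on the dashboard side, without the exporter having to track either directly. A minimal sketch of the arithmetic (the `derive_stats` helper here is illustrative, not code from this PR):

```python
import math

def derive_stats(total: int, sample_size: int, sum_of_squares: float):
    """Derive mean and population standard deviation from the three
    exported values, via the identity var = E[x^2] - E[x]^2 -- which is
    why sum_of_squares is worth exporting alongside sum and sample_size."""
    if sample_size == 0:
        return 0.0, 0.0
    mean = total / sample_size
    variance = sum_of_squares / sample_size - mean ** 2
    # Clamp tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))

samples = [2, 4, 4, 4, 5, 5, 7, 9]
mean, stddev = derive_stats(sum(samples), len(samples),
                            sum(x * x for x in samples))
# mean == 5.0, stddev == 2.0
```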

Required-githooks: true
Change-Id: I5f5a6864cc400b1f146c18723a1a5dec8d0c3b2a
Signed-off-by: Michael MacDonald [email protected]

mjmac avatar Mar 13 '24 19:03 mjmac

Ticket title is 'Prometheus metrics do not include sample_size for stats metrics'
Status is 'In Review'
Errors are Unknown component
https://daosio.atlassian.net/browse/DAOS-15426

github-actions[bot] avatar Mar 13 '24 19:03 github-actions[bot]

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13979/2/display/redirect

daosbuild1 avatar Mar 13 '24 20:03 daosbuild1

OK, as usual, touching .py files in a PR increased the toil on this work, but it looks like it's all sorted out now:

Preparing to run the ./control/dmg_telemetry_basic.py test on repeat 1/1
[Test 10/31] Running the ./control/dmg_telemetry_basic.py test on repetition 1/1
JOB ID     : 1b78ea66e87980e8f5d8bf8152b6948a2a4766a8
JOB LOG    : /var/tmp/ftest/avocado/job-results/job-2024-03-14T16.46-1b78ea6/job.log
 (1/2) ./control/dmg_telemetry_basic.py:TestWithTelemetryBasic.test_telemetry_list;run-container-hosts-pool-server_config-engines-0-storage-0-test-timeouts-2f2e:  PASS (89.15 s)
 (2/2) ./control/dmg_telemetry_basic.py:TestWithTelemetryBasic.test_container_telemetry;run-container-hosts-pool-server_config-engines-0-storage-0-test-timeouts-2f2e:  CANCEL: Skipping until DAOS-8720 is fixed. (0.42 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 1
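As an aside, that RESULTS line is regular enough to parse mechanically when a script needs to assert on outcomes. A hypothetical sketch (not part of the test framework):

```python
def parse_results_line(line: str) -> dict:
    """Parse an avocado 'RESULTS : PASS 1 | ERROR 0 | ...' summary line
    into a {category: count} dict."""
    _, _, counts = line.partition(":")
    return {key: int(value)
            for field in counts.split("|")
            for key, value in [field.split()]}

summary = parse_results_line(
    "RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | "
    "INTERRUPT 0 | CANCEL 1")
# summary["PASS"] == 1, summary["CANCEL"] == 1
```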

That test uses TelemetryUtils.get_all_server_metrics() which returns the giant list of expected metric names that has been updated in this PR.
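The aggregation pattern behind a helper like that can be sketched as follows (the suffix names and both functions here are illustrative assumptions, not the actual TelemetryUtils code):

```python
# Hypothetical sketch of the pattern behind a helper like
# TelemetryUtils.get_all_server_metrics(): each metric family contributes
# its own list, so a new stat suffix only needs to be added in one place.
STAT_SUFFIXES = ["min", "max", "mean", "stddev",
                 "sum", "samplesize", "sumsquares"]

def expand_stats(base_names):
    """Expand each base metric name into one name per stat suffix."""
    return [f"{base}_{suffix}"
            for base in base_names for suffix in STAT_SUFFIXES]

def get_all_server_metrics_sketch():
    """Concatenate the per-family lists into one flat expected-names list."""
    io_latency = expand_stats(["engine_io_latency_update",
                               "engine_io_latency_fetch"])
    other = ["engine_started_at"]
    return io_latency + other
```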

mjmac avatar Mar 14 '24 16:03 mjmac

@daltonbohning, @phender: You guys good with the test changes? I think the helper makes this far more maintainable.

mjmac avatar Mar 14 '24 16:03 mjmac

Functional on EL 8.8 Test Results (old)

135 tests: 131 passed, 4 skipped, 0 failed · 41 suites · 41 files · 1h 39m 2s

Results for commit cc4c4b49.

github-actions[bot] avatar Mar 14 '24 17:03 github-actions[bot]

Functional on EL 9 Test Results (old)

135 tests: 131 passed, 4 skipped, 0 failed · 41 suites · 41 files · 1h 50m 54s

Results for commit cc4c4b49.

github-actions[bot] avatar Mar 14 '24 17:03 github-actions[bot]

> why the force push?

I pushed a commit to fix a typo in the merge commit, realized quickly that the fix had another typo, and force-pushed a fix for that into the top commit. I didn't force-push any of the earlier commits in the stack, so the "changes since last review" feature still works as usual.

mjmac avatar Mar 14 '24 18:03 mjmac

Functional Hardware Large Test Results (old)

64 tests: 64 passed, 0 skipped, 0 failed · 14 suites · 14 files · 28m 49s

Results for commit cc4c4b49.

github-actions[bot] avatar Mar 15 '24 13:03 github-actions[bot]

Overall LGTM. In general we have to be careful because some tests do this (arguably that's bad practice, but fixing it requires some rework). I think we're okay here, though, and several telemetry tests are in this PR:

https://github.com/daos-stack/daos/blob/29741271b88f76a11a6a648a7a7571ddf9da73b9/src/tests/ftest/telemetry/dkey_akey_enum_punch.py#L361-L362

Ugh, that's terrible. Would be better to create some helpers to pluck out the correct names or something like that.
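Something like a glob-based selector, say a `pluck_metric_names` helper (a hypothetical sketch, not code from this PR):

```python
import fnmatch

def pluck_metric_names(all_names, patterns):
    """Hypothetical helper: select expected metric names by glob pattern
    instead of hard-coding index ranges into the giant list."""
    return sorted({name for name in all_names
                   for pat in patterns if fnmatch.fnmatch(name, pat)})

names = ["engine_pool_ops_dkey_enum", "engine_pool_ops_akey_enum",
         "engine_pool_ops_update", "engine_io_latency_fetch_mean"]
enum_ops = pluck_metric_names(names, ["engine_pool_ops_*_enum"])
# enum_ops == ["engine_pool_ops_akey_enum", "engine_pool_ops_dkey_enum"]
```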

mjmac avatar Mar 15 '24 15:03 mjmac

Raaargh. The Functional HW-Medium testing stages ran, but failed to upload their results to GitHub.

HW-Medium: https://github.com/daos-stack/daos/actions/runs/8283417265/job/22672448346

All avocado tests passed!
...
2024-03-15 17:17:58 +0000 - publish -  INFO - Finished reading 34 files in 0.01 seconds
Traceback (most recent call last):
  File "/action/publish_test_results.py", line 546, in <module>
    main(settings, gha)
  File "/action/publish_test_results.py", line 269, in main
    Publisher(settings, gh, gha).publish(stats, results.case_results, conclusion)
  File "/action/publish/publisher.py", line 209, in __init__
    self._repo = gh.get_repo(self._settings.repo)
  File "/usr/local/lib/python3.8/site-packages/github/MainClass.py", line 380, in get_repo
    headers, data = self.__requester.requestJsonAndCheck("GET", url)
  File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 494, in requestJsonAndCheck
    return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
  File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 525, in __check
    raise self.createException(status, responseHeaders, data)
github.GithubException.BadCredentialsException: 401 {"message": "Bad credentials", "documentation_url": "https://docs.github.com/rest"}

The HW-Medium (Verbs) tests ran and mostly passed, but something crashed and then the status upload failed: https://github.com/daos-stack/daos/actions/runs/8283417265/job/22672448607

Interrupted avocado jobs detected!
ERROR: Core stack trace files detected!
...
2024-03-15 19:20:22 +0000 - publish -  INFO - Reading JUnit XML files Functional Hardware Medium Verbs Provider/**/results.xml (7 files, 222.5 KiB)
2024-03-15 19:20:22 +0000 - publish -  INFO - Finished reading 7 files in 0.01 seconds
Traceback (most recent call last):
  File "/action/publish_test_results.py", line 546, in <module>
    main(settings, gha)
  File "/action/publish_test_results.py", line 269, in main
    Publisher(settings, gh, gha).publish(stats, results.case_results, conclusion)
  File "/action/publish/publisher.py", line 209, in __init__
    self._repo = gh.get_repo(self._settings.repo)
  File "/usr/local/lib/python3.8/site-packages/github/MainClass.py", line 380, in get_repo
    headers, data = self.__requester.requestJsonAndCheck("GET", url)
  File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 494, in requestJsonAndCheck
    return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url)))
  File "/usr/local/lib/python3.8/site-packages/github/Requester.py", line 525, in __check
    raise self.createException(status, responseHeaders, data)
github.GithubException.BadCredentialsException: 401 {"message": "Bad credentials", "documentation_url": "https://docs.github.com/rest"}
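If the publish step wanted to handle this better, a simple triage of the HTTP status would separate the unrecoverable 401 (bad or expired token, so retrying the same request is pointless) from transient server errors. A hypothetical sketch, not the actual publish-action code:

```python
def classify_publish_failure(status: int) -> str:
    """Triage an HTTP error from the results-publish step.

    401 means the GITHUB_TOKEN was bad or expired: fail fast and say so.
    429 and 5xx are usually transient, so a retry is worthwhile.
    Anything else is treated as a permanent failure.
    """
    if status == 401:
        return "fatal: refresh credentials"
    if status == 429 or 500 <= status < 600:
        return "retry"
    return "fatal"
```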

Downloaded the archive and looked at the crash... Seems like the usual orterun not liking to be cancelled thing:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f59b77e6638 in PMIx_Finalize ()
   from /usr/mpi/gcc/openmpi-4.1.5a1/lib64/libpmix.so.2
[Current thread is 1 (Thread 0x7f59bd65f6c0 (LWP 102641))]

No other stack traces in the archive. As for the actual failure, looks like the following:

2024/03/15 07:20:00 DEBUG                      run_local:     2024-03-15 15:43:52,667 test             L0530 INFO | START 18-./daos_test/suite.py:DaosCoreTest.test_daos_rebuild_simple;run-agent_config-transport_config-daos_tests-args-daos_test-num_clients
-pools_created-scalable_endpoint-stopped_ranks-test_name-dmg-hosts-pool-server_config-engines-0-storage-0-1-timeouts-c53b
2024/03/15 07:20:00 DEBUG                      run_local:     2024-03-15 15:43:53,150 test             L0479 INFO | ==> Step 1: setUp(): Starting servers [elapsed since last step: 0.49s]
2024/03/15 07:20:00 DEBUG                      run_local:     2024-03-15 15:43:57,710 test             L0479 INFO | ==> Step 2: setUp(): Starting agents [elapsed since last step: 4.56s]
2024/03/15 07:20:00 DEBUG                      run_local:     2024-03-15 15:43:58,216 test             L0479 INFO | ==> Step 3: setUp(): Destroying any existing pools before the test [elapsed since last step: 0.51s]
2024/03/15 07:20:00 DEBUG                      run_local:     2024-03-15 15:43:59,319 test             L0479 INFO | ==> Step 4: setUp(): Setup complete [elapsed since last step: 1.10s]
2024/03/15 07:20:00 DEBUG                      run_local:     2024-03-15 16:15:23,117 test             L0479 INFO | ==> Step 5: tearDown(): Called due to exceeding the 1890s test timeout [elapsed since last step: 1883.80s]

No exact matches in JIRA, but a few similar issues. In any case, I'm reasonably confident that this failure in the rebuild tests is not due to my adding a few stats metrics in the Prometheus exporter.
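For timeouts like this, the "[elapsed since last step: ...]" markers make it possible to locate the hang mechanically rather than by eyeballing timestamps. A quick sketch (the `slowest_step` helper is hypothetical, run here against the log lines above):

```python
import re

STEP_RE = re.compile(
    r"==> (Step \d+: .*?) \[elapsed since last step: ([\d.]+)s\]")

def slowest_step(log_lines):
    """Return (elapsed_seconds, step_label) for the largest inter-step gap.
    The hang happened during the gap, i.e. before the step reporting it."""
    steps = [(float(m.group(2)), m.group(1))
             for line in log_lines if (m := STEP_RE.search(line))]
    return max(steps) if steps else None

log_lines = [
    "==> Step 4: setUp(): Setup complete [elapsed since last step: 1.10s]",
    "==> Step 5: tearDown(): Called due to exceeding the 1890s test timeout "
    "[elapsed since last step: 1883.80s]",
]
worst = slowest_step(log_lines)
# worst == (1883.8, "Step 5: tearDown(): Called due to exceeding the 1890s test timeout")
```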

mjmac avatar Mar 15 '24 20:03 mjmac

@daos-stack/daos-gatekeeper: Please force-land this when possible.

mjmac avatar Mar 15 '24 20:03 mjmac