Add API latency and call counts metrics to dashboard APIs

Open alanwguo opened this issue 2 years ago • 5 comments

Signed-off-by: Alan Guo [email protected]

Why are these changes needed?

Adds basic latency and call count metrics for dashboard API endpoints. This will make it easier to debug cases where the dashboard APIs are unresponsive or slow.
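As a rough illustration of the idea (not the code in this PR), per-endpoint latency and call counts can be recorded by wrapping each handler. The sketch below uses only the standard library; `with_metrics`, `call_counts`, `latencies_ms`, and the `/api/example` route are hypothetical names, and the in-memory dictionaries stand in for whatever metrics exporter the dashboard actually uses.

```python
import asyncio
import functools
import time
from collections import defaultdict

# Hypothetical in-memory stores, used here only for illustration.
# A real setup would report to a metrics library instead.
call_counts = defaultdict(int)
latencies_ms = defaultdict(list)


def with_metrics(endpoint):
    """Wrap an async handler to record its call count and latency."""

    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return await handler(*args, **kwargs)
            finally:
                # Record even when the handler raises, so that failing
                # or slow endpoints still show up in the metrics.
                elapsed_ms = (time.monotonic() - start) * 1000
                call_counts[endpoint] += 1
                latencies_ms[endpoint].append(elapsed_ms)

        return wrapper

    return decorator


@with_metrics("/api/example")  # hypothetical route name
async def example_handler():
    await asyncio.sleep(0.01)  # stand-in for real work
    return "ok"


async def main():
    # Call the wrapped handler once and return its result.
    return await example_handler()
```

Running `asyncio.run(main())` calls the handler once, after which `call_counts` and `latencies_ms` each hold one entry for `/api/example`. Recording in a `finally` block is a deliberate choice here: requests that fail are often the ones worth measuring.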

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

alanwguo avatar Sep 03 '22 01:09 alanwguo

@rkooo567 @rickyyx, this is a simple PR to add Prometheus metrics to the dashboard APIs. It should work, but I'm not seeing the custom metrics on the metrics export port.

Any ideas what's wrong? I do see the middleware is being called because I see the debug statements.

Also, instead of using Ray custom metrics, should I be using CythonHistogram / CythonCounter directly so that this is a "default metric" rather than a "custom metric"?

alanwguo avatar Sep 03 '22 01:09 alanwguo

Yeah, I was able to reproduce what you described on my end as well.

Does running python ray/doc/source/ray-observability/doc_code/metrics_example.py produce the desired metrics for you? I am seeing the example work.

rickyyx avatar Sep 06 '22 17:09 rickyyx

Can you update the PR description?

rkooo567 avatar Sep 14 '22 13:09 rkooo567

Lmk when tests pass and the comments are addressed!

rkooo567 avatar Sep 14 '22 15:09 rkooo567

@richardliaw @ericl ready for re-review

alanwguo avatar Sep 22 '22 01:09 alanwguo