
move ovnkube_node_cni_request_duration metrics to linear buckets

Open girishmg opened this issue 10 months ago • 2 comments

The buckets defined for this metric were prometheus.ExponentialBuckets(.1, 2, 15), which expands to: [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, ...]
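
For reference, a minimal sketch (assuming the standard Prometheus Go client, github.com/prometheus/client_golang) that reproduces the full expansion, including the tail:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// ExponentialBuckets(start, factor, count) returns `count` upper
	// bounds, each `factor` times the previous one; the 15th bound is
	// 0.1 * 2^14 = 1638.4 seconds.
	fmt.Println(prometheus.ExponentialBuckets(0.1, 2, 15))
	// Output:
	// [0.1 0.2 0.4 0.8 1.6 3.2 6.4 12.8 25.6 51.2 102.4 204.8 409.6 819.2 1638.4]
}
```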

This doesn't provide accurate information for analysis: we found that most of the P99 latency falls in the range between 1.6 and 6.4 seconds, so we only ever see the values 1.6, 3.2, and 6.4, while it would be better to see whether the CNI request duration is, for example, 2, 3, 4, or 5 seconds.

Furthermore, our goal is to provide networking to Pods sooner than 15s, not pow(2, 15)!

This patch changes the metric to use prometheus.LinearBuckets instead of ExponentialBuckets.
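
The exact linear parameters aren't quoted above, so the following is only an illustrative sketch of the shape of the change; the start, width, and count are assumptions:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical parameters: LinearBuckets(start, width, count) with
	// 1-second-wide buckets gives uniform resolution in the 1.6-6.4s
	// range where the P99 falls, so 2s, 3s, and 5s land in distinct buckets.
	fmt.Println(prometheus.LinearBuckets(1, 1, 15))
	// Output: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
}
```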

Authored-by: Han Zhou [email protected]

girishmg · Apr 09 '24 03:04

Deploy Preview for subtle-torrone-bb0c84 ready!

Latest commit: 0e07b9adfe178e83c23dc49ad4f7aa2e6c4b3b91
Latest deploy log: https://app.netlify.com/sites/subtle-torrone-bb0c84/deploys/6614b925dbca8700078bf614
Deploy Preview: https://deploy-preview-4263--subtle-torrone-bb0c84.netlify.app

netlify[bot] · Apr 09 '24 03:04

Coverage Status

Changes unknown when pulling eb5374a7338a7c090036a6daca4b32e430cbd768 on girishmg:us-move-to-linear into ovn-org:master.

coveralls · Apr 09 '24 04:04

Consulting our scale testing team on the range of values seen for this metric.

martinkennelly · May 29 '24 08:05

Hey @girishmg, I consulted our scale team and they said P99 is around 2s for the kubelet-density workload [1] and 13s for cluster-density, without specifying how large the cluster was.

However, for a 500-node cluster, they said they saw 22s being reported.

I think 60 time series for one metric wouldn't be recommended downstream for us.

Ideally, I'd like around 15 time series, and also to consider scaling increases in the future (buffer for further latency). It's not easy to fit all the requirements though... any ideas?

[1] https://github.com/kube-burner/kube-burner/tree/main/examples/workloads
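
One way to reconcile the ~15-series budget with the 22s seen at 500 nodes is a hand-picked bucket slice rather than a purely linear or exponential one. This is an editorial sketch, not something proposed in the thread; the metric name, help text, and every bound below are assumptions:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical layout: 1s resolution up to 8s (where most P99s were
// reported), then progressively wider buckets out to 60s for headroom on
// large clusters. The 15 explicit bounds cost 15 time series per label
// set; Prometheus adds the +Inf bucket and _sum/_count series on top.
var cniRequestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "ovnkube_node_cni_request_duration_seconds", // assumed name
	Help:    "Duration of CNI requests handled by ovnkube-node (seconds).",
	Buckets: []float64{1, 2, 3, 4, 5, 6, 7, 8, 10, 13, 17, 22, 30, 45, 60},
})

func main() {
	prometheus.MustRegister(cniRequestDuration)
	cniRequestDuration.Observe(2.7) // e.g. a 2.7s CNI ADD request
}
```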

martinkennelly · May 30 '24 14:05

This PR is stale because it has been open 90 days with no activity. Remove the stale label or comment, reach out to maintainers for a code review, or consider closing this if you do not plan to work on it.

github-actions[bot] · Aug 29 '24 01:08