ovn-kubernetes
move ovnkube_node_cni_request_duration metrics to linear buckets
The buckets defined for this metric were: prometheus.ExponentialBuckets(.1, 2, 15), which expands to: [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2, 102.4, ...]
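For reference, the expansion above can be reproduced without pulling in the Prometheus client. This is a minimal sketch: exponentialBuckets is a hypothetical stand-in that assumes the documented behavior of prometheus.ExponentialBuckets (the first bound equals start, and each later bound is the previous one multiplied by factor). It is not the library's code.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical stand-in (not the library function).
// It assumes the documented behavior of prometheus.ExponentialBuckets:
// count upper bounds, the first equal to start, each following bound
// multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	buckets := exponentialBuckets(0.1, 2, 15)

	// Print every bound, then the largest finite one, in seconds.
	fmt.Println(buckets)
	fmt.Println("largest finite bound (seconds):", buckets[len(buckets)-1])

	// Count how many bounds fall inside the 1.6s to 6.4s range
	// where most P99 values were observed.
	inRange := 0
	for _, b := range buckets {
		if b >= 1.6 && b <= 6.4 {
			inRange++
		}
	}
	fmt.Println("bounds between 1.6s and 6.4s:", inRange)
}
```

Under these assumptions the 15th bound is 0.1 * 2^14 = 1638.4 seconds, and only three bounds land in the range of interest.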
This doesn't provide accurate information for analysis: we found that most P99 latencies fall between 1.6 and 6.4 seconds, so the only values we ever see are 1.6, 3.2, and 6.4. It would be better to see whether the CNI request duration is, for example, 2, 3, 4, or 5 seconds.
Furthermore, our goal is to provide networking to Pods sooner than 15s instead of pow(2, 15)!
This patch changes the metric to use LinearBuckets instead of ExponentialBuckets.
Authored-by: Han Zhou [email protected]
Deploy Preview for subtle-torrone-bb0c84 ready!
Name | Link |
---|---|
Latest commit | 0e07b9adfe178e83c23dc49ad4f7aa2e6c4b3b91 |
Latest deploy log | https://app.netlify.com/sites/subtle-torrone-bb0c84/deploys/6614b925dbca8700078bf614 |
Deploy Preview | https://deploy-preview-4263--subtle-torrone-bb0c84.netlify.app |
Changes unknown when pulling eb5374a7338a7c090036a6daca4b32e430cbd768 on girishmg:us-move-to-linear into ovn-org:master.
Consulting our scale testing team on the range of values seen for this metric.
Hey @girishmg, I consulted our scale team and they said P99 is around 2s for kubelet density [1] and 13s for cluster-density, without specifying how large the cluster was.
However, for a 500-node cluster, they said they saw 22s being reported.
I don't think 60 time series for one metric would be recommended downstream for us.
Ideally, I'd like around 15 time series, with some buffer for latency growing as scale increases in the future. It's not easy to fit all of these requirements, though. Any ideas?
[1] https://github.com/kube-burner/kube-burner/tree/main/examples/workloads
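The trade-off between bucket count and covered range can be sketched with a little arithmetic. This is an illustration only: linearBucketCount is a hypothetical helper, the 22s target is the largest value quoted above, and the candidate widths are assumptions rather than proposals from this thread. Each finite bound becomes one bucket series per label combination, in addition to the +Inf bucket and the sum and count series that a Prometheus histogram exposes.

```go
package main

import (
	"fmt"
	"math"
)

// linearBucketCount is a hypothetical helper: it returns how many linear
// buckets, starting at start and spaced width apart, are needed so that
// the largest finite bound is at least limit.
func linearBucketCount(start, width, limit float64) int {
	if limit <= start {
		return 1
	}
	return int(math.Ceil((limit-start)/width)) + 1
}

func main() {
	const limit = 22.0 // seconds; largest P99 quoted in this thread

	// Candidate widths are assumptions for illustration. Each layout
	// starts at its own width, so the first bound equals the spacing.
	for _, width := range []float64{0.5, 1, 1.5, 2} {
		n := linearBucketCount(width, width, limit)
		top := width + float64(n-1)*width
		fmt.Printf("width %.1fs: %d finite buckets, top bound %.1fs\n", width, n, top)
	}
}
```

Covering 22s at 1s resolution already needs 22 finite buckets under these assumptions, so staying near 15 series means either coarser spacing or a lower top bound.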
This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or reach out to maintainers for code reviews or consider closing this if you do not plan to work on it.