add metric(s) for status update load
**Please describe the problem you have**

When debugging issues like #5001 it isn't currently clear from metrics how the status update components of Contour are working. This includes knowing the frequency/duration/etc. of status updates so we can find where bottlenecks might be. Updating status is often a resource-intensive operation for the leader instance of Contour, so the more information we can get, the better.
We should add some metrics that can help us get a picture of what is happening. Some ideas (a rough sketch of possible definitions follows the lists below):
- [x] status update count total (per resource kind)
- [x] status update count succeeded (per resource kind)
- [x] status update failure count (per resource kind)
- [x] summary/histogram of status update duration (segmented between failures and success?)
- [ ] number of status updates per DAG run
- [ ] total duration of status updates per DAG run
Other ideas to look at:
- number of status updates (excluding no-ops) per DAG run
- total duration of status updates per DAG run
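
For illustration, here is a minimal client_golang sketch of what the checked items above could look like. The metric names, label names, and helper function are placeholders for the sketch, not necessarily what Contour ends up exporting:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Total status updates attempted, per resource kind.
	statusUpdateTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "contour_status_update_total",
			Help: "Total number of status updates by object kind.",
		},
		[]string{"kind"},
	)

	// Successful and failed status updates, per resource kind.
	statusUpdateSuccess = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "contour_status_update_success_total",
			Help: "Number of status updates that succeeded, by object kind.",
		},
		[]string{"kind"},
	)
	statusUpdateFailed = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "contour_status_update_failed_total",
			Help: "Number of status updates that failed, by object kind.",
		},
		[]string{"kind"},
	)

	// Duration of status updates, segmented by outcome.
	statusUpdateDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "contour_status_update_duration_seconds",
			Help:    "How long status updates take, by object kind and outcome.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"kind", "success"},
	)
)

func init() {
	prometheus.MustRegister(statusUpdateTotal, statusUpdateSuccess, statusUpdateFailed, statusUpdateDuration)
}

// RecordStatusUpdate is a hypothetical helper the status updater could call
// around each write to the API server.
func RecordStatusUpdate(kind string, start time.Time, err error) {
	statusUpdateTotal.WithLabelValues(kind).Inc()
	success := "true"
	if err != nil {
		success = "false"
		statusUpdateFailed.WithLabelValues(kind).Inc()
	} else {
		statusUpdateSuccess.WithLabelValues(kind).Inc()
	}
	statusUpdateDuration.WithLabelValues(kind, success).Observe(time.Since(start).Seconds())
}
```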
> number of status updates (excluding no-ops) per DAG run

I think this is a little tricky to aggregate/visualize "per DAG run"; would we want to capture a distribution for it? With what we get in https://github.com/projectcontour/contour/pull/5037, the alternative would be to look at the counters status_update_total - status_update_noop_total (I think we could correlate those with plotting contour_dagrebuild_total to figure out when DAG rebuilds happen; would have to experiment a bit)
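
For reference, a sketch of the kind of query that comparison implies, wrapped in a small Go program using the Prometheus HTTP API client. The metric names here are the ones discussed in this thread and are assumptions; they should be checked against what #5037 actually exports:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes a Prometheus server scraping Contour is reachable locally.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// "Meaningful" (non-no-op) status updates per second, divided by DAG
	// rebuilds per second, approximates status updates per DAG rebuild.
	query := `(sum(rate(contour_status_update_total[5m])) - sum(rate(contour_status_update_noop_total[5m])))
	          / sum(rate(contour_dagrebuild_total[5m]))`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```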
> total duration of status updates per DAG run

I wonder if we can extrapolate this indirectly from contour_status_update_duration_seconds_sum etc., or if it's just better to collect the total metric; will have to play around a bit
> look at the counters status_update_total - status_update_noop_total (I think we could correlate those with plotting contour_dagrebuild_total to figure out when DAG rebuilds happen; would have to experiment a bit)
Yeah, that's fair, that would probably get us close enough anyway.
> I wonder if we can extrapolate this indirectly from contour_status_update_duration_seconds_sum etc., or if it's just better to collect the total metric; will have to play around a bit

Yeah, if we can find a way to derive it, that's fine. I think this would just help indicate whether the queue is backing up or not: individual status updates would still be ~quick, but if there are a large number in the queue and we're only doing 5 per second, then updates will get backed up.
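
To make that concrete, a couple of hypothetical PromQL expressions (kept as Go constants to match the other sketches) that could approximate the "total duration per DAG run" signal. The metric names are assumptions and should be checked against /metrics:

```go
package main

// Hypothetical PromQL for deriving queue health from the duration metric.
const (
	// Average time spent writing a single status update.
	avgStatusUpdateDuration = `sum(rate(contour_status_update_duration_seconds_sum[5m]))
	                           / sum(rate(contour_status_update_duration_seconds_count[5m]))`

	// Approximate total time spent on status updates per DAG rebuild:
	// seconds of status-update work per second, divided by rebuilds per second.
	statusUpdateSecondsPerRebuild = `sum(rate(contour_status_update_duration_seconds_sum[5m]))
	                                 / sum(rate(contour_dagrebuild_total[5m]))`
)

func main() {} // constants shown for reference only
```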
Doing the above in a follow-up PR ^
Bumping to 1.27
Added some tasks to the checklist above to capture in the main issue description what has been suggested in the comments, so we can keep this around and work on it in the future.
picking this back up
For status updates per DAG rebuild:
- create 3 httpproxies (2 invalid), sleep for 2s, delete the 3 httpproxies, sleep for 1s
- rate of status updates overall is ~8 and rate of no-op status updates is ~5, so the difference is ~3
- rate of DAG rebuilds is ~1.2
- that means ~3 / 1.2 ≈ 2.5 status updates per DAG rebuild, which makes sense given `kubectl apply|delete` are creating/deleting things close together
re: status update duration per DAG run

It's a little tricky to measure this directly, since once the DAG is built we pass the status updates off from the event handler to the status update handler, which uses a channel of size 100 to queue the status updates.
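
For context, here is a minimal sketch (not Contour's actual code) of that hand-off: the event handler enqueues computed statuses on a bounded channel and a separate goroutine writes them out, which is why per-update duration alone doesn't directly give a per-DAG-run total. The statusUpdate type and writeStatus function are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// statusUpdate is a stand-in for the real status update type.
type statusUpdate struct {
	kind, name string
}

// writeStatus simulates the API server write; the experiment described below
// injected a sleep at roughly this point to create a slow-update scenario.
func writeStatus(u statusUpdate) {
	time.Sleep(50 * time.Millisecond)
}

func main() {
	// Buffered channel standing in for the size-100 queue between the event
	// handler and the status update handler.
	updates := make(chan statusUpdate, 100)
	done := make(chan struct{})

	// Status update handler: drains the queue and writes each update,
	// which is where the duration histogram would be observed.
	go func() {
		defer close(done)
		for u := range updates {
			start := time.Now()
			writeStatus(u)
			fmt.Printf("updated %s/%s in %s\n", u.kind, u.name, time.Since(start))
		}
	}()

	// Event handler side: after a DAG rebuild, enqueue the computed statuses.
	// If writes are slow and the channel fills up, this send blocks, which is
	// how a backed-up queue would manifest.
	for i := 0; i < 5; i++ {
		updates <- statusUpdate{kind: "HTTPProxy", name: fmt.Sprintf("proxy-%d", i)}
	}
	close(updates)
	<-done
}
```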
I artificially set up a slow status update scenario by adding a sleep to the status update handler (right before status updates are actually written), and I can see an observable slowdown in how quickly newly created httpproxies are updated with valid/invalid status (creating/deleting 100 of them in a loop).
The slowdown shows up both in the rate of status updates and in the distribution of status update durations. I think this should be sufficient to diagnose any issues in the status update queue: if DAG rebuild duration is low and there is a slowdown here, we can be pretty confident this is the issue.
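
For future debugging, the two signals mentioned here could be watched with queries along these lines (hypothetical metric names, shown as Go constants like the earlier sketches, and assuming the duration metric is a histogram):

```go
package main

// Hypothetical PromQL for the two signals described above.
const (
	// Rate of status updates actually being written.
	statusUpdateRate = `sum(rate(contour_status_update_total[5m]))`

	// 95th percentile status update duration.
	statusUpdateP95 = `histogram_quantile(0.95,
		sum by (le) (rate(contour_status_update_duration_seconds_bucket[5m])))`
)

func main() {} // constants shown for reference only
```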
closing this issue as completed for now, but if anyone has concerns we can reopen!