add metric(s) for status update load
**Please describe the problem you have**

When debugging issues like #5001 it isn't currently clear from metrics how the status update components of Contour are working. This includes knowing the frequency/duration/etc. of status updates so we can find where bottlenecks might be. Updating status is often a resource-intensive operation for the leader instance of Contour, so the more information we can get, the better.
We should add some metrics that can help us get a picture of what is happening. Some ideas (a rough sketch of possible definitions follows the lists below):
- [x] status update count total (per resource kind)
- [x] status update count succeeded (per resource kind)
- [x] status update failure count (per resource kind)
- [x] summary/histogram of status update duration (segmented between failures and success?)
- [ ] number of status updates per DAG run
- [ ] total duration of status updates per DAG run
Other ideas to look at:
- number of status updates (excluding no-ops) per DAG run
- total duration of status updates per DAG run
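
For illustration, here is a minimal client_golang sketch of what the checked items above could look like. The metric names, label names, and helper function are placeholders for the sketch, not necessarily what Contour ends up exporting:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Total status updates attempted, per resource kind.
	statusUpdateTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "contour_status_update_total",
			Help: "Total number of status updates by object kind.",
		},
		[]string{"kind"},
	)

	// Successful and failed status updates, per resource kind.
	statusUpdateSuccess = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "contour_status_update_success_total",
			Help: "Number of status updates that succeeded, by object kind.",
		},
		[]string{"kind"},
	)
	statusUpdateFailed = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "contour_status_update_failed_total",
			Help: "Number of status updates that failed, by object kind.",
		},
		[]string{"kind"},
	)

	// Duration of status updates, segmented by outcome.
	statusUpdateDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "contour_status_update_duration_seconds",
			Help:    "How long status updates take, by object kind and outcome.",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"kind", "success"},
	)
)

func init() {
	prometheus.MustRegister(statusUpdateTotal, statusUpdateSuccess, statusUpdateFailed, statusUpdateDuration)
}

// RecordStatusUpdate is a hypothetical helper the status updater could call
// around each write to the API server.
func RecordStatusUpdate(kind string, start time.Time, err error) {
	statusUpdateTotal.WithLabelValues(kind).Inc()
	success := "true"
	if err != nil {
		success = "false"
		statusUpdateFailed.WithLabelValues(kind).Inc()
	} else {
		statusUpdateSuccess.WithLabelValues(kind).Inc()
	}
	statusUpdateDuration.WithLabelValues(kind, success).Observe(time.Since(start).Seconds())
}
```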
> number of status updates (excluding no-ops) per DAG run

I think this is a little tricky to aggregate/visualize "per DAG run"; would we want to capture a distribution for it? With what we get in https://github.com/projectcontour/contour/pull/5037, the alternative would be to look at the counters status_update_total - status_update_noop_total (I think we could correlate those with plotting contour_dagrebuild_total to figure out when DAG rebuilds happen; would have to experiment a bit)
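
For reference, a sketch of the kind of query that comparison implies, wrapped in a small Go program using the Prometheus HTTP API client. The metric names here are the ones discussed in this thread and are assumptions; they should be checked against what #5037 actually exports:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes a Prometheus server scraping Contour is reachable locally.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// "Meaningful" (non-no-op) status updates per second, divided by DAG
	// rebuilds per second, approximates status updates per DAG rebuild.
	query := `(sum(rate(contour_status_update_total[5m])) - sum(rate(contour_status_update_noop_total[5m])))
	          / sum(rate(contour_dagrebuild_total[5m]))`

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```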
> total duration of status updates per DAG run

I wonder if we can extrapolate this indirectly from contour_status_update_duration_seconds_sum etc., or if it's just better to collect the total metric; will have to play around a bit
> look at the counters status_update_total - status_update_noop_total (I think we could correlate those with plotting contour_dagrebuild_total to figure out when DAG rebuilds happen; would have to experiment a bit)
Yeah, that's fair, that would probably get us close enough anyway.
> I wonder if we can extrapolate this indirectly from contour_status_update_duration_seconds_sum etc., or if it's just better to collect the total metric; will have to play around a bit

Yeah, if we can find a way to derive it, that's fine. I think this would just help indicate whether the queue is backing up or not: individual status updates would still be ~quick, but if there are a large number in the queue and we're only doing 5 per second, then updates will get backed up.
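
To make that concrete, a couple of hypothetical PromQL expressions (kept as Go constants to match the other sketches) that could approximate the "total duration per DAG run" signal. The metric names are assumptions and should be checked against /metrics:

```go
package main

// Hypothetical PromQL for deriving queue health from the duration metric.
const (
	// Average time spent writing a single status update.
	avgStatusUpdateDuration = `sum(rate(contour_status_update_duration_seconds_sum[5m]))
	                           / sum(rate(contour_status_update_duration_seconds_count[5m]))`

	// Approximate total time spent on status updates per DAG rebuild:
	// seconds of status-update work per second, divided by rebuilds per second.
	statusUpdateSecondsPerRebuild = `sum(rate(contour_status_update_duration_seconds_sum[5m]))
	                                 / sum(rate(contour_dagrebuild_total[5m]))`
)

func main() {} // constants shown for reference only
```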
Doing the above in a follow-up PR ^
Bumping to 1.27
Added some tasks to the checklist above to capture in the main issue description what has been suggested in the comments, so we can keep this around and work on it in the future.
picking this back up
For status updates per DAG rebuild:
- create 3 httpproxies (2 invalid), sleep for 2s, delete the 3 httpproxies, sleep for 1s
- rate of status updates overall is ~8 and rate of no-op status updates is ~5, so the difference is ~3
- rate of DAG rebuilds is ~1.2
- that means ~3 / 1.2 ≈ 2.5 status updates per DAG rebuild, which makes sense given `kubectl apply|delete` are creating/deleting things close together
re: status update duration per DAG run

It's a little tricky to measure this directly, since once the DAG is built we pass the status updates off from the event handler to the status update handler, which uses a channel of size 100 to queue the status updates.
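
For context, here is a minimal sketch (not Contour's actual code) of that hand-off: the event handler enqueues computed statuses on a bounded channel and a separate goroutine writes them out, which is why per-update duration alone doesn't directly give a per-DAG-run total. The statusUpdate type and writeStatus function are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// statusUpdate is a stand-in for the real status update type.
type statusUpdate struct {
	kind, name string
}

// writeStatus simulates the API server write; the experiment described below
// injected a sleep at roughly this point to create a slow-update scenario.
func writeStatus(u statusUpdate) {
	time.Sleep(50 * time.Millisecond)
}

func main() {
	// Buffered channel standing in for the size-100 queue between the event
	// handler and the status update handler.
	updates := make(chan statusUpdate, 100)
	done := make(chan struct{})

	// Status update handler: drains the queue and writes each update,
	// which is where the duration histogram would be observed.
	go func() {
		defer close(done)
		for u := range updates {
			start := time.Now()
			writeStatus(u)
			fmt.Printf("updated %s/%s in %s\n", u.kind, u.name, time.Since(start))
		}
	}()

	// Event handler side: after a DAG rebuild, enqueue the computed statuses.
	// If writes are slow and the channel fills up, this send blocks, which is
	// how a backed-up queue would manifest.
	for i := 0; i < 5; i++ {
		updates <- statusUpdate{kind: "HTTPProxy", name: fmt.Sprintf("proxy-%d", i)}
	}
	close(updates)
	<-done
}
```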
I artificially set up a slow status update scenario by adding a sleep to the status update handler (right before status updates are actually written), and I can see an observable slowdown in how quickly newly created httpproxies are updated with valid/invalid status (creating/deleting 100 of them in a loop).
The slowdown shows up both in the rate of status updates and in the distribution of status update durations. I think this should be sufficient to diagnose any issues in the status update queue: if DAG rebuild duration is low and there is a slowdown here, we can be pretty confident this is the issue.
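
For future debugging, the two signals mentioned here could be watched with queries along these lines (hypothetical metric names, shown as Go constants like the earlier sketches, and assuming the duration metric is a histogram):

```go
package main

// Hypothetical PromQL for the two signals described above.
const (
	// Rate of status updates actually being written.
	statusUpdateRate = `sum(rate(contour_status_update_total[5m]))`

	// 95th percentile status update duration.
	statusUpdateP95 = `histogram_quantile(0.95,
		sum by (le) (rate(contour_status_update_duration_seconds_bucket[5m])))`
)

func main() {} // constants shown for reference only
```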
closing this issue as completed for now, but if anyone has concerns we can reopen!