Workqueue depth and duration metrics are not accurate
What happened?
During the scalability testing effort leading to https://github.com/crossplane-contrib/provider-kubernetes/pull/203, I noticed that controller-runtime metrics such as `workqueue_depth` and `workqueue_queue_duration_seconds` are not accurate, i.e. they do not reflect the state of the system as expected.
See the workqueue depth and duration graphs for "1m, 10" here.
How can we reproduce it?
Check out https://github.com/turkenh/provider-kubernetes-scalability/tree/repro-xp-no-metric, then run:

```shell
just setup
just create_x_objects 1 1000
just help_launch_prometheus
just help_launch_grafana
# import the dashboard JSON there
```
What environment did it happen in?
Crossplane version: v1.14.5
Provider Kubernetes: v0.11.4
@turkenh Do you have any theories on why depth is inaccurate? Per https://github.com/crossplane/crossplane/pull/5415#issuecomment-1955972430, I think we already have a good theory for why duration would be inaccurate.
Ah - maybe rate limiters calling AddAfter (as in "add to the queue after some duration") mean reconciles spend time in limbo: not yet being processed, but also not yet counted in the work queue?