traefik-proxy
explore activity tracking via metrics
Proposed change
The only feature CHP (configurable-http-proxy) has that traefik does not is network-level activity tracking. This isn't generally critical, as activity tracking is now published from the single-user server (this was added mainly to enable traefik in the first place).
The one situation where this is required is unauthenticated BinderHub, where single-user servers do not report their activity because they are not actually jupyterhub servers. The result is that BinderHub without auth cannot enable idle-culling with a traefik proxy, because all servers are always considered idle if they don't report any activity.
We don't have the same hooks into traefik that we have with CHP, but traefik does have metrics, which may provide good enough information.
Alternative options
Who would use this feature?
JupyterHub deployments that wish to use traefik with a default BinderHub (or any other alternative single-user server implementation that may not implement internal activity tracking). For example: mybinder.org.
(Optional): Suggest a solution
If we scrape a traefik metrics endpoint, e.g. prometheus, I believe we can get a low-resolution 'did anything happen' metric, which ought to be good enough. I think we can infer that if any of the router metrics for a given server have changed, there has been activity since the last check.
This should be off by default, because it is only really useful in the BinderHub case (or similar), and may be potentially costly.
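To make the 'did anything happen' check concrete, here's a rough sketch of what a polling implementation could look like. The metrics URL, the `traefik_router_requests_bytes_total` metric, and the `router` label are assumptions for illustration, and `prometheus_client` is only used to parse the exposition format:

```python
# Rough sketch: poll traefik's prometheus endpoint and treat any change in a
# router's request counter as activity. URL, metric name, and 'router' label
# are assumptions for illustration.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRIC_NAME = "traefik_router_requests_bytes_total"


class TraefikActivityChecker:
    def __init__(self, metrics_url: str = "http://traefik:8082/metrics"):
        self.metrics_url = metrics_url
        self._previous: dict[str, float] = {}

    def routers_with_activity(self) -> set[str]:
        """Return routers whose counters changed since the previous poll.

        Routers seen for the first time count as active.
        """
        text = requests.get(self.metrics_url, timeout=10).text
        current: dict[str, float] = {}
        for family in text_string_to_metric_families(text):
            for sample in family.samples:
                if sample.name != METRIC_NAME:
                    continue
                router = sample.labels.get("router", "")
                # sum over the remaining labels (code, method, ...) per router
                current[router] = current.get(router, 0.0) + sample.value
        active = {
            router
            for router, value in current.items()
            if value != self._previous.get(router)
        }
        self._previous = current
        return active
```

The point is just that a per-router counter snapshot plus a diff against the previous poll is all the state this approach needs.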
Would the implementation you're considering enable the proxy class to report activity via `get_all_routes` back to jupyterhub, so that jupyterhub-idle-culler can make better informed decisions via jupyterhub's REST API?
I'm asking because I'd like jupyterhub-idle-culler's readme to reference an issue tracking traefik-proxy's current inability to report network activity on its routes.
Yes, that's exactly what I'm thinking, and specifically for the mybinder.org case where `jupyterhub-singleuser` is not involved. Where `jupyterhub-singleuser` is involved, activity tracking should already be better than proxy activity tracking, so this should not be necessary. I'm mainly thinking about mybinder.org, where the proxy is the only source of `last_activity` data. The other solution for mybinder.org is an activity proxy sidecar for user pods, which would also solve this problem.
I looked into this a bit today, and was able to collect the `traefik_router_requests_bytes_total` metric to check whether any data has passed through. But replicas make this quite a bit more complicated, because metrics are per-replica (and replicas are kind of the reason we are trying to use traefik in the first place). We'd have to aggregate across replicas (sketched roughly below), which means:
- discovery of replicas
- scraping and aggregating across multiple replicas
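The aggregation step by itself could look roughly like this (a sketch only: replica discovery, e.g. a kubernetes Endpoints lookup, is left out, and the metric/label names are the same assumptions as above):

```python
# Sketch of the aggregation step only: given the metrics endpoints of all
# traefik replicas (discovery not shown), sum the per-router counter across
# replicas so a delta check sees total traffic, not one replica's.
from collections import defaultdict

import requests
from prometheus_client.parser import text_string_to_metric_families

METRIC_NAME = "traefik_router_requests_bytes_total"


def aggregate_router_bytes(replica_metrics_urls: list[str]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for url in replica_metrics_urls:
        text = requests.get(url, timeout=10).text
        for family in text_string_to_metric_families(text):
            for sample in family.samples:
                if sample.name == METRIC_NAME:
                    # 'router' label assumed; other labels summed over
                    totals[sample.labels.get("router", "")] += sample.value
    return dict(totals)
```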
So, to me, that mostly means:
- only using a single replica (misses much of the point), or
- getting the data from the metrics aggregator (i.e. prometheus-server) instead of directly from traefik
The first option would make the whole feature unavailable in the one place we want it. The upside of the second is that prometheus tends to be running where we want this. The downside is that these metrics shouldn't really be public: exposing the metrics we need exposes usernames and URLs of currently active servers. That pushes us toward an authenticated prometheus instance, which e.g. mybinder.org doesn't have, unless we change how we name routers to be opaque (they still need to be deterministic, but not reversible, so a hash function would work, though it would lead to very ugly metric labels). Even with opaque names, the metrics reveal some (very coarse) data about individual server behavior, even if the individuals aren't immediately known (they might be deduced by other means).
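For illustration, opaque-but-deterministic router names could be as simple as hashing the routespec (a sketch; the prefix and truncation are arbitrary choices):

```python
# Sketch: derive an opaque-but-deterministic router name from the routespec,
# so metric labels don't expose usernames/URLs. Not reversible, but stable
# across restarts.
import hashlib


def opaque_router_name(routespec: str) -> str:
    # e.g. "/user/someone/" -> "router-3f8e1c..." (truncated sha256)
    digest = hashlib.sha256(routespec.encode("utf-8")).hexdigest()[:16]
    return f"router-{digest}"
```

Anyone who already knows a routespec can recompute its hash, but the metric labels alone would no longer spell out usernames or URLs.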
The other upside is we wouldn't need to handle deltas, since we can use `increase` queries and time ranges in prometheus, rather than needing to do our own loop of:
- measure
- measure again
- compare with previous measurement
which means we need to implement storing measurements. If we were talking to prometheus instead of scraping prometheus metrics, we wouldn't need any state, and could do a much simpler `sum(increase(traefik_router_requests_bytes_total[5m])) by (router)` to immediately tell us all the servers with any traffic in the last 5 minutes.
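As a sketch of how little the stateless version needs (the prometheus URL is a placeholder; the query is the one above):

```python
# Sketch: ask prometheus which routers saw any traffic in the last 5 minutes.
# No local state needed.
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder
QUERY = "sum(increase(traefik_router_requests_bytes_total[5m])) by (router)"


def active_routers() -> set[str]:
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    # Each result looks like {"metric": {"router": ...}, "value": [ts, "val"]}
    return {
        r["metric"].get("router", "")
        for r in resp.json()["data"]["result"]
        if float(r["value"][1]) > 0
    }
```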
We could elect to run a dedicated prometheus instance just for these metrics. Not the full prometheus operator, but a dedicated instance with:
- ~30 minute retention
- authentication
That's starting to complicate deployment quite a bit, I imagine, since the prometheus would still need to be configured to discover traefik.
I'm not sure what the best path is right now. Since this is such a specific case (essentially only anonymous binderhub), I am starting to be inclined to implement this against a prometheus server, and say it only really exists for anonymous binderhub, where active URLs aren't meaningful or identifying (and are probably already in prometheus somewhere else).
This is really a special feature for anonymous BinderHub, so another option would be to implement it directly in the binderhub chart:
- enable traefik metrics
- make sure prometheus scrapes traefik
- run a jupyterhub service with the `users:activity` scope that pulls from traefik metrics via prometheus (or via k8s discovery of traefik directly), and then pushes activity to JupyterHub
It makes a certain amount of sense to do it here because we already have the API calls to talk to traefik and the configuration required, but all the trade-offs of enabling this only make sense when you're not launching `jupyterhub-singleuser`, which is exclusively anonymous binderhub in practice.
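For reference, the 'push' half of such a service could be quite small. A rough sketch, assuming a token with the `users:activity` scope, and leaving out both the prometheus query (sketched earlier) and the mapping from router names back to usernames:

```python
# Sketch of the "push" half of such a service: given usernames that had recent
# traffic (e.g. from the prometheus query above, mapped from router names back
# to users), report activity to JupyterHub. JUPYTERHUB_API_TOKEN is assumed to
# carry the users:activity scope.
import os
from datetime import datetime, timezone

import requests

HUB_API = os.environ.get("JUPYTERHUB_API_URL", "http://hub:8081/hub/api")  # placeholder fallback
TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]


def report_activity(username: str) -> None:
    """Record 'now' as last_activity for a user via JupyterHub's REST API."""
    now = datetime.now(timezone.utc).isoformat()
    resp = requests.post(
        f"{HUB_API}/users/{username}/activity",
        headers={"Authorization": f"token {TOKEN}"},
        json={"last_activity": now},
        timeout=10,
    )
    resp.raise_for_status()


def report_all(active_usernames: set[str]) -> None:
    for name in active_usernames:
        report_activity(name)
```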