# Metrics scaling

## Problem
At the moment, benji-{backup,restore}-pvc scripts push metrics to pushgateway immediately upon wrapped benji process exit, which is likely good enough for many use-cases.
In our case, however, I back up ~20k volumes in parallel with something like this (simplified):
```shell
kubectl get pvc -n xyz -o=custom-columns=:.metadata.name --no-headers \
  | xargs -I{} -P32 -rn1 benji-backup-pvc --field-selector="metadata.name={}"
```
This runs as a cronjob pod on a dedicated k8s worker and speeds up the backup process greatly. However, with 32 parallel threads it completely overwhelms the pushgateway, no matter how vertically big the instance is: eventually all pushes begin to time out, even with a high timeout set. So even though it is unable to accept any metrics, the pushgateway still becomes the performance bottleneck for the backup process.
Our first idea was to scale pushgateway horizontally, but unfortunately, this is not really an option because of fragmentation and the fact that it uses memory-backed storage for metrics. Furthermore, scaling it horizontally is an anti-pattern, according to the developers.
## Ideas
To work around that, we have a couple ideas that could be feasible to apply here (in no particular order of preference):
- Allow the parallel execution thread count to be configured as a chart value, collect metrics internally for the entire run of the wrapper script, and then submit all of the metrics to the pushgateway at once; alternatively, an internal rate-limiter/aggregator could do this every few minutes.
- Allow users to set a flag so that metrics are never submitted directly but are instead buffered into a temporary file, and implement something like `benji-push-metrics` to PUT the metrics under the same label group, refreshing the entire state. This helper could then be called at the end of the run, or continuously as a background process at a set interval, to incrementally update the state exposed on the pushgateway export endpoint.
- Allow users to configure a custom exporter that writes to a file instead of the pushgateway, e.g. via a `file://` scheme in the pushgateway configuration; it is then easy to submit such a file using `curl`.
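To illustrate the buffer-then-PUT idea, here is a rough sketch. The metric name, spool file path, and job label below are made up, and `benji-push-metrics` is a hypothetical helper name; only the pushgateway endpoint semantics are real (a PUT to `/metrics/job/<job>` replaces the whole metric group).

```shell
# Hypothetical spool file where wrapper runs accumulate their metrics
# in the Prometheus text exposition format instead of pushing directly.
SPOOL=/tmp/benji-metrics.prom

# Each wrapper invocation appends its metrics to the spool file.
# The metric name and label here are illustrative only.
cat >> "$SPOOL" <<'EOF'
benji_backup_duration_seconds{pvc="data-0"} 42.7
EOF

# Roughly what the hypothetical benji-push-metrics helper could do:
# submit the whole file in one request, e.g. at the end of the run
# or every few minutes from a background loop.
push_metrics() {
  # PUT replaces the entire metric group for this job on the pushgateway.
  curl --silent --request PUT \
       --data-binary @"$SPOOL" \
       "http://pushgateway:9091/metrics/job/benji-backup"
}
```

This keeps the pushgateway down to one request per run (or per interval) instead of one per volume, regardless of how many parallel workers are appending.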
Any of these options would help me scale this better and marvel at the metrics at the same time 🙂
We are not developers per se, but if this project is actively maintained and you like one of these options more than the others, we could handle the implementation, provided the PR is not going to collect dust.
Please let me know what you think or if you want any more information, I would be glad to help.
Thank you for reaching out, that's quite a lot of volumes that you have. So I gather that submitting several large requests to the pushgateway works, but it's overwhelmed by many small requests.

At first glance I like your first option the most, but it is probably also the most involved one. What I don't like about the other options is that with an external file we might run into issues with concurrent access.

On the other hand, I've been toying with the idea of using a slimmed-down version of https://argoproj.github.io/argo-workflows/ for automating backups instead of simple cronjobs. Such a file could be passed through as an artifact, or there could be separate files which are aggregated at the end of the workflow and pushed to the gateway. I will need to think about this a bit.
We also need to consider that aggregating metrics makes it more likely that some or all of them are lost due to uncaught exceptions or other unhandled errors.
@crabique I've extended benji-backup-pvc to accept a list of PVCs, which might help with your use case. PVCs can be specified as `<name>`, in which case the namespace specified by `--namespace` is used as a default, or as `<namespace>/<name>`.
Hi @elemental-lf! Thanks for the update, and sorry for the radio silence.
Unfortunately, this doesn't address the parallel execution aspect by itself, but I think combining it with xargs could be a good workaround: multiple batches of PVC names would be passed to benji-backup-pvc at a time, so that metrics are pushed in batches at a lower rate. How many PVCs per batch would be a sensible number in your opinion?
Apart from the maximum command-line length, which xargs probably takes into account, I have no recommendation as to the number of PVCs per benji call. Specifying ten PVCs per call, for example, should reduce the number of calls to the pushgateway by the same factor. I think you'd have to experiment to find how much batching you need to avoid overwhelming the pushgateway.
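For reference, a batched variant of the pipeline from the first comment might look like this. This is a sketch: it assumes benji-backup-pvc accepts multiple `<namespace>/<name>` arguments as described above, and the namespace `xyz` and batch size of ten are just the example figures from this thread.

```shell
# Prefix each PVC name with its namespace, then hand benji-backup-pvc
# batches of ten PVCs across four parallel workers. Compared to one
# invocation per PVC, this cuts the number of pushgateway pushes by a
# factor of ten.
kubectl get pvc -n xyz -o=custom-columns=:.metadata.name --no-headers \
  | sed 's|^|xyz/|' \
  | xargs -r -n10 -P4 benji-backup-pvc
```

Lowering `-P` from 32 and raising `-n` further trades backup throughput for even less pushgateway pressure, so the two knobs can be tuned together.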