Metrics for PRs are never expiring ("memory leak")
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
When a new PR is opened, Atlantis starts producing metrics for it. Unfortunately, when the PR is merged, those metrics are never deleted. In our case we realised we were importing 500k metrics (75MB) from Atlantis every minute while we had 0 active PRs.
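To make "expiring" concrete, here is a minimal sketch of a metric store that forgets series that have not been updated within a time-to-live. The class name `ExpiringSeries`, its methods, and the TTL approach are hypothetical and only illustrate the idea; this is not how Atlantis or its metrics library is implemented.

```python
import time


class ExpiringSeries:
    """Toy metric store that forgets series not updated within ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._series = {}   # sorted label tuple -> (value, last_update_time)

    def inc(self, labels, amount=1):
        """Increment the counter for this label set and refresh its timestamp."""
        key = tuple(sorted(labels.items()))
        value, _ = self._series.get(key, (0, None))
        self._series[key] = (value + amount, self.clock())

    def sweep(self):
        """Drop stale series and return how many were removed."""
        cutoff = self.clock() - self.ttl
        stale = [k for k, (_, seen) in self._series.items() if seen < cutoff]
        for key in stale:
            del self._series[key]
        return len(stale)

    def __len__(self):
        return len(self._series)
```

With something along these lines, series for a merged PR would disappear once the PR stops producing updates, instead of accumulating for the lifetime of the process.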
Reproduction Steps
Just use Atlantis as usual, and its memory usage and number of metrics will grow indefinitely.
Logs
There is nothing worrying in the logs.
Additional Context
@krzysztof-magosa Do you know what metric names or labels become the main cause of high cardinality? Is it mainly the pr_number label?
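One way to answer this is to save the output of the metrics endpoint to a file and count the distinct values per label. The sketch below assumes the dump is in the Prometheus text exposition format; the metric name and label values in `SAMPLE` are made up for illustration and are not taken from a real Atlantis instance.

```python
import re
from collections import defaultdict

# Matches one label pair inside the braces of a sample line, e.g. pr_number="42".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def label_cardinality(exposition_text):
    """Return {label_name: number of distinct values} for a metrics dump."""
    values = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        start = line.find("{")
        end = line.rfind("}")
        if start == -1 or end < start:
            continue  # sample without labels
        for name, value in LABEL_RE.findall(line[start + 1:end]):
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical sample; the metric and label names are illustrative only.
SAMPLE = """\
# TYPE atlantis_project_plan_execution_success counter
atlantis_project_plan_execution_success{base_repo="org/infra",pr_number="101"} 1
atlantis_project_plan_execution_success{base_repo="org/infra",pr_number="102"} 3
atlantis_project_plan_execution_success{base_repo="org/infra",pr_number="103"} 2
"""

if __name__ == "__main__":
    counts = label_cardinality(SAMPLE)
    for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {n} distinct values")
```

The label with the most distinct values is the likely driver of the series count.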
If yes, I wonder if we should add a --disable-high-cardinality flag as a workaround which would disable the pr_number label (maybe together with the base_repo). cc. @nitrocode @jamengual
I'd also be happy to work on it if we consider this the way to go (I'd also create a separate issue).
Yes please take a look.
cc: @inkel @lukemassa @albertorm95 @Fabianoshz @TylerLubeck @yoonsio - previous metrics contributors to see if anyone can help or knows where to look for this
https://github.com/runatlantis/atlantis/pulls?q=is%3Apr+metrics+is%3Aclosed+-author%3Aapp%2Frenovate+
FYI. I assume high cardinality was introduced in https://github.com/runatlantis/atlantis/pull/2687
Yes, the high cardinality in this case is due to the pr_number label. This is a tricky one: on the one hand it would be nice to have plan and apply execution times metered at a finer level of detail, but on the other hand it creates this unnecessary high cardinality. IME it is not very common to drill down into which specific PR was the slowest to plan/apply; people mostly want to know the averages, so I vote to drop the label.
I wouldn't say we should lose those finer-grained metrics, though; I think this could be solved within the logs Atlantis generates for each PR.
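As a rough illustration of what dropping the label buys, the sketch below merges series that differ only in `pr_number`. The helper `drop_label` and the sample data are hypothetical; they model the effect on series count and are not Atlantis code.

```python
from collections import Counter


def drop_label(series, label):
    """Sum sample values across series after removing one label."""
    merged = Counter()
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        merged[kept] += value
    return dict(merged)


# Hypothetical data: one counter series per PR, all in the same repository.
series = [
    ({"base_repo": "org/infra", "pr_number": str(n)}, 1)
    for n in range(1, 1001)
]
collapsed = drop_label(series, "pr_number")

print(len(series), "->", len(collapsed))  # prints "1000 -> 1"
```

The totals are preserved, so averages per repository remain available, while the number of series stops growing with every new PR.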
Hi folks, it still happens on the latest version and leads to crashes in environments with limited memory. Would there be any chance to make it configurable, so we can disable unwanted metrics?
It still happens on Atlantis 0.33.0. We need to restart the instance roughly every week, otherwise it consumes tens of gigabytes of memory and crashes.
Still happening on v0.34.0.
Also happening on v0.35.0. Additionally, it seems that removing the metrics configuration from the config file does not prevent Atlantis from collecting these metrics internally. In our case 64GB of RAM suffices for 48h of operation, and then Atlantis crashes with an out-of-memory error.