Atlantis: metrics for PRs never expire ("memory leak")

Overview of the Issue

When a new PR is opened, Atlantis starts producing metrics for it. Unfortunately, when the PR is merged, those metrics are never deleted. In our case we realised we were importing 500k metrics (75MB) from Atlantis every minute while we had 0 active PRs.

Reproduction Steps

Just use Atlantis as usual, and its memory usage and number of metrics will grow indefinitely.
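
To illustrate why this grows without bound, here is a minimal sketch of a toy metrics registry that, like the behaviour described above, never removes a series once it has been created. It is not Atlantis code: the registry type, the metric name plan_execution_time and the label value org/repo are all made up for illustration. It only shows that a per-PR label creates one series per PR ever seen, while the same metric without that label stays at a single series.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registry is a toy stand-in for a metrics registry (not Atlantis's real one):
// one entry per unique (metric name, label set) pair, and nothing is ever removed.
type registry struct {
	series map[string]float64
}

func newRegistry() *registry {
	return &registry{series: make(map[string]float64)}
}

// seriesKey builds a stable identity from the metric name and its sorted labels.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		fmt.Fprintf(&b, ",%s=%s", k, labels[k])
	}
	return b.String()
}

// observe adds v to the series identified by name and labels, creating it if needed.
func (r *registry) observe(name string, labels map[string]string, v float64) {
	r.series[seriesKey(name, labels)] += v
}

// simulate records one plan per PR and returns how many series exist afterwards.
func simulate(prs int, withPRLabel bool) int {
	r := newRegistry()
	for pr := 1; pr <= prs; pr++ {
		labels := map[string]string{"base_repo": "org/repo"} // hypothetical value
		if withPRLabel {
			labels["pr_number"] = fmt.Sprint(pr)
		}
		r.observe("plan_execution_time", labels, 1.0) // hypothetical metric name
	}
	return len(r.series)
}

func main() {
	fmt.Println("series with pr_number:", simulate(1000, true))
	fmt.Println("series without pr_number:", simulate(1000, false))
}
```

In this toy model the series count tracks the total number of PRs ever processed, not the number of currently open PRs, which matches the symptom of a large metric volume with 0 active PRs.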

Logs

There is nothing worrying in the logs.

Additional Context

Mar 07 '24 16:03 krzysztof-magosa

@krzysztof-magosa Do you know what metric names or labels become the main cause of high cardinality? Is it mainly the pr_number label?
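
One way to answer that question from a saved scrape is to count distinct values per label. The sketch below assumes the metrics are available as Prometheus text exposition format; the sample metric name atlantis_example_plan_time and its values are invented for illustration and are not taken from a real Atlantis instance. The parser is deliberately simple (a regular expression over the part between braces), so treat it as a rough diagnostic rather than a full format parser.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// labelPair matches name="value" pairs, allowing backslash-escaped characters in the value.
var labelPair = regexp.MustCompile(`([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"`)

// distinctLabelValues returns, for each label name, how many distinct values
// appear across all sample lines of the given exposition text.
func distinctLabelValues(exposition string) map[string]int {
	seen := make(map[string]map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		open := strings.Index(line, "{")
		end := strings.LastIndex(line, "}")
		if open < 0 || end < open {
			continue // sample without labels
		}
		for _, m := range labelPair.FindAllStringSubmatch(line[open+1:end], -1) {
			if seen[m[1]] == nil {
				seen[m[1]] = make(map[string]struct{})
			}
			seen[m[1]][m[2]] = struct{}{}
		}
	}
	counts := make(map[string]int, len(seen))
	for name, values := range seen {
		counts[name] = len(values)
	}
	return counts
}

// sample is made-up exposition text, not real Atlantis output.
const sample = `# TYPE atlantis_example_plan_time gauge
atlantis_example_plan_time{base_repo="org/repo",pr_number="1"} 1.5
atlantis_example_plan_time{base_repo="org/repo",pr_number="2"} 2.0
atlantis_example_plan_time{base_repo="org/repo",pr_number="3"} 0.7
`

func main() {
	for name, n := range distinctLabelValues(sample) {
		fmt.Printf("%s: %d distinct values\n", name, n)
	}
}
```

The label with by far the most distinct values is the likely driver of cardinality.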

If yes, I wonder if we should add a --disable-high-cardinality flag as a workaround which would disable the pr_number label (maybe together with the base_repo). cc. @nitrocode @jamengual

I'd also be happy to work on it if we consider this the way to go (I'd also create a separate issue).
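
A rough sketch of what such a switch could do, assuming labels are passed around as a plain map before a metric is emitted. Nothing here exists in Atlantis today: the function filterLabels, the highCardinalityLabels set and the project label are hypothetical, and the boolean stands in for the proposed --disable-high-cardinality flag.

```go
package main

import "fmt"

// highCardinalityLabels lists the labels the proposed (hypothetical) flag would drop.
var highCardinalityLabels = map[string]bool{
	"pr_number": true,
	"base_repo": true,
}

// filterLabels returns a copy of labels. When disableHighCardinality is true,
// labels listed in highCardinalityLabels are left out of the copy.
func filterLabels(labels map[string]string, disableHighCardinality bool) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if disableHighCardinality && highCardinalityLabels[k] {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	labels := map[string]string{
		"base_repo": "org/repo", // hypothetical values
		"pr_number": "42",
		"project":   "network",
	}
	fmt.Println("flag off:", filterLabels(labels, false))
	fmt.Println("flag on: ", filterLabels(labels, true))
}
```

Returning a copy keeps the caller's map untouched, so the same labels could still be used for logging even when they are stripped from metrics.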

Jan 29 '25 10:01 oleg-glushak

Yes please take a look.

cc: @inkel @lukemassa @albertorm95 @Fabianoshz @TylerLubeck @yoonsio - previous metrics contributors to see if anyone can help or knows where to look for this

https://github.com/runatlantis/atlantis/pulls?q=is%3Apr+metrics+is%3Aclosed+-author%3Aapp%2Frenovate+

Jan 30 '25 06:01 nitrocode

FYI. I assume high cardinality was introduced in https://github.com/runatlantis/atlantis/pull/2687

Jan 30 '25 07:01 oleg-glushak

Yes, high cardinality in this case is due to the pr_number label. This is a tricky one, because on the one hand it would be nice to have plan and apply execution times metered with a finer level of detail, but on the other hand it creates this unnecessary high cardinality. IME it is not very common to drill down into which specific PR was the slowest to plan/apply; people usually want to know the averages, so I vote to drop the label.

I wouldn't say we should lose those finer level metrics, tho, but I think this could be solved within the logs Atlantis generates for each PR.

Feb 07 '25 01:02 inkel

Hi folks, it still happens on the latest version and leads to crashes in environments with limited memory. Would there be any chance to make it configurable, so that we can disable unwanted metrics?

Mar 13 '25 09:03 krzysztof-magosa

It still happens on Atlantis 0.33.0. We need to restart the instance roughly every week, otherwise it consumes tens of gigabytes of memory and crashes.

May 08 '25 15:05 krzysztof-magosa

Still happening on v0.34.0.

May 20 '25 15:05 krzysztof-magosa

Also happening on v0.35.0. Additionally, it seems that removing the metrics configuration from the config file does not prevent Atlantis from collecting these metrics internally. In our case 64GB of RAM is enough for 48h of operation, and then Atlantis crashes with an out-of-memory error.

Jul 23 '25 10:07 krzysztof-magosa