feat: Add `--max-comments-per-command` configuration
what
Allow Atlantis administrators to cap the number of VCS comments that will be created for a single command. (The default value is 0, which means unlimited comments.)
why
If you're trying something like TF_LOG=debug, Atlantis can produce so many comments that it becomes challenging to read your PR, or worse still, your VCS might rate-limit Atlantis, effectively breaking your ability to use it for an extended period of time. While this PR does not change the default behavior, it's probably a good idea to set this flag to something like 10.
tests
Unit tests for common.SplitComment. (There don't appear to be any other tests of comment splitting.)
references
- Fixes #416.
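(A minimal sketch of the behavior being proposed, for readers skimming the thread. The function name `splitWithCap`, the byte-based splitting, and the truncation notice are illustrative assumptions, not the actual `common.SplitComment` signature touched by this PR.)

```go
package main

import "fmt"

// splitWithCap splits a long comment body into chunks of at most maxSize
// characters and, when maxComments > 0, keeps only the first maxComments
// chunks, using the final slot for a short truncation notice.
// maxComments == 0 preserves the current unlimited behavior.
func splitWithCap(comment string, maxSize, maxComments int, truncationNote string) []string {
	var chunks []string
	for len(comment) > maxSize {
		chunks = append(chunks, comment[:maxSize])
		comment = comment[maxSize:]
	}
	chunks = append(chunks, comment)

	if maxComments > 0 && len(chunks) > maxComments {
		chunks = append(chunks[:maxComments-1], truncationNote)
	}
	return chunks
}

func main() {
	long := "0123456789abcdefghijklmnopqrstuvwxyz" // stand-in for a huge terraform plan
	for i, c := range splitWithCap(long, 10, 3, "(output truncated)") {
		fmt.Printf("comment %d: %q\n", i+1, c)
	}
}
```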
Hi! This is my first Atlantis PR and is a basic attempt to fix #416. I'd love to have some high-level feedback as to whether this is a reasonable approach that would be accepted before I finish it.
So far I've just focused on the top-level structure of how to add a new entry; before this could be merged I would write implementations for the non-GitHub VCSes.
@glasser thanks for the contribution! I think this is definitely a step in the right direction.
I've recently been a bit worried about the growing number of configuration options to the server; I wonder if a first pass could just set a hard-coded value like 10, or even 100? We can always make it configurable later, I just feel like the marginal benefit of comments goes way down after even just a couple, and if you need to capture 327,680 (or 3M) characters, you can (and should) be using a "debug workflow" with a custom command. We could even write a brief "how to" and link to it in the footer. I'm interested in what the maintainers have to say about the tradeoff between not changing current behavior vs stemming the increasing tide of configuration options. And also to you, @glasser: is there a hard-coded number you think would be sufficient (or even, given this is implemented, what had you planned to set it at for your setup)?
Another, unrelated concern I have is clipping the end, when the last few lines will likely have a lot more value than the lines in the middle. I'm not sure really how to address that, and that could well be a future improvement, something akin to: https://github.com/golang/go/issues/7181. For example, if there are determined to be n comments' worth of text and the max is k < n, we could show the first k-2 comments, one comment that says the middle was elided, and then the nth (final) comment. Again this is just an idea for a possible future improvement; I don't think it needs to complicate this PR.
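(A rough sketch of what that elision scheme might look like; `truncateMiddle` and the placeholder wording are illustrative assumptions, not anything implemented in this PR.)

```go
package main

import "fmt"

// truncateMiddle keeps the first k-2 chunks, one placeholder comment noting
// how many chunks were elided, and the final chunk (which usually carries the
// error or the plan summary). k <= 0 means no cap.
func truncateMiddle(chunks []string, k int) []string {
	if k <= 0 || len(chunks) <= k {
		return chunks
	}
	if k < 3 {
		// Not enough room for head + marker + tail; just keep the last k chunks.
		return chunks[len(chunks)-k:]
	}
	elided := len(chunks) - (k - 1) // chunks dropped between head and tail
	out := append([]string{}, chunks[:k-2]...)
	out = append(out, fmt.Sprintf("(%d comment(s) elided)", elided))
	return append(out, chunks[len(chunks)-1])
}

func main() {
	parts := []string{"part 1", "part 2", "part 3", "part 4", "part 5", "part 6"}
	fmt.Println(truncateMiddle(parts, 4))
	// Output: [part 1 part 2 (3 comment(s) elided) part 6]
}
```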
All that said, certainly some logs are better than the crashing behavior we have now, so I definitely want to pursue this.
I do like the idea of preserving (say) the last page. That said, you can also view the full log at the web UI anyway, right? Which in practice is what I'd do for any log that triggers this.
I was trying to avoid backwards-incompatible changes without any way to even revert it, thus the configuration. If you'd prefer a non-configurable change, I'd get that. My project will probably limit to 10.
I do like the idea of preserving (say) the last page. That said, you can also view the full log at the web UI anyway, right? Which in practice is what I'd do for any log that triggers this.
Yeah good point. Let's leave this as "maybe do" for later.
I was trying to avoid backwards-incompatible changes without any way to even revert it, thus the configuration. If you'd prefer a non-configurable change, I'd get that. My project will probably limit to 10.
That's my personal preference, others may differ. We can always make it configurable later if need be, I just want to defer adding a server flag if possible, since those are harder to remove.
I personally agree that getting back >10 comments, each with 30k characters of text, feels pretty cumbersome, and users should at that point be encouraged to pursue a different way of saving/shipping logs.
Ok, cool. I'm not sure how the process works around here -- before I go and rip out the option, is this a relatively solid decision and you're likely to merge the PR, or should we wait for more thoughts?
@GenPage, @nitrocode and I are the maintainers, and @lukemassa has been helping with a lot of contributions lately.
Let's wait for us to catch up with this PR in a few weeks.
Thanks
Ah sorry, I should have been more clear; I was just offering my opinion, I don't have any authority on the matter :) I'll let @jamengual take it from here with a recommendation, apologies for not clarifying that.
Ah sorry, I should have been more clear; I was just offering my opinion, I don't have any authority on the matter :) I'll let @jamengual take it from here with a recommendation, apologies for not clarifying that.
All good @lukemassa, your input is super valuable to us and is always welcome.
I hope next week we can find more time to do reviews; plus we are in the process of changing the release pipeline, so it is going to take some time.
Hi, I'm just a regular engineer and I absolutely do not mean to knock the work you've put in. I really appreciate every contributor to this project, and it's helped us so much. I think this PR is pretty useful functionality to have, but maybe with some minor tweaks. I see a few issues:
- The argument to GithubClient seems unrelated to the rest of the arguments. Could there be a better place to get this value, or otherwise a more generic pattern for passing user configurations to SCM providers? For example, where is CHECKOUT_DEPTH provided? Of all the arguments, it is certainly the odd one out
- I echo the previous concerns about truncating. Truncating the middle is vastly preferable. You'd miss a huge amount of context with this, like how many changes were made, what failed, any error messages. Unless the logs were saved somewhere else (like disk, s3 bucket, etc), the logs of the run would be pretty much useless
- I'd almost prefer some sort of "minification" before a truncation. I find most of my "many comments PRs" come from changes to inline policy templates used across many roles. Terraform doesn't do any sort of comparison right now and lists the entire output of both the "removed" role and then the "new" role. Most of that is useless for anything, so maybe diffs that are larger than 100-200-whatever lines get truncated in the middle. That would dovetail nicely with functionality to save these logs off to S3 or somewhere else, while hardly impacting the usability in any way
- I'm not exactly sure how to achieve this (not familiar with all of scm arch in Atlantis) but I don't see why this would be implemented separately for each SCM (github/gitlab/whatever the other one is). It doesn't seem like configuring these separately per SCM would be especially valuable
These are just my 2c, we do run into this issue sometimes. We would never use TF_LOG=debug in Atlantis since we're admins and can run things locally, but honestly I don't think you'd ever want that output directly in a PR comment. It might contain tokens or sensitive values, and lots of people can see those PR comments. It would be far better to filter ALL debug output from the comments and have a secondary storage medium for unfiltered output (that is of course highly restricted, like Atlantis itself should be). Disk is easiest at first, but then you have to think about cleanup. S3 is AWS, but there are tools that allow mounting a bucket as a file path. Then you can push cleanup onto the users and avoid a whole mess of complexity around filling your disk within Atlantis itself
The argument to GithubClient seems unrelated to the rest of the arguments. Could there be a better place to get this value, or otherwise a more generic pattern for passing user configurations to SCM providers? For example, where is CHECKOUT_DEPTH provided? Of all the arguments, it is certainly the odd one out
Sure, it would be reasonable to have a struct for common configuration for creating comments.
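(For illustration only, a sketch of what such a shared struct might look like; the type and field names here are assumptions, not Atlantis's actual API.)

```go
package vcs

// CommentOptions bundles server-level comment settings so they can be handed
// to every VCS client in one piece instead of growing each constructor's
// argument list.
type CommentOptions struct {
	// MaxCommentsPerCommand caps how many comments a single command may
	// create; 0 means unlimited, matching the proposed flag's default.
	MaxCommentsPerCommand int
}

// Each client keeps its own size limit and header/footer formatting, but
// reads the shared settings from the same struct.
type GithubClient struct {
	opts CommentOptions
	// ...GitHub-specific fields...
}

type GitlabClient struct {
	opts CommentOptions
	// ...GitLab-specific fields...
}
```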
I'm not exactly sure how to achieve this (not familiar with all of scm arch in Atlantis) but I don't see why this would be implemented separately for each SCM (github/gitlab/whatever the other one is). It doesn't seem like configuring these separately per SCM would be especially valuable
It's not being configured separately per SCM, just passed through. (The different SCMs do have different formats for their headers/footers and length limits, which is why the split call is happening inside the SCM code.)
It's not being configured separately per SCM, just passed through. (The different SCMs do have different formats for their headers/footers and length limits.)
Totally missed that. Awesome, great work. It wouldn't especially help our team / teams (we have 10+ Atlantis instances running internally; my team runs 2 - one for dev, one for prod), but what I am seeing is that teams are simply taking the output and persisting it elsewhere via a Python script.
I (highly) appreciate everyone who has contributed to this project. My ideas were mostly suggestions to spur future thoughts; I don't expect them to make it into your PR or to complicate things. However, middle truncation is kind of a baseline piece of functionality. It would almost be better to limit the maximum number of projects that can run, or to allow custom filtering inline.
Many different people are using Atlantis. If this were, e.g., a default of 10 comments per Atlantis command (e.g., atlantis plan), then nearly every single PR we have would be truncated. We have a monorepo (this was to get the team familiar with tf; now we're splitting new things out) with (these are made up numbers) ~40 root modules and ~40 child modules. We have 8 separate environments. So if you were to make a change that impacted 4 modules that were used in all environments, that's 32 different runs of terraform alone. It's easy to make a change that requires 50, 60 terraform executions. If you have even a single module that needs to split outputs into multiple comments, you're automatically looking at 70+ comments just for a SINGLE execution of atlantis plan. It's not uncommon to have 150+ comments from Atlantis by the time it's done
I'm not expecting you to change anything on our behalf. Just trying to give some perspective on different use cases in case it's useful. Huge appreciation for the maintainers along with everyone who contributes great, useful features like this!
That seems like a good data point about the question @lukemassa raised as to whether it would be reasonable to have this be a non-configurable hardcoded value (with the answer "no").
That seems like a good data point about the question @lukemassa raised as to whether it would be reasonable to have this be a non-configurable hardcoded value (with the answer "no").
We would be stuck on whatever version of Atlantis allowed us to use it, I guess :)
If there were some sort of mechanism to persist the log files to "offsite storage" (a bucket seems logical, since there are relatively few restrictions beyond access, and it removes the burden of cleanup from Atlantis entirely; let users clean up their own bucket), that would completely resolve any concerns about truncation, number of comments, whatever.
The logs in the "Detail Link" are kind of a joke unless you're just testing a single module to learn terraform and have time to sit around and watch it. 95% of the time the data is gone by the time we need it. Since Atlantis is in a completely isolated environment, non-admins can't get anywhere near that cluster. I don't think more advanced devops people would care if it was removed entirely, for the most part. Our users can't access it, and it would be empty if they did. Without persistence, pre/post workflow hooks, stuff like that, it's maybe between 0-5% useful in my experience. It's certainly not close to a replacement for the PR comments, which are available for more than the 3-4 minutes it takes for the module to run. The single use case is "my module has been running for 45 minutes and I have no idea what's going on, let's see what's up". But out of thousands and thousands of deployments we've never actually needed those logs to debug that issue (hint: it's network connectivity or you're replacing node groups on an EKS cluster).
The point there is that those live streaming logs are not a replacement for anything. Possibly "offsite" persistence via s3 could actually make those logs useful, who knows. I really apologize if one of you implemented those. Absolutely nothing personal and I'll buy you a drink next time you're in town. Sorry for being brutally honest
We have teams persisting to gist but I don't think that's generic enough for Atlantis code. There just needs to be an extremely simple adapter to optionally persist the full, unaltered logs to a second "offsite" location. Lifecycle rules, cleanup, whatever, could be entirely left to the user. No need to complicate Atlantis.
That would remove any pressure on changes to the PR comments, truncation, formatting, anything. No matter what, you'd always be able to refer to the actual unfiltered output. The PR comments are critical right now because you can't do anything without them. They're the source of truth. Full, unaltered text output could become the source of truth, and give more flexibility to alterations in the content of the comments.
And hey, you could probably read those files back in when someone wants to see the output of a job via the "Show Details" link. Then rather than the constant disappointment that greets users when they see that empty, black screen, there's a chance it could help them solve a problem.
Just to recap, since my comments might be interpreted as off-topic, here's the gist:
- This directly relates to the "comments issue" because it would always provide a fallback option if something got truncated in the comments. Otherwise, no matter what, truncation is losing some data
- Right now the PR comments are the ONLY viable way to see what happened during an Atlantis execution
- The "Show Details" link is not helpful because it's empty most of the time; it seems to be empty as soon as the build finishes
- With fully enabled and unfiltered "offsite" storage, comments could be experimented with and made much more flexible without breaking the single interface to your job results
- Offsite storage (vs storing it on the volume) means that Atlantis doesn't need to implement ANY logic for cleanup. Make users do it; there are so many options, S3 has lifecycle rules for pretty much anything, and a tiny Lambda could do it. Atlantis shouldn't need to worry about it
There is no fallback right now if truncation is merged and breaks a 1/10000 use case. Changes to the primary interface with 0 fallbacks are going to be inherently risky. People will come in hot with stuff like "<you broke my insane use case of programmatically parsing the GHE comments and using them in our issue tracker system>". At least if the logs are in S3 or somewhere, you can tell them to use that instead. It's like git plumbing vs porcelain commands: offsite output is plumbing, PR comments are porcelain.
@glasser @brandon-fryslie I'm a bit conflicted by this feature. It is trying to solve a problem that does not pertain to Atlantis's core features, and I will explain why.
- Terraform debug output: this is usually considered sensitive; it is not a good idea to expose it anywhere it can be saved, especially in a repo where it will be held forever.
- Debug logs: debug/trace logs are usually too lengthy and are traditionally sent to other systems for reconciliation, searching, etc.
Based on that, if you want to debug a run in Atlantis, you should enable debug and use the Details live logs to see the output, but I wouldn't recommend running a debug run, based on my earlier comment.
Atlantis support for external log storage would be better suited for this kind of stuff, where you can send logs to S3, CloudWatch, Fluentd or any other system instead of your PR comments. You could have a repo server-side config like:
```yaml
external-logging: enabled
external-log-facility: localhost:2020
external-log-strategy: debug   # options could be debug, all
```
That could point to your log aggregator to then send the log somewhere else to be seen by whoever is trying to debug.
That feature does not exist and would need to be built.
@GenPage
I like the idea here of having a generic logging mechanism; I think that's definitely a better way of handling large logs than posting dozens of comments in PRs! :)
That said, I think some variant of this PR should still be pursued. Not as a solution to "how do I read long logs" (which I agree should be dealt with in a separate issue), but for the particular problem of "how do I prevent my VCS from falling over/rate-limiting me/etc. when I accidentally send it 100s of comments".
For that particular issue, I think we should impose a limit, similar to what's in the PR, of some hard-coded value (10?), with the expectation that the user will then change their behavior (turn off debug? add a custom workflow?) if they want to capture >300k characters.
@jamengual We do not, and would never, use TF_LOG=debug within Atlantis itself. The potential for security issues would be too great.
I am speaking more generally about modules with extremely long output. We have 60+ modules and some of them have quite a lot of output. In addition, the "show details" link is basically useless unless you've got enough time to sit there and watch your deployments line by line. Our implementation is larger. By the time you check in (or get a notification that something failed, etc.), they're gone. I understand not wanting to leak memory and have to clean up.
External log storage would be extremely preferable to me, and backing up those "Show Detail" links from external storage would be a delight. We've had runs where there are literally dozens upon dozens of comments "continued" from a previous comment.
That is in no way useful and often comes from something innocuous like someone modifying an inline policy that is used across dozens of roles that exist in every environment (inline policy = prints both old and new policy, does not diff, so could be 100 lines, x 30 roles, x 8 environments = 24k lines). Yes maybe it shouldn't be an inline policy, but there can be other examples where the output is simply too large for github comments to be useful. I think that's the crux of the problem.
And +1000 to being able to get this stuff into Elasticsearch. Right now people are asking "How often does this bug happen (a bug with our implementation of Atlantis)? Is it happening more in certain environments? Certain modules?" and I don't have a good answer because there's no way to query like that.
Even writing them to disk somewhere would let us add a sidecar to forward them to an aggregator. Atlantis shouldn't need to understand S3 or the million options for offline storage. Just give us the ability to write them to disk in some folder structure; then we can have a sidecar handle our needs as far as getting them elsewhere. The standard logrotate binary could be configured for cleanup, and while they're on disk you can use them for the Show Details link, so those become useful. It's a win all around.
For that particular issue, I think we should impose a limit, similar to what's in the PR, of some hard-coded value (10?), with the expectation that the user will then change their behavior (turn off debug? add a custom workflow?) if they want to capture >300k characters.
Nearly every single PR we open has more than 10 (separate, individual) comments. If someone changes a line in a module that is used by 5 other modules that are deployed to 5 environments, we're already looking at 25 comments minimum. It's never a big problem (except Atlantis isn't hiding the old comments right now for whatever reason). The max should absolutely be configurable, or maybe Atlantis actually deletes every comment that is over the threshold. 10 comments wouldn't get us through a single PR review. We also run 2 instances of Atlantis (obviously they run on mutually exclusive projects; one is dev and one is prod).
I see ok, I understand the issue a bit better now.
I know, @lukemassa, that we do not want to add more flags, but I think this feature needs to be configurable since it depends so much on the user's workflow; we can set a default if it is not set.
I looked into this closer, and didn't realize that GitLab (the VCS I use) allows 1M characters per comment, whereas GitHub only allows 65k! I have some relatively large plans that would easily span a few comments were I to use GitHub; I didn't realize it was such a discrepancy.
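(For a rough sense of scale using those per-comment limits: a cap of 10 comments allows about 10 × 65k ≈ 650k characters of output on GitHub, but about 10 × 1M = 10M characters on GitLab, so the same hard-coded cap means very different things depending on the VCS.)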
In that case I think making this configurable is a reasonable option.
This issue is stale because it has been open for 1 month with no activity. Remove stale label or comment or this will be closed in 1 month.
It seems like this is unlikely to be accepted? I do get that TF_LOG=debug probably just shouldn't be done and that that's the most likely way that this would end up overloading the rate limit. Still a bit of a bummer though that it can happen.
My opinion is we should still pursue this PR with a few tweaks:
- Frame this option in the comments/documentation not as a feature to enable large logs, but an operational guardrail against spamming your VCS
- Set the default to 100 (though keep the semantics of 0 means unlimited)
- Add a comment to the documentation that says "if you are generating huge output per run, consider using XYZ mechanism"
- If such a mechanism doesn't exist yet, I think that's a worthwhile feature, but I fear adding it here would be feature creep. Instead, open a separate issue and have the comment say "If you are generating huge output per run, please discuss and give feedback on open issue 1234"
Again I have no authority here I'll leave it up to the maintainers to weigh in, just my two cents.
This issue is stale because it has been open for 1 month with no activity. Remove stale label or comment or this will be closed in 1 month.
Still in the same state — not sure what maintainers want from this.
Maintainer here, I agree with Luke's opinion here. Make sure the docs are updated and clear that this isn't intended as a logging feature but as a guardrail to prevent VCS PR comment spam/bloat.
My opinion is we should still pursue this PR with a few tweaks:
- Frame this option in the comments/documentation not as a feature to enable large logs, but an operational guardrail against spamming your VCS
- Set the default to 100 (though keep the semantics of 0 means unlimited)
- Add a comment to the documentation that says "if you are generating huge output per run, consider using XYZ mechanism"
- If such a mechanism doesn't exist yet, I think that's a worthwhile feature, but I fear adding it here would be feature creep. Instead, open a separate issue and have the comment say "If you are generating huge output per run, please discuss and give feedback on open issue 1234"
Again I have no authority here I'll leave it up to the maintainers to weigh in, just my two cents.
@GenPage Thanks! Do you think it should be limiting by default or something you have to turn on to change the current behavior?
@glasser Let's not limit it by default in this PR, but please note in the docs that we plan to make it the default in the future. This will give people some time to get used to the feature themselves (opt-in vs opt-out).