Receiving 401 response during atlantis apply when using GitHub App authentication method
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
- If you are interested in working on this issue or have submitted a pull request, please leave a comment.
Overview of the Issue
Since May 20, we have been receiving the error message below when atlantis apply is run.
{"level":"error","ts":"2022-05-24T12:17:26.429Z","caller":"events/command_runner.go:219","msg":"Unable to check user permissions: non-200 OK status code: 401 Unauthorized body: \"{\\\"message\\\":\\\"Bad credentials\\\",\\\"documentation_url\\\":\\\"https://docs.github.com/graphql\\\"}\"","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/events.(*DefaultCommandRunner).RunCommentCommand\n\tgithub.com/runatlantis/atlantis/server/events/command_runner.go:219"}
This is the same error as in https://github.com/runatlantis/atlantis/issues/2187, but we are not using gh-team-allowlist and are getting the error with 0.17.5 (which does not have that feature) so I opened a separate issue.
In that issue and in https://github.com/runatlantis/atlantis/issues/2090, I've seen notes about checking the GitHub API rate limit, but I'm not sure that's possible with the OAuth installed application, since I don't have a token to call the rate limit APIs (REST or GraphQL, but this appears to be GraphQL from the error message) as that application user.
We originally noticed this behavior after upgrading from 0.17.5 to 0.19.2 and thought it was a bug in that release, so we upgraded to 0.19.3 ~and then downgraded back to 0.17.5 -- but all 3~ but both have been exhibiting the same behavior.
Recreating the Atlantis container immediately resolves the problem (my guess is because it gets a new token from the oauth flow). That is obviously not a great workflow, but it is our workaround for the moment.
Reproduction Steps
While we see it regularly, I don't know how to provide steps that someone else could use to reproduce the behavior.
Environment details
- Atlantis version: ~0.17.5,~ 0.19.2, 0.19.3
- Atlantis flags:
ATLANTIS_GH_ORG="..."
ATLANTIS_GH_APP_ID="..."
ATLANTIS_GH_WEBHOOK_SECRET="..."
ATLANTIS_REPO_CONFIG="/etc/atlantis/atlantis.yaml"
ATLANTIS_GH_APP_KEY_FILE="/etc/atlantis/github-app-key.pem"
ATLANTIS_WRITE_GIT_CREDS="true"
Additional Context
We're running Atlantis as a GitHub app and in a container, which is how we have run it for the past ~1.5 years. Our cadence of updates to the repo that Atlantis is watching has not increased (if anything, it has slowed some).
Update: we are in fact only seeing this on 0.19.2 and 0.19.3. Our original downgrade back to 0.17.5 didn't actually take effect at first, so we thought we were still seeing it with the older version.
Because the error is about authentication, we decided to try updating back to 0.19.3 but changing the authentication from GH App to user+token.
Previously the error would occur after a few hours (regardless of activity levels), but so far after changing the authentication, we have not been receiving 401 responses when applying.
We've now been running a 0.19.3 Atlantis version since June 9 using the user+token with no errors.
We'd very much like to use the GH App route again, but it seems fairly clear there's some sort of issue there.
I'm regularly seeing this. I have to restart the Atlantis pod in k8s before it will work again. I would very much like to see this resolved. Could this be related to some credential that isn't being refreshed properly?
I have also faced this issue and I started investigating it. I think the bug comes from the GitHub App implementation.
Installation access tokens have the permissions configured by the GitHub App and expire after one hour.
Source: GitHub App documentation.
The problem here is that Atlantis fetches the token only when the server starts and when a new GitHub repository clone is made. Every time a new PR is opened, a new clone is made and the token is refreshed. The bug therefore appears when Atlantis has been "idle" for more than one hour: the token expires and nothing triggers a refresh.
I am not familiar with the codebase, but these are the calls I found, which led me to this conclusion:
My suggestion for this fix would be the following:
- Compute a timestamp at 55 minutes after the moment when the GitHub App token arrived and cache it;
- If a new refresh token call is made, compare the current timestamp with the cached timestamp and if the current timestamp is after the cached timestamp, perform the token refresh query (and also refresh the cached timestamp), else skip the call;
- Add a token refresh call every time a new plan or apply comment arrives.
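As a rough illustration of the first two suggestions, here is a minimal Go sketch of a token cache that skips the refresh while the cached token is younger than 55 minutes. The names `tokenCache`, `newTokenCache`, `refreshAfter`, and the `fetch` callback are hypothetical and do not come from the Atlantis codebase; `fetch` stands in for whatever call actually requests a new installation token.

```go
package main

import (
	"sync"
	"time"
)

// refreshAfter is the suggested 55-minute margin, chosen to stay under the
// one-hour lifetime of an installation access token.
const refreshAfter = 55 * time.Minute

// tokenCache is a hypothetical helper, not a type from the Atlantis codebase.
type tokenCache struct {
	mu        sync.Mutex
	fetch     func() (string, error) // placeholder for the real token request
	maxAge    time.Duration          // how long a fetched token is reused
	token     string
	refreshAt time.Time // time after which the cached token is refreshed
}

func newTokenCache(fetch func() (string, error), maxAge time.Duration) *tokenCache {
	return &tokenCache{fetch: fetch, maxAge: maxAge}
}

// Token returns the cached token, fetching a new one first when nothing is
// cached yet or the current time has reached refreshAt.
func (c *tokenCache) Token() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	now := time.Now()
	if c.token != "" && now.Before(c.refreshAt) {
		return c.token, nil // still inside the window: skip the refresh call
	}
	tok, err := c.fetch()
	if err != nil {
		return "", err
	}
	c.token = tok
	c.refreshAt = now.Add(c.maxAge)
	return tok, nil
}
```

Under this sketch, the third suggestion would amount to calling `Token()` at the start of every plan or apply comment, so an idle period longer than the window triggers exactly one refresh on the next command.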
I'd ask someone who also knows the codebase to confirm my findings and if everything is correct, I will open a PR to solve this issue.
@jamengual can you, please, take a look over my response?
We see the same issue with v0.19.9-pre.20220822 but, surprisingly, only when Atlantis is making GraphQL calls, never on normal plan/apply operations.
@lilincmu this only happens when using a GitHub App.
@valentindeaconu we are working on this
https://github.com/runatlantis/atlantis/issues/2469
This is indeed happening only when using GitHub App auth, where tokens have a 1 hour lifetime (https://docs.github.com/en/developers/apps/building-github-apps/authenticating-with-github-apps#authenticating-as-an-installation). The token for the GraphQL calls is created only once, during client initialization (https://github.com/runatlantis/atlantis/blob/96a25bdbcaeda3596ad5e5878923357b3ac474d2/server/events/vcs/github_client.go#L96).
The underlying ghinstallation library includes the ability to refresh tokens if they're near expiration, I just don't believe we have enough information once the client is created to be able to do that - we need the credentials and the graphql url to be able to make those calls.
I thought about a couple of ways to handle this - the cleanest to me seems to be to remove the initialization of the graphql client from the GithubClient initialization flow, and do that on-the-fly as we're making graphql queries to ensure we always have an up-to-date token.
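To make the on-the-fly idea concrete, here is a minimal Go sketch using only the standard library. It is not the Atlantis implementation and does not use the ghinstallation API: `newGraphQLRequest` and the `token` callback are hypothetical names, and `token` is assumed to return a currently valid installation token each time it is called.

```go
package main

import (
	"net/http"
	"strings"
)

// newGraphQLRequest is a hypothetical helper that builds the request for a
// single GraphQL call. The token is resolved here, at query time, instead of
// being captured once when the client is created.
func newGraphQLRequest(url, body string, token func() (string, error)) (*http.Request, error) {
	tok, err := token() // assumed to return an unexpired installation token
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+tok)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}
```

Because the Authorization header is set per request, a token that expired while the server sat idle is never reused, at the cost of one extra token lookup per query.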
I'll put up a PR with an example setup for this shortly.