Receiving 401 response during atlantis apply when using GitHub App authentication method

Open cjbehm opened this issue 3 years ago • 6 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Since May 20, we have been receiving the error message below whenever atlantis apply is run.

{"level":"error","ts":"2022-05-24T12:17:26.429Z","caller":"events/command_runner.go:219","msg":"Unable to check user permissions: non-200 OK status code: 401 Unauthorized body: \"{\\\"message\\\":\\\"Bad credentials\\\",\\\"documentation_url\\\":\\\"https://docs.github.com/graphql\\\"}\"","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/events.(*DefaultCommandRunner).RunCommentCommand\n\tgithub.com/runatlantis/atlantis/server/events/command_runner.go:219"}

This is the same error as in https://github.com/runatlantis/atlantis/issues/2187, but we are not using gh-team-allowlist and are getting the error with 0.17.5 (which does not have that feature) so I opened a separate issue.

In that issue and in https://github.com/runatlantis/atlantis/issues/2090, I've seen notes about checking the GitHub API rate limit, but I'm not sure that's possible with the OAuth installed application, since I don't have a token to call the rate limit APIs (REST or GraphQL, but this appears to be GraphQL from the error message) as that application user.

We originally noticed this behavior when we did an upgrade from 0.17.5 to 0.19.2 and thought that it was a bug, so upgraded to 0.19.3 ~and then downgraded back to 0.17.5 -- but all 3~ but both have been exhibiting the same behavior.

Recreating the Atlantis container immediately resolves the problem (my guess is because it gets a new token from the oauth flow). That is obviously not a great workflow, but it is our workaround for the moment.

Reproduction Steps

While we see it regularly, I don't know how to provide steps that someone else could use to reproduce the behavior.

Environment details

  • Atlantis version: ~0.17.5,~ 0.19.2, 0.19.3
  • Atlantis flags:
ATLANTIS_GH_ORG="..."
ATLANTIS_GH_APP_ID="..."
ATLANTIS_GH_WEBHOOK_SECRET="..."
ATLANTIS_REPO_CONFIG="/etc/atlantis/atlantis.yaml"
ATLANTIS_GH_APP_KEY_FILE="/etc/atlantis/github-app-key.pem"
ATLANTIS_WRITE_GIT_CREDS="true"

Additional Context

We're running Atlantis as a GitHub app and in a container, which is how we have run it for the past ~1.5 years. Our cadence of updates to the repo that Atlantis is watching has not increased (if anything, it has slowed some).

cjbehm avatar May 31 '22 16:05 cjbehm

An update that in fact we are only seeing this on 0.19.2 and .3 -- our original downgrade back to 0.17.5 didn't initially switch so we thought we were still seeing it with the older version.

cjbehm avatar Jun 03 '22 12:06 cjbehm

Because the error is about authentication, we decided to try updating back to 0.19.3 but changing the authentication from GH App to user+token.

Previously the error would occur after a few hours (regardless of activity levels), but so far after changing the authentication, we have not been receiving 401 responses when applying.

cjbehm avatar Jun 10 '22 13:06 cjbehm

We've now been running a 0.19.3 Atlantis version since June 9 using the user+token with no errors.

We'd very much like to use the GH App route again, but it seems fairly clear there's some sort of issue there.

cjbehm avatar Jun 14 '22 13:06 cjbehm

I'm regularly seeing this. I have to restart the Atlantis pod in k8s before it will work again. Would very much like to see this resolved. Could this be related to some credential that isn't being refreshed properly?

shadiramadan avatar Jul 08 '22 21:07 shadiramadan

I have also faced this issue and I started investigating it. I think the bug comes from the GitHub App implementation.

Installation access tokens have the permissions configured by the GitHub App and expire after one hour.

Source: GitHub App documentation.

The problem here is that Atlantis fetches the token only once when the server starts, and again whenever a new GitHub repository clone is made. Every time a new PR is opened, a new clone is made and the token is refreshed. If Atlantis stays "idle" (no new clones) for more than one hour, the last token expires and nothing replaces it, which is when this bug appears.

I am not familiar with the codebase, but these are the calls I found that led me to this conclusion:

My suggestion for this fix would be the following:

  1. Compute a timestamp at 55 minutes after the moment when the GitHub App token arrived and cache it;
  2. If a new refresh token call is made, compare the current timestamp with the cached timestamp and if the current timestamp is after the cached timestamp, perform the token refresh query (and also refresh the cached timestamp), else skip the call;
  3. Add a token refresh call every time a new plan or apply comment arrives.

I'd ask someone who also knows the codebase to confirm my findings and if everything is correct, I will open a PR to solve this issue.

valentindeaconu avatar Aug 02 '22 12:08 valentindeaconu

I'd ask someone who also knows the codebase to confirm my findings and if everything is correct, I will open a PR to solve this issue.

@jamengual can you, please, take a look over my response?

valentindeaconu avatar Aug 05 '22 08:08 valentindeaconu

We see the same issue with v0.19.9-pre.20220822, but surprisingly only when Atlantis is making GraphQL calls, never on normal plan/apply operations.

stasostrovskyi avatar Aug 25 '22 07:08 stasostrovskyi

@lilincmu this only happens when using a GitHub App.

jamengual avatar Aug 26 '22 04:08 jamengual

@valentindeaconu we are working on this

jamengual avatar Aug 26 '22 17:08 jamengual

https://github.com/runatlantis/atlantis/issues/2469

jamengual avatar Aug 26 '22 18:08 jamengual

This is indeed happening only when using GitHub App auth, which has a 1 hour token lifetime (https://docs.github.com/en/developers/apps/building-github-apps/authenticating-with-github-apps#authenticating-as-an-installation). The cause is that the token for the GraphQL calls is only created once, during the client initialization (https://github.com/runatlantis/atlantis/blob/96a25bdbcaeda3596ad5e5878923357b3ac474d2/server/events/vcs/github_client.go#L96).

The underlying ghinstallation library includes the ability to refresh tokens if they're near expiration. I just don't believe we have enough information once the client is created to be able to do that: we need the credentials and the GraphQL URL to be able to make those calls.

I thought about a couple of ways to handle this - the cleanest to me seems to be to remove the initialization of the graphql client from the GithubClient initialization flow, and do that on-the-fly as we're making graphql queries to ensure we always have an up-to-date token.

I'll put up a PR with an example setup for this shortly.

rayterrill avatar Aug 26 '22 21:08 rayterrill