[Github] `Error while checking for inaccessible repositories. Exception: 403` when trying to sync `private` repositories

Open spong opened this issue 1 year ago • 15 comments

Bug Description

I was trying to sync some internal documentation from the https://github.com/elastic/security-team repo, which is an Elastic private repository (not internal). If I specify the repo in the List of repositories field within the config, the sync fails with the following error:

Stack trace

[FMWK][22:44:49][ERROR] [Connector id: sdyRBZABSQy1BdxtPVqF, index name: github-docs, Sync job id: jeLCCpABSQy1BdxtYKnM] Error while checking for inaccessible repositories. Exception: 403, message='Forbidden', url=URL('https://api.github.com/graphql').
Traceback (most recent call last):
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 1361, in _get_invalid_repos_for_personal_access_token
    async for repo in self.github_client.get_org_repos(
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 926, in get_org_repos
    async for response in self.paginated_api_call(
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 853, in paginated_api_call
    response = await self.graphql(query=query, variables=variables)
  File "/Users/garrettspong/dev/connectors/connectors/utils.py", line 571, in wrapped
    raise e
  File "/Users/garrettspong/dev/connectors/connectors/utils.py", line 568, in wrapped
    return await func(*args, **kwargs)
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 779, in graphql
    return await self._get_client.graphql(
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/gidgethub/abc.py", line 264, in graphql
    status_code, response_headers, response_data = await self._request(
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/gidgethub/aiohttp.py", line 19, in _request
    async with self._session.request(
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/aiohttp/client.py", line 696, in _request
    resp.raise_for_status()
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1070, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 403, message='Forbidden', url=URL('https://api.github.com/graphql')
[FMWK][22:44:49][ERROR] [Connector id: sdyRBZABSQy1BdxtPVqF, index name: github-docs, Sync job id: jeLCCpABSQy1BdxtYKnM] 403, message='Forbidden', url=URL('https://api.github.com/graphql')
Traceback (most recent call last):
  File "/Users/garrettspong/dev/connectors/connectors/sync_job_runner.py", line 167, in execute
    await self.data_provider.validate_config()
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 1466, in validate_config
    await self._remote_validation()
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 1429, in _remote_validation
    await self._validate_configured_repos()
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 1456, in _validate_configured_repos
    invalid_repos = await self.get_invalid_repos()
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 1269, in get_invalid_repos
    return await self._get_invalid_repos_for_personal_access_token()
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 1361, in _get_invalid_repos_for_personal_access_token
    async for repo in self.github_client.get_org_repos(
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 926, in get_org_repos
    async for response in self.paginated_api_call(
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 853, in paginated_api_call
    response = await self.graphql(query=query, variables=variables)
  File "/Users/garrettspong/dev/connectors/connectors/utils.py", line 571, in wrapped
    raise e
  File "/Users/garrettspong/dev/connectors/connectors/utils.py", line 568, in wrapped
    return await func(*args, **kwargs)
  File "/Users/garrettspong/dev/connectors/connectors/sources/github.py", line 779, in graphql
    return await self._get_client.graphql(
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/gidgethub/abc.py", line 264, in graphql
    status_code, response_headers, response_data = await self._request(
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/gidgethub/aiohttp.py", line 19, in _request
    async with self._session.request(
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/aiohttp/client.py", line 696, in _request
    resp.raise_for_status()
  File "/Users/garrettspong/dev/connectors/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1070, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 403, message='Forbidden', url=URL('https://api.github.com/graphql')

To Reproduce

Steps to reproduce the behavior:

  1. Set up the Github Connector with the following configuration and sync:

Expected behavior

So long as the access token has access to the repo (which it does), the content should be synced.

Environment

Running Kibana main from source, ES via yarn es snapshot, and Github connector main from source as well.

Additional context

If you configure List of repositories to be *, and provide the repo filter via an Advanced Filter (below), syncing will work without issue.

Advanced Filter
[
  {
    "filter": {
      "pr": "is:pr  label:\"Team:Security Generative AI\""
    },
    "repository": "elastic/security-team"
  },
  {
    "filter": {
      "issue": "is:issue label:\"Team:Security Generative AI\""
    },
    "repository": "elastic/security-team"
  }
]
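
For anyone applying the same workaround to another repository or label, the two rules above can be generated rather than hand-edited. This is only a sketch: `build_label_filters` is a hypothetical helper, not part of the connector, and it assumes the rule shape is exactly the one shown above (a `filter` object with a `pr` or `issue` query, plus a `repository` key).

```python
import json


def build_label_filters(repository: str, label: str) -> list[dict]:
    """Build one PR rule and one issue rule for a single repo and label.

    Hypothetical helper: it mirrors the rule shape from the issue
    description and assumes the label contains no double quotes.
    """
    quoted_label = f'label:"{label}"'
    return [
        {"filter": {"pr": f"is:pr {quoted_label}"}, "repository": repository},
        {"filter": {"issue": f"is:issue {quoted_label}"}, "repository": repository},
    ]


rules = build_label_filters("elastic/security-team", "Team:Security Generative AI")
# Paste the printed JSON into the Advanced Filter editor.
print(json.dumps(rules, indent=2))
```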

spong avatar Jun 12 '24 22:06 spong

We are able to index documents from a private repository by specifying the repo in the list of repositories. However, we receive a forbidden error only when the rate limit is exceeded; this applies to both private and public repositories.

parthpuri-elastic avatar Jul 05 '24 12:07 parthpuri-elastic

@danajuratoni @artem-shelkovnikov Could you please check this & update?

khushbu-elastic avatar Jul 12 '24 06:07 khushbu-elastic

@spong can you give it a try again? If it does not work, can we pair to investigate it together?

artem-shelkovnikov avatar Jul 12 '24 11:07 artem-shelkovnikov

@spong does this work for you? If yes, can we close this issue? Also, the PR is merged to main so you can give it a try there as well; in the meantime we're raising a backport PR.

moxarth-rathod avatar Aug 12 '24 05:08 moxarth-rathod

Sorry @moxarth-elastic, I've been in-and-out on PTO and had been focused on some release items before then so didn't have a chance to confirm/repro. Just catching up on a few things now, but will test and confirm all is good here shortly 👍

spong avatar Aug 12 '24 16:08 spong

Just pulled the latest from elastic/connectors, then followed these instructions creating a new github connector within Kibana, updating the config.yml, then running make install/make run and I'm seeing the same error/issue:

image

After that error, if I go and update the List of repositories configuration from security-team to *, and then manually add the repo filter as detailed in the description, it syncs without issue:

image

Let me know if you need any more details or feel free to reach out on slack if you'd like to pair -- happy to help however I can 🙂

spong avatar Aug 12 '24 21:08 spong

We are experiencing the same situation with rate limits during the initial full sync. Full sync fails, incremental scans the same info, and sync fails in ~20 minutes.

Image

nekrich avatar Aug 30 '24 11:08 nekrich

@moxarth-elastic and I just paired and were able to reproduce on my machine running kibana/es/connectors all from source, on the main branch. In testing we actually saw some documents get ingested this time, but then it errored out with the same above error. Subsequent syncs failed before ingesting any data.

@moxarth-elastic tried reproducing using my same token using both 8.11 and 8.15 cloud deployments and running connectors main locally, and was unable to reproduce the error (all documents synced without issue), so seems this may only be an issue when running all three applications from source.

spong avatar Sep 04 '24 18:09 spong

Hi @nekrich, we've already fixed this issue in this PR: https://github.com/elastic/connectors/pull/2711. Did you try to run the connector against that one?

moxarth-rathod avatar Sep 05 '24 05:09 moxarth-rathod

That looks weird - as if Github started throttling you out or marked our connector as something breaching security?

Should we follow up with Github on that? @elastic/ingestion-team

artem-shelkovnikov avatar Sep 05 '24 08:09 artem-shelkovnikov

Should we follow up with Github on that?

Yes, please! Does this occur for native connectors as well, or only self-managed ones?

danajuratoni avatar Sep 09 '24 07:09 danajuratoni

@moxarth-elastic @spong reading a bit about throttling, the limits for api keys are quite strict (5000 requests per hour).

Was it possible to run a sync after an hour or so? Have you been able to see the rate limits for your account when syncing?
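
One way to answer the rate-limit question is to ask GitHub directly through its REST `GET /rate_limit` endpoint, which reports `limit`, `remaining` and `reset` per resource (including `core` and `graphql`). The sketch below uses only the Python standard library; the function names are made up for illustration and are not connector code, and the token is whatever personal access token the connector is configured with.

```python
import json
import urllib.request
from datetime import datetime, timezone

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(token: str) -> urllib.request.Request:
    """Build an authenticated GET request for GitHub's rate-limit endpoint."""
    return urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def summarize_rate_limits(payload: dict, resources=("core", "graphql")) -> list[str]:
    """Turn the endpoint's JSON payload into one readable line per resource."""
    lines = []
    for name in resources:
        info = payload.get("resources", {}).get(name)
        if info is None:
            continue  # resource not present in this payload
        reset_at = datetime.fromtimestamp(info["reset"], tz=timezone.utc)
        lines.append(
            f"{name}: {info['remaining']}/{info['limit']} remaining, "
            f"resets at {reset_at.isoformat()}"
        )
    return lines


def fetch_rate_limits(token: str) -> list[str]:
    """Call the endpoint and summarize the response (performs a network request)."""
    with urllib.request.urlopen(build_rate_limit_request(token), timeout=10) as resp:
        return summarize_rate_limits(json.load(resp))
```

Running `fetch_rate_limits(token)` right after a failed sync should show whether the `graphql` bucket is actually exhausted. If `remaining` is still well above zero, the 403 is probably not the primary rate limit.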

artem-shelkovnikov avatar Sep 09 '24 11:09 artem-shelkovnikov

If the problem were related to the rate limit, I should have gotten this error too, but I was able to ingest the documents of the private repo (security-team) with the same API token that @spong is using.

I even tested the connector against a Kibana setup on my local machine, but I could not reproduce the issue there either. In my case, the connector is working normally. Here is the log file for reference: github-privaterepo-with-organization.log

moxarth-rathod avatar Sep 09 '24 12:09 moxarth-rathod

I have a feeling it's something weird, maybe anti-abuse kicks in: https://github.com/orgs/community/discussions/24494?

Or, could be something related to local setup (routing, VPNs and such).

@moxarth-elastic - can we add more logs?

Specifically, good to log:

  1. Rate limits when we're rate-limited. We can output all rate-limit related info into debug logs so that we can see whether it's related or not
  2. For any non-200 API response, add a debug log that records what the Github API actually returned

This will help us understand better what's happening and submit a ticket to Github.
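
The two logging points above could look roughly like the sketch below. It is not the connector's code and not what any PR actually implemented; `rate_limit_info`, `classify_failure` and `log_failed_response` are hypothetical names. It relies on GitHub's documented response headers (`x-ratelimit-*`, `retry-after`) and on the documented behaviour that an exhausted primary limit comes back as 403 or 429 with `x-ratelimit-remaining: 0`. The secondary-limit check on the message text is a heuristic.

```python
import logging

logger = logging.getLogger("github_debug_sketch")

RATE_LIMIT_HEADERS = (
    "x-ratelimit-limit",
    "x-ratelimit-remaining",
    "x-ratelimit-used",
    "x-ratelimit-reset",
    "x-ratelimit-resource",
    "retry-after",
)


def rate_limit_info(headers) -> dict:
    """Collect the rate-limit related headers, matching names case-insensitively."""
    lowered = {str(k).lower(): v for k, v in headers.items()}
    return {name: lowered[name] for name in RATE_LIMIT_HEADERS if name in lowered}


def classify_failure(status: int, headers, body: str) -> str:
    """Make a best-effort guess at why GitHub rejected a request."""
    info = rate_limit_info(headers)
    if status in (403, 429):
        if info.get("x-ratelimit-remaining") == "0":
            return "primary-rate-limit"
        # Heuristic: secondary (anti-abuse) limits usually carry retry-after
        # or mention "secondary rate limit" in the error message.
        if "retry-after" in info or "secondary rate limit" in body.lower():
            return "secondary-rate-limit"
    if status == 403:
        return "forbidden-other"
    return "other"


def log_failed_response(status: int, headers, body: str) -> str:
    """Emit one debug line with everything GitHub said about a non-200 response."""
    reason = classify_failure(status, headers, body)
    logger.debug(
        "GitHub API returned %s (%s); rate-limit headers: %s; body: %s",
        status, reason, rate_limit_info(headers), body[:500],
    )
    return reason
```

With an `aiohttp.ClientResponseError` like the one in the stack trace, `e.status` and `e.headers` could be passed in. The response body is not carried by that exception, so capturing it would mean reading the response before `raise_for_status()` is called.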

artem-shelkovnikov avatar Sep 09 '24 15:09 artem-shelkovnikov

@artem-shelkovnikov Parth has added logs in this PR https://github.com/elastic/connectors/pull/2816, please take a look and drop a suggestion if any.

moxarth-rathod avatar Sep 12 '24 05:09 moxarth-rathod