[GitHub] GraphQL query to GitHub API regularly fails with 502 Bad Gateway error
Expected behavior
Cartography should fetch data from GitHub in a more efficient way.
Actual behavior
Fetching data from GitHub is slow. Requests to the GitHub API regularly fail with a 502 error, which triggers additional requests, timeouts, and sleeps.
To Reproduce
Have a GitHub organization with Go or PHP applications (with go.mod and composer.lock files).
Logs
INFO:cartography.intel.github.repos:Syncing GitHub repos
WARNING:cartography.intel.github.util:GitHub: Received 502 response. Reducing page size to 25 and retrying.
WARNING:cartography.intel.github.util:GitHub: Received 502 response. Reducing page size to 12 and retrying.
WARNING:cartography.intel.github.util:GitHub: Received 502 response. Reducing page size to 6 and retrying.
WARNING:cartography.intel.github.util:GitHub: Received 502 response. Reducing page size to 3 and retrying.
WARNING:cartography.intel.github.util:GitHub: Received 502 response. Reducing page size to 1 and retrying.
ERROR:cartography.intel.github.util:GitHub: Could not retrieve page of resource `repositories` due to HTTP error after 5 retries. Raising exception.
NoneType: None
ERROR:cartography.sync:Unhandled exception during sync stage 'github'
Traceback (most recent call last):
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/sync.py", line 152, in run
stage_func(neo4j_session, config)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/util.py", line 225, in timed
return method(*args, **kwargs)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/__init__.py", line 70, in start_github_ingestion
cartography.intel.github.repos.sync(
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/repos.py", line 1234, in sync
repos_json = get(github_api_key, github_url, organization)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/util.py", line 225, in timed
return method(*args, **kwargs)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/repos.py", line 303, in get
repos, _ = fetch_all(
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 184, in fetch_all
raise exc
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 154, in fetch_all
resp = fetch_page(token, api_url, organization, query, cursor, **kwargs)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 110, in fetch_page
response = call_github_api(query, gql_vars_json, token, api_url)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 75, in call_github_api
response.raise_for_status()
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/requests/models.py", line 1026, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://api.github.com/graphql
Traceback (most recent call last):
File "/var/cartography/.local/bin/cartography", line 10, in <module>
sys.exit(main())
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/cli.py", line 1405, in main
sys.exit(CLI(prog="cartography").main(argv))
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/cli.py", line 1380, in main
return cartography.sync.run_with_config(self.sync, config)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/sync.py", line 292, in run_with_config
return sync.run(neo4j_driver, config)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/sync.py", line 152, in run
stage_func(neo4j_session, config)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/util.py", line 225, in timed
return method(*args, **kwargs)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/__init__.py", line 70, in start_github_ingestion
cartography.intel.github.repos.sync(
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/repos.py", line 1234, in sync
repos_json = get(github_api_key, github_url, organization)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/util.py", line 225, in timed
return method(*args, **kwargs)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/repos.py", line 303, in get
repos, _ = fetch_all(
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 184, in fetch_all
raise exc
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 154, in fetch_all
resp = fetch_page(token, api_url, organization, query, cursor, **kwargs)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 110, in fetch_page
response = call_github_api(query, gql_vars_json, token, api_url)
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/cartography/intel/github/util.py", line 75, in call_github_api
response.raise_for_status()
File "/var/cartography/.local/share/uv/tools/cartography/lib/python3.10/site-packages/requests/models.py", line 1026, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://api.github.com/graphql
Please complete the following information:
- Cartography release version or commit hash: 0.121.0
- Python version: 3.10.19 (technically, we are using the Cartography Docker image)
- OS: the one used in the Cartography Docker image
Additional context
After some investigation and debugging sessions, I found that the root cause of this behavior is the following block in intel/github/repos.py:
    dependencyGraphManifests(first: 20) {
        nodes {
            blobPath
            dependencies(first: 100) {
                nodes {
                    packageName
                    requirements
                    packageManager
                }
            }
        }
    }
Without that block, the GraphQL API always returns the full response quickly at the original page size of 50 repos. I propose moving the extraction of this data to a standalone API query for every repo (the same way it is done for collaborators in the function _get_repo_collaborators_inner_func).
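To make the proposal more concrete, here is a minimal sketch of what a standalone per-repo dependency query could look like. The selected fields are copied from the block quoted above; the constant name, the two function names, and the `$owner`/`$name` variables are my own assumptions for illustration and are not taken from the Cartography codebase. The sketch only builds the request body and flattens a response, so it does not perform any network calls.

```python
import json
from typing import Any

# Hypothetical standalone query. The selected fields mirror the block quoted
# above; the variables and overall shape are assumptions for illustration.
REPO_DEPENDENCIES_GRAPHQL = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    dependencyGraphManifests(first: 20) {
      nodes {
        blobPath
        dependencies(first: 100) {
          nodes {
            packageName
            requirements
            packageManager
          }
        }
      }
    }
  }
}
"""


def build_dependency_request(owner: str, name: str) -> str:
    """Return the JSON body for one per-repo dependency request."""
    return json.dumps(
        {
            "query": REPO_DEPENDENCIES_GRAPHQL,
            "variables": {"owner": owner, "name": name},
        }
    )


def flatten_dependencies(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a GraphQL response into one dict per dependency.

    Null values at any level are treated as "no data" instead of raising.
    """
    repo = (response.get("data") or {}).get("repository") or {}
    manifests = (repo.get("dependencyGraphManifests") or {}).get("nodes") or []
    rows: list[dict[str, Any]] = []
    for manifest in manifests:
        if not manifest:
            continue
        deps = (manifest.get("dependencies") or {}).get("nodes") or []
        for dep in deps:
            if dep:
                rows.append({"manifest": manifest.get("blobPath"), **dep})
    return rows
```

With this split, a failure on one repo's dependency graph would only affect that repo's request, and the main repositories query could keep its original page size.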
An additional request: could we add an option for the GitHub intel module to skip dependency extraction? We are not actively using those dependencies because we have other systems for SCA. Something similar to the "aws-requested-syncs" option would work.
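As a rough illustration of the kind of option I have in mind, the sketch below uses a made-up flag, `--github-skip-dependencies`. That flag does not exist in Cartography today, and the parser and the `repo_fields` helper are hypothetical; the field names other than `dependencyGraphManifests` are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a hypothetical opt-out flag."""
    parser = argparse.ArgumentParser(prog="cartography-sketch")
    parser.add_argument(
        "--github-skip-dependencies",  # hypothetical flag, not a real option
        action="store_true",
        help="Do not request dependencyGraphManifests for GitHub repos.",
    )
    return parser


def repo_fields(skip_dependencies: bool) -> list[str]:
    """Return the repo fields to request, leaving out dependencies on demand."""
    fields = ["name", "url"]  # placeholder fields for illustration
    if not skip_dependencies:
        fields.append("dependencyGraphManifests")
    return fields
```

Calling `repo_fields(build_parser().parse_args().github_skip_dependencies)` would then decide whether the expensive block is part of the query at all.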
Hi @pvasilevich, thanks for filing this.
There are a few options for addressing this:
- Decompose the large GraphQL query. We use the GitHub GraphQL API because it tends to make it easier to retrieve lots of data without hitting limits, but in practice, when the limits are exceeded, we get 502s without much context. For the issue you're filing, splitting dependency graph fetching into a separate query should improve reliability.
- Allow selected resources for GitHub: similar to how we have --aws-requested-syncs for AWS, we'd like to implement this as a larger, cross-cutting change. We don't want to block fixing this issue on that work, though.
Other options ruled out
- Fall back to the REST API: REST API endpoints have very small rate limits, so this feels like a step backwards
- Additional retries/backoff: we already have this
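For reference, the retry behavior visible in the logs above (page size halved on each 502, giving up after 5 retries) can be approximated by the sketch below. It is reconstructed from the log lines only, not copied from cartography/intel/github/util.py; `BadGateway` and `fetch_with_shrinking_pages` are hypothetical names, and the default page size of 50 is the value mentioned in the report.

```python
from typing import Callable


class BadGateway(Exception):
    """Stand-in for an HTTP 502 response in this sketch."""


def fetch_with_shrinking_pages(
    fetch: Callable[[int], list],
    page_size: int = 50,
    max_retries: int = 5,
) -> list:
    """Call fetch(page_size), halving the page size after each 502.

    Re-raises the last BadGateway once max_retries retries have failed.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(page_size)
        except BadGateway:
            if attempt == max_retries:
                raise
            # Integer halving with a floor of 1: 50, 25, 12, 6, 3, 1
            page_size = max(1, page_size // 2)
    raise AssertionError("unreachable")
```

When a single repo's dependency graph is what triggers the 502, shrinking the page all the way to 1 still fails, which matches the traceback in the report.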
We would definitely appreciate help here and are happy to offer guidance.