llnl.github.io
Potential errors when scraping new organization. Skipped repos.
I am running MASTER.sh to download all data from the NREL GitHub organization (which has 350 repos), but it is taking a very long time and I'm not sure whether this is normal. For most repositories in the org, the query returns in under a second. However, the script appears to be scraping over 4,000 repositories (possibly dependencies?).
For some repositories, it seems to take much longer and the script prints out warning-like messages such as:
Sending REST query...
Checking response...
HTTP/1.1 202 Accepted
API Status {"limit": 5000, "remaining": 4414, "reset": 1607114323}
Query accepted but not yet processed. Trying again in 3sec...
Also, for a very small minority of repos, I get the following error-like message:
GraphQL API error.
[{"path": ["repository", "dependencyGraphManifests"], "locations": [{"line": 1, "column": 244}], "message": "loading"}]
These two errors do not seem to occur simultaneously.
The script is still humming along, and I will let it finish, but am wondering if these errors can simply be ignored.
Update: The script has finished and I am able to view the data using the Jekyll dev server. However, it appears that at least 3 repositories (out of 350) were skipped.
Steps to reproduce:
- Remove all data from explore/github_data.
- Remove all repos and orgs from _explore/input_lists.json, and add "NREL" as an org.
- Create a Python environment and install dependencies from requirements.txt.
- Set the GITHUB_API_TOKEN environment variable.
- Run ./MASTER.sh
The update can take a long time. Our current daily update typically runs for about an hour.
The warning messages with the 202 Accepted response typically come from the commit activity query and are expected. That response means this data in particular requires additional internal processing on GitHub's side before it can respond. The initial query triggers that process, and the script then repeats the query after allowing time for it to finish, which returns the desired data. Commit activity specifically is then cached on GitHub's side, so responses are immediate for about 24 hours (roughly the rest of the day).
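The wait-and-repeat behavior described above can be sketched roughly as follows. This is not the script's actual code: `fetch_with_202_retry`, `send_query`, and `max_attempts` are hypothetical names, and the 3-second wait is taken from the log message quoted in the question. The HTTP call is passed in as a function so the sketch runs without network access.

```python
import time
from typing import Callable, Tuple


def fetch_with_202_retry(
    send_query: Callable[[], Tuple[int, str]],
    wait_seconds: float = 3.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Repeat a query until it stops returning 202 Accepted.

    send_query is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send_query()
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"Unexpected HTTP status {status}")
        # 202: the query was accepted, but the data is still being computed.
        if attempt < max_attempts:
            sleep(wait_seconds)
    raise TimeoutError(f"Still 202 after {max_attempts} attempts")


# Stand-in for the real HTTP call: two 202 responses, then the data.
responses = iter([(202, ""), (202, ""), (200, '[{"total": 5}]')])
waits = []  # records each wait instead of actually sleeping
result = fetch_with_202_retry(lambda: next(responses), sleep=waits.append)
```

Capping the number of attempts is a design choice in this sketch so that a repository whose statistics never become ready cannot stall the whole run; whether the real script has such a cap is not stated here.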
The generic GraphQL API error message means something went wrong on GitHub's side. Sometimes these are intermittent issues, in which case the script will attempt the query again. Other times, this can be caused by something like an empty repo. A closer examination of _explore/LAST_MASTER_UPDATE.log may reveal what is happening in these cases.
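One way to tell the intermittent case from a persistent one is to inspect the error list itself, as in the minimal sketch below. It assumes that a "loading" message (the one quoted in the question) is worth retrying; that is an assumption, not documented behavior. `TRANSIENT_MESSAGES`, `is_transient`, and `failing_fields` are hypothetical names and are not taken from the script.

```python
import json
from typing import Any, Dict, List

# Assumption: messages that suggest retrying later. Only "loading" comes
# from the error quoted in the question; treating it as transient is a guess.
TRANSIENT_MESSAGES = {"loading"}


def is_transient(errors: List[Dict[str, Any]]) -> bool:
    """True if there is at least one error and every error looks retryable."""
    return bool(errors) and all(
        e.get("message") in TRANSIENT_MESSAGES for e in errors
    )


def failing_fields(errors: List[Dict[str, Any]]) -> List[str]:
    """Dotted path of the query field that each error points at."""
    return [".".join(str(p) for p in e.get("path", [])) for e in errors]


# The error list exactly as printed in the question.
reported = json.loads(
    '[{"path": ["repository", "dependencyGraphManifests"], '
    '"locations": [{"line": 1, "column": 244}], "message": "loading"}]'
)

retry_later = is_transient(reported)
fields = failing_fields(reported)
```

For the reported error, `fields` points at the dependency graph part of the query, which suggests the rest of the repository data may be unaffected; the log file remains the place to confirm that.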