ossinsight
ossinsight copied to clipboard
Sync GitHub repos information
Description
Get GitHub Repos information through GitHub API and store it in the github_repos
table to facilitate further analysis.
In addition, we can display the description information of the repository on the repository analysis page and the trending repos list, which is convenient for users to obtain and filter information faster when browsing the website.
github_repos
table schema
+--------------------+--------------+------+-----+-------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+-------------------+-------+
| repo_id | int(11) | NO | PRI | <null> | |
| repo_name | varchar(150) | NO | MUL | <null> | |
| owner_id | int(11) | NO | MUL | <null> | |
| owner_login | varchar(255) | NO | | <null> | |
| owner_is_org | tinyint(1) | NO | | <null> | |
| description | varchar(512) | NO | | | |
| primary_language | varchar(32) | NO | | | |
| license | varchar(32) | NO | | | |
| size | bigint(20) | NO | | 0 | |
| stars | int(11) | NO | MUL | 0 | |
| forks | int(11) | NO | | 0 | |
| parent_repo_id | int(11) | YES | | <null> | |
| is_fork | tinyint(1) | NO | | 0 | |
| is_archived | tinyint(1) | NO | | 0 | |
| is_deleted | tinyint(1) | NO | | 0 | |
| latest_released_at | timestamp | YES | | <null> | |
| pushed_at | timestamp | YES | | <null> | |
| created_at | timestamp | YES | | <null> | |
| updated_at | timestamp | YES | | <null> | |
| refreshed_at | timestamp | NO | | CURRENT_TIMESTAMP | |
+--------------------+--------------+------+-----+-------------------+-------+
Steps
- [x] Pull GitHub repos data from
github_events
table (Obtain the primary language of the repo from thegithub_events
table) - [x] Sync repos data in batch by GitHub GraphQL API
- [x] ~~Sync fork repos according to the non-fork repos by list repo forks API (Not applicable to repository with large forks)~~
- [x] Marked deleted repos.
- [x] Sync rest repos data in concurrent by REST API (This method is inefficient, just as a supplementary solution)
- [ ] #912
I found some GitHub "users" created a large number of repositories, and most of them have been deleted, which can't be fetched by API.
gharchive_dev> SELECT
-> owner_id, ANY_VALUE(owner_login) AS owner_login, COUNT(1) AS repos
-> FROM github_repos r
-> WHERE updated_at IS NULL
-> GROUP BY owner_id
-> ORDER BY repos DESC
-> LIMIT 100;
100 rows in set
Time: 6.186s
For example:
- @direwolf-github has created more than 3 million repos in the past.
- @codestar-github-bot-1 has created more than 1.49 million repos in the past.
+-----------+-----------------------------------------+---------+
| owner_id | owner_login | repos |
+-----------+-----------------------------------------+---------+
| 10810283 | direwolf-github | 3094769 |
| 53108186 | codestar-github-bot-1 | 1492979 |
| 75841818 | shalini-devgit | 716358 |
| 1686007 | efarbereger | 461036 |
| 42283162 | aws-aemilia | 426637 |
| 52290251 | botcopado | 415988 |
| 50472621 | Likj128 | 409137 |
| 16960717 | idsqat1 | 171634 |
| 19253721 | pmamico | 171172 |
| 51386346 | c3-github-integration-test-user-3 | 171072 |
| 29713643 | autotester-one | 166855 |
| 37852658 | aws-risk-analysis | 158363 |
| 46203446 | mathias7777 | 128096 |
| 2804557 | dvcsconnectortest | 123299 |
| 87753227 | thatjohn0a | 121717 |
| 20675874 | isisliu | 121612 |
| 85441621 | thatjohn01 | 118020 |
| 87852462 | thathttp01 | 114024 |
| 31313482 | srbbtest | 112461 |
| 18148588 | idsb3t1 | 108922 |
| 88047089 | sometim110 | 107797 |
| 87852675 | thatfeel01 | 105234 |
| 32371408 | codestar-bot | 104297 |
| 18310276 | favroautomation | 90209 |
| 7911893 | slskopytko | 88340 |
| 88182477 | yoobao55 | 80997 |
| 20931201 | atomist-test-web | 70762 |
| 21099664 | atomist-web-test-staging | 65375 |
| 15947550 | c3-github-integration-test-user | 63508 |
| 22006877 | luis-ordering | 60877 |
| 91138975 | appsmithgitsynctest | 54150 |
| 30481173 | osiotestmachine | 53008 |
| 78093351 | newproject1 | 52672 |
| 88618086 | YQjydnbj | 46360 |
| 88552392 | UTguzkiu | 46156 |
| 88569843 | Grvrlrkm | 46018 |
| 88953461 | L2scbmpn | 45983 |
| 88954328 | kdqqxmoq | 45859 |
The top 1000 owners have created more than 13.2 million repos.
gharchive_dev> SELECT SUM(repos) AS total
-> FROM (
-> SELECT
-> owner_id, ANY_VALUE(owner_login) AS owner_login, COUNT(1) AS repos
-> FROM github_repos r
-> WHERE updated_at IS NULL
-> GROUP BY owner_id
-> ORDER BY repos DESC
-> LIMIT 1000
-> ) sub;
+----------+
| total |
+----------+
| 13285264 |
+----------+
1 row in set
Time: 21.423s
I will check out them and mark them is_deleted = 1
in batch if they have been deleted.
gharchive_dev> UPDATE github_repos SET is_deleted = 1 WHERE owner_id = 53108186 AND repo_id NOT IN(496340058, 496341317, 496337981, 496337632, 496341286, 496338008, 4
-> 96341057, 496339354, 496338643);
Query OK, 1492979 rows affected
Time: 76.872s
When obtaining the repos created in 2017, the GitHub API probably returns an error. This will cause us to lose a lot of data.
ERROR Failed to fetch repos: {
q: 'fork:true created:2017-01-24T03:22:30.000+00:00..2017-01-24T03:45:00.000+00:00',
cursor: 'Y3Vyc29yOjEwMA=='
} Request failed due to following response errors:
- Something went wrong while executing your query. Please include `E322:7BD7:217F13F:2261FF6:634EC1F7` when reporting this issue.
- Something went wrong while executing your query. Please include `E322:7BD7:217F13F:2261FF6:634EC1F7` when reporting this issue.
at node_modules/.pnpm/@[email protected]/node_modules/@octokit/graphql/dist-src/graphql.js:48:19
at async extractReposForTimeRange (scripts/sync-repos/syncer.ts:179:25)
at async scripts/sync-repos/syncer.ts:45:36
ERROR Failed to fetch repos: {
q: 'fork:true created:2017-01-24T03:22:30.000+00:00..2017-01-24T03:45:00.000+00:00',
cursor: 'Y3Vyc29yOjEwMA=='
} Request failed due to following response errors:
- Something went wrong while executing your query. Please include `E324:5A7B:A10702D:A5CE8EA:634EC1FC` when reporting this issue.
- Something went wrong while executing your query. Please include `E324:5A7B:A10702D:A5CE8EA:634EC1FC` when reporting this issue.
at node_modules/.pnpm/@[email protected]/node_modules/@octokit/graphql/dist-src/graphql.js:48:19
at async extractReposForTimeRange (scripts/sync-repos/syncer.ts:179:25)
at async scripts/sync-repos/syncer.ts:45:36
ERROR Failed to batch load 475 records, retries: 2/2: Out of range value for column 'size' at row 182
at PromisePool.query (node_modules/.pnpm/[email protected]/node_modules/mysql2/promise.js:341:22)
at BatchLoader.flush (app/core/BatchLoader.ts:86:40)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async Timeout._onTimeout (app/core/BatchLoader.ts:59:21)
ERROR Failed to batch load 475 records, they will be lost: Out of range value for column 'size' at row 182
at PromisePool.query (node_modules/.pnpm/[email protected]/node_modules/mysql2/promise.js:341:22)
at BatchLoader.flush (app/core/BatchLoader.ts:86:40)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async Timeout._onTimeout (app/core/BatchLoader.ts:59:21)
Finished?