ossinsight icon indicating copy to clipboard operation
ossinsight copied to clipboard

Sync GitHub repos information

Open Mini256 opened this issue 1 year ago • 4 comments

Description

Get GitHub Repos information through GitHub API and store it in the github_repos table to facilitate further analysis.

In addition, we can display the description information of the repository on the repository analysis page and the trending repos list, which is convenient for users to obtain and filter information faster when browsing the website.

github_repos table schema

+--------------------+--------------+------+-----+-------------------+-------+
| Field              | Type         | Null | Key | Default           | Extra |
+--------------------+--------------+------+-----+-------------------+-------+
| repo_id            | int(11)      | NO   | PRI | <null>            |       |
| repo_name          | varchar(150) | NO   | MUL | <null>            |       |
| owner_id           | int(11)      | NO   | MUL | <null>            |       |
| owner_login        | varchar(255) | NO   |     | <null>            |       |
| owner_is_org       | tinyint(1)   | NO   |     | <null>            |       |
| description        | varchar(512) | NO   |     |                   |       |
| primary_language   | varchar(32)  | NO   |     |                   |       |
| license            | varchar(32)  | NO   |     |                   |       |
| size               | bigint(20)   | NO   |     | 0                 |       |
| stars              | int(11)      | NO   | MUL | 0                 |       |
| forks              | int(11)      | NO   |     | 0                 |       |
| parent_repo_id     | int(11)      | YES  |     | <null>            |       |
| is_fork            | tinyint(1)   | NO   |     | 0                 |       |
| is_archived        | tinyint(1)   | NO   |     | 0                 |       |
| is_deleted         | tinyint(1)   | NO   |     | 0                 |       |
| latest_released_at | timestamp    | YES  |     | <null>            |       |
| pushed_at          | timestamp    | YES  |     | <null>            |       |
| created_at         | timestamp    | YES  |     | <null>            |       |
| updated_at         | timestamp    | YES  |     | <null>            |       |
| refreshed_at       | timestamp    | NO   |     | CURRENT_TIMESTAMP |       |
+--------------------+--------------+------+-----+-------------------+-------+

Steps

  • [x] Pull GitHub repos data from github_events table (Obtain the primary language of the repo from the github_events table)
  • [x] Sync repos data in batch by GitHub GraphQL API
  • [x] ~~Sync fork repos according to the non-fork repos by list repo forks API (Not applicable to repository with large forks)~~
  • [x] Marked deleted repos.
  • [x] Sync rest repos data in concurrent by REST API (This method is inefficient, just as a supplementary solution)
  • [ ] #912

Mini256 avatar Sep 09 '22 14:09 Mini256

I found some GitHub "users" created a large number of repositories, and most of them have been deleted, which can't be fetched by API.

gharchive_dev> SELECT
            ->     owner_id, ANY_VALUE(owner_login) AS owner_login, COUNT(1) AS repos
            -> FROM github_repos r
            -> WHERE updated_at IS NULL
            -> GROUP BY owner_id
            -> ORDER BY repos DESC
            -> LIMIT 100;
100 rows in set
Time: 6.186s

For example:

  • @direwolf-github has created more than 3 million repos in the past.
  • @codestar-github-bot-1 has created more than 1.49 million repos in the past.
+-----------+-----------------------------------------+---------+
| owner_id  | owner_login                             | repos   |
+-----------+-----------------------------------------+---------+
| 10810283  | direwolf-github                         | 3094769 |
| 53108186  | codestar-github-bot-1                   | 1492979 |
| 75841818  | shalini-devgit                          | 716358  |
| 1686007   | efarbereger                             | 461036  |
| 42283162  | aws-aemilia                             | 426637  |
| 52290251  | botcopado                               | 415988  |
| 50472621  | Likj128                                 | 409137  |
| 16960717  | idsqat1                                 | 171634  |
| 19253721  | pmamico                                 | 171172  |
| 51386346  | c3-github-integration-test-user-3       | 171072  |
| 29713643  | autotester-one                          | 166855  |
| 37852658  | aws-risk-analysis                       | 158363  |
| 46203446  | mathias7777                             | 128096  |
| 2804557   | dvcsconnectortest                       | 123299  |
| 87753227  | thatjohn0a                              | 121717  |
| 20675874  | isisliu                                 | 121612  |
| 85441621  | thatjohn01                              | 118020  |
| 87852462  | thathttp01                              | 114024  |
| 31313482  | srbbtest                                | 112461  |
| 18148588  | idsb3t1                                 | 108922  |
| 88047089  | sometim110                              | 107797  |
| 87852675  | thatfeel01                              | 105234  |
| 32371408  | codestar-bot                            | 104297  |
| 18310276  | favroautomation                         | 90209   |
| 7911893   | slskopytko                              | 88340   |
| 88182477  | yoobao55                                | 80997   |
| 20931201  | atomist-test-web                        | 70762   |
| 21099664  | atomist-web-test-staging                | 65375   |
| 15947550  | c3-github-integration-test-user         | 63508   |
| 22006877  | luis-ordering                           | 60877   |
| 91138975  | appsmithgitsynctest                     | 54150   |
| 30481173  | osiotestmachine                         | 53008   |
| 78093351  | newproject1                             | 52672   |
| 88618086  | YQjydnbj                                | 46360   |
| 88552392  | UTguzkiu                                | 46156   |
| 88569843  | Grvrlrkm                                | 46018   |
| 88953461  | L2scbmpn                                | 45983   |
| 88954328  | kdqqxmoq                                | 45859   |

The top 1000 owners have created more than 13.2 million repos.

gharchive_dev> SELECT SUM(repos) AS total
            -> FROM (
            ->   SELECT
            ->       owner_id, ANY_VALUE(owner_login) AS owner_login, COUNT(1) AS repos
            ->   FROM github_repos r
            ->   WHERE updated_at IS NULL
            ->   GROUP BY owner_id
            ->   ORDER BY repos DESC
            ->   LIMIT 1000
            -> ) sub;
+----------+
| total    |
+----------+
| 13285264 |
+----------+
1 row in set
Time: 21.423s

I will check out them and mark them is_deleted = 1 in batch if they have been deleted.

gharchive_dev> UPDATE github_repos SET is_deleted = 1 WHERE owner_id = 53108186 AND repo_id NOT IN(496340058, 496341317, 496337981, 496337632, 496341286, 496338008, 4
            -> 96341057, 496339354, 496338643);
Query OK, 1492979 rows affected
Time: 76.872s

Mini256 avatar Sep 21 '22 04:09 Mini256

When obtaining the repos created in 2017, the GitHub API probably returns an error. This will cause us to lose a lot of data.

 ERROR  Failed to fetch repos:  {
  q: 'fork:true created:2017-01-24T03:22:30.000+00:00..2017-01-24T03:45:00.000+00:00',
  cursor: 'Y3Vyc29yOjEwMA=='
} Request failed due to following response errors:
 - Something went wrong while executing your query. Please include `E322:7BD7:217F13F:2261FF6:634EC1F7` when reporting this issue.

  - Something went wrong while executing your query. Please include `E322:7BD7:217F13F:2261FF6:634EC1F7` when reporting this issue.
  at node_modules/.pnpm/@[email protected]/node_modules/@octokit/graphql/dist-src/graphql.js:48:19
  at async extractReposForTimeRange (scripts/sync-repos/syncer.ts:179:25)
  at async scripts/sync-repos/syncer.ts:45:36


 ERROR  Failed to fetch repos:  {
  q: 'fork:true created:2017-01-24T03:22:30.000+00:00..2017-01-24T03:45:00.000+00:00',
  cursor: 'Y3Vyc29yOjEwMA=='
} Request failed due to following response errors:
 - Something went wrong while executing your query. Please include `E324:5A7B:A10702D:A5CE8EA:634EC1FC` when reporting this issue.

  - Something went wrong while executing your query. Please include `E324:5A7B:A10702D:A5CE8EA:634EC1FC` when reporting this issue.
  at node_modules/.pnpm/@[email protected]/node_modules/@octokit/graphql/dist-src/graphql.js:48:19
  at async extractReposForTimeRange (scripts/sync-repos/syncer.ts:179:25)
  at async scripts/sync-repos/syncer.ts:45:36

Mini256 avatar Oct 18 '22 15:10 Mini256

 ERROR  Failed to batch load 475 records, retries: 2/2:  Out of range value for column 'size' at row 182

  at PromisePool.query (node_modules/.pnpm/[email protected]/node_modules/mysql2/promise.js:341:22)
  at BatchLoader.flush (app/core/BatchLoader.ts:86:40)
  at runMicrotasks (<anonymous>)
  at processTicksAndRejections (node:internal/process/task_queues:96:5)
  at async Timeout._onTimeout (app/core/BatchLoader.ts:59:21)


 ERROR  Failed to batch load 475 records, they will be lost:  Out of range value for column 'size' at row 182

  at PromisePool.query (node_modules/.pnpm/[email protected]/node_modules/mysql2/promise.js:341:22)
  at BatchLoader.flush (app/core/BatchLoader.ts:86:40)
  at runMicrotasks (<anonymous>)
  at processTicksAndRejections (node:internal/process/task_queues:96:5)
  at async Timeout._onTimeout (app/core/BatchLoader.ts:59:21)

Mini256 avatar Oct 18 '22 15:10 Mini256

Finished?

sykp241095 avatar Nov 09 '22 14:11 sykp241095