
Find workaround for GitHub API limitation of 1000 repos in results; expand sample set beyond 1000 by stars before sorting by criticality score

Open inferno-chromium opened this issue 4 years ago • 21 comments

inferno-chromium avatar Dec 13 '20 16:12 inferno-chromium

Is GHTorrent (developed by @gousiosg for software analytics) a potential source of data for the criticality score? See https://ghtorrent.org/. There are instructions for access to the database here: https://ghtorrent.org/raw.html.

Also, given the spirit of this project, GHTorrent seems like a project worth supporting :).

tgamblin avatar Dec 13 '20 22:12 tgamblin

Hi @tgamblin, thanks for the plug :-) GHTorrent is not suitable for this, as it may have data freshness and general quality issues.

gousiosg avatar Dec 13 '20 22:12 gousiosg

Depending on the threshold set, you can get quite a few repos from the GitHub search API. Is this currently the biggest hurdle?


Would external sites (with an API) for specific languages help? E.g., I built a website that aggregates GitHub data for the top Go repositories with >50 stars. It's ~15k repos and tracks ~9M stars (both increases and decreases).

The freshness of the data is 1 hour.

I'd imagine other websites such as bestofjs.org would have "fresh" data.


I've ported this Python project to Go, but it turns out that in its current implementation the number of API hits per single repo is quite high, esp. for the open/closed issue count. So I've now switched to optimizing it with a combination of REST v3 and GraphQL v4 (which, iirc, count against distinct rate limits).
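For reference, a minimal sketch of inspecting both quotas via the REST /rate_limit endpoint, which reports the REST ("core"), search, and GraphQL pools separately (the GITHUB_TOKEN handling here is illustrative):

```python
import os

import requests

# The /rate_limit endpoint reports each quota pool separately:
# "core" for REST v3, "search" for search, "graphql" for v4.
token = os.environ["GITHUB_TOKEN"]  # assumed env var, for illustration
resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": f"token {token}"},
)
resp.raise_for_status()
resources = resp.json()["resources"]
for pool in ("core", "search", "graphql"):
    r = resources[pool]
    print(f"{pool}: {r['remaining']}/{r['limit']} remaining, resets at {r['reset']}")
```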

My intention is to add a criticality score to the 15k Go repos tracked by said website.

mfridman avatar Dec 13 '20 23:12 mfridman

> Depending on the threshold set, you can get quite a few repos from the GitHub search API. Is this currently the biggest hurdle?

I am thinking about workarounds, maybe partitioning sets by star count, etc. How are you doing this on your website?

> Would external sites (with an API) for specific languages help? E.g., I built a website that aggregates GitHub data for the top Go repositories with >50 stars. It's ~15k repos and tracks ~9M stars (both increases and decreases).

> The freshness of the data is 1 hour.

Sure, can you expose an API endpoint that returns a GitHub repo list (e.g. kubernetes/kubernetes) for a given language, sorted by stars?

> I'd imagine other websites such as bestofjs.org would have "fresh" data.

> I've ported this Python project to Go, but it turns out that in its current implementation the number of API hits per single repo is quite high, esp. for the open/closed issue count. So I've now switched to optimizing it with a combination of REST v3 and GraphQL v4 (which, iirc, count against distinct rate limits).

> My intention is to add a criticality score to the 15k Go repos tracked by said website.

This is really exciting, thanks @mfridman for working on this.

inferno-chromium avatar Dec 14 '20 03:12 inferno-chromium

> I am thinking about workarounds, maybe partitioning sets by star count, etc. How are you doing this on your website?

To populate the "repo list" I use the GraphQL API search (and then parse/store the results in a DB). The neat thing is that the search query can return the repository information (and a wealth of other edges), so you're not going back to the API for more data.

Pseudo GraphQL query:

```graphql
{
  rateLimit {
    cost
    remaining
    resetAt
    used
  }
  repo_owner_search: search(query: "language:go stars:>=50 created:2008..2012", type: REPOSITORY, first: 100) {
    # GraphQL fields like repo, owner, issues.
    # The more complex the query, the higher the cost. Careful! Only grab what you need.
  }
}
```

Note, the GitHub search query will return at most 1000 results TOTAL, in batches of 100. It's very important to check that the supplied query produced <1000 results; otherwise you risk gaps.
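A minimal sketch of that check in Python, hitting the GraphQL endpoint directly (the GITHUB_TOKEN env var and the helper name are assumptions):

```python
import os

import requests

GRAPHQL_URL = "https://api.github.com/graphql"
COUNT_QUERY = """
query($q: String!) {
  search(query: $q, type: REPOSITORY, first: 100) {
    repositoryCount
    nodes { ... on Repository { nameWithOwner stargazerCount } }
  }
}
"""

def query_is_safe(q: str, token: str) -> bool:
    """True if the query matches <1000 repos and can be fully paginated."""
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": COUNT_QUERY, "variables": {"q": q}},
        headers={"Authorization": f"bearer {token}"},
    )
    resp.raise_for_status()
    count = resp.json()["data"]["search"]["repositoryCount"]
    # GitHub search silently caps pagination at 1000 results, so any
    # query matching >=1000 repos will have gaps.
    return count < 1000

print(query_is_safe("language:go stars:>=50 created:2008..2012",
                    os.environ["GITHUB_TOKEN"]))
```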

There are two implementations:

  1. First pass

This is a crude scan that starts at a defined year (2008, specific to Go; this is the first year a repo appears) and goes up to 2012 in one-year increments (the results are guaranteed to be <1000). After that, the search drops down to 6-month increments, checking on each run that the results are <1000 (if >=1000, reset the cursor and try 3 months, etc., until <1000 results).

  2. Subsequent fetches

On subsequent fetches we roughly know what data to expect, so we build dynamic queries that target 90% of the result cap (~900). Why not 100%? To leave room for unexpected repos that may have reached >50 stars, repos whose status flipped from public to private, etc.

Lastly, when building dynamic date ranges there might be a "gap" between page dates, for example:

|start|---|end| ..gap.. |start|---|end|

So it's best to use the prior end date as the start date for all queries except the first.
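A minimal sketch of that windowing logic (the `count_results` callback is a hypothetical stand-in for a search API call; the window splitting and the gap-free handoff are the point):

```python
from datetime import date, timedelta

def date_windows(start, end, count_results, max_results=1000):
    """Yield (window_start, window_end) date ranges whose result counts
    stay under max_results, halving the window whenever a count overflows.
    count_results(a, b) stands in for a search API call returning the
    number of repos created between a and b."""
    step = timedelta(days=365)            # start with one-year increments
    cur = start
    while cur < end:
        win_end = min(cur + step, end)
        while count_results(cur, win_end) >= max_results and step.days > 1:
            step //= 2                    # too many hits: shrink the window
            win_end = min(cur + step, end)
        yield cur, win_end
        cur = win_end                     # prior end becomes the next start: no gaps

# Toy run with a fake counter (pretend 5 repos are created per day):
fake = lambda a, b: (b - a).days * 5
for lo, hi in date_windows(date(2008, 1, 1), date(2012, 12, 31), fake):
    print(lo, "->", hi)
```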


The second search completes ~15k repos in 436 s (~7.3 min), and iirc this approach uses very little of the API limit.

I did notice that the GitHub API sometimes fails to return repos in the search results. I suspect they have data loaders on their GraphQL backend, and these are notorious for resulting in missing data.


Sorry, this got a bit long and very focused on Go (it's a side project limited by time and resources). So the best I can offer is an API that exposes all the data I have, but again, specific to Go only.

I'm quite excited, because (although not perfect) this criticality score will add another data point and it'll be interesting to inspect the intersection of stars, criticality score, dependency graph, etc. for reporting on critical projects.

Really awesome stuff :)

mfridman avatar Dec 14 '20 05:12 mfridman

Reopening till the lists are regenerated; thanks to the slow API, it will take some time, probably this week.

inferno-chromium avatar Dec 14 '20 23:12 inferno-chromium

Python list is regenerated; urllib3 now shows up in the top 200. Working on the Java list and others next.

inferno-chromium avatar Dec 15 '20 16:12 inferno-chromium

@inferno-chromium Probably outside the scope of this ~~PR~~ issue, but do you think it'd be possible to record the number of API hits required to generate a score per repo? Maybe add this to the list itself?

Knowing this number would help optimize queries. Thoughts?

mfridman avatar Dec 15 '20 16:12 mfridman

Conda shows up in the new Python data, but Spack (which has a higher criticality score, at 0.78) still doesn't. Spack has 1.9k stars (https://github.com/spack/spack). What's the star count of the 5000th package?

tgamblin avatar Dec 15 '20 16:12 tgamblin

> https://github.com/spack/spack

It is 1481. There is some bug in the GitHub search API; it sometimes misses results, so painful. Reran generate again and it does show up in the list. Pain, will debug more.

inferno-chromium avatar Dec 15 '20 18:12 inferno-chromium

> It sometimes misses results, so painful. Reran generate again and it does show up in the list. Pain, will debug more.

Sorry this is so painful!

tgamblin avatar Dec 15 '20 18:12 tgamblin

Java, C, Python, and PHP lists fixed too; the rest will be fixed by end of week. The generate script is working fine, e.g. in Java, Maven, Gradle, and GeoTools now show up correctly. Closing.

inferno-chromium avatar Dec 17 '20 16:12 inferno-chromium

@inferno-chromium: looks like https://github.com/spack/spack still doesn't show up in the Python list.

tgamblin avatar Jan 05 '21 05:01 tgamblin

@tgamblin - can you play around with `python3 -u -m criticality_score.generate --language python --output-dir=/tmp`? I don't see spack coming out of the GitHub search API; there is some weird bug. Would appreciate any debugging help; check generate.py and the function `get_github_repo_urls_for_language`.

inferno-chromium avatar Jan 05 '21 06:01 inferno-chromium

@inferno-chromium: OK, looking into it. I can replicate the issue (and yes, the search results are really weirdly ordered).

While I'm doing that, I noticed that by ignoring the term "book", you're ignoring every OSS project from Facebook, as well as various Jupyter notebook projects, so maybe there's a better way to do the ignoring. Or maybe the Facebook thing is intentional 😬. Hope I'm not outing any nefarious plans at Google 😉.

tgamblin avatar Jan 05 '21 08:01 tgamblin

So, looking at the search results from GitHub, I can't get Spack to appear in a search unless I specifically ask for `sp in:name` or `spack in:name`. It's consistently not in the results without criteria like that.

Also, I think the algorithm used to handle chunks of query results may not handle the variability in GitHub's results. Right now the search looks for all repos, processes the first 1,000, looks at the last star count, adds 100, and searches for repos with stars less than that number. But if you look at the variability of the results, here's what happens.

[plots: star counts of the first ~1250 search results, and with the constraint stars:<2000]

The first plot shows the star counts for the first 1250 or so search results -- note that it doesn't decrease monotonically. The second plot shows star counts for search results with the constraint stars:<2000. It's pretty horrible. Note that the variability is higher than the 100-star fudge factor you're using, so that might end up inadvertently causing some results to be missed -- hard to tell. I can't get Spack to appear regardless of where I start the star count, though; it seems like I have to add at least `sp in:name` to get it to show up.
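For reference, the chunking described above amounts to roughly this (a paraphrase of the described logic, not the actual generate.py code; `search_repos` is a hypothetical helper returning up to 1000 repos sorted by stars descending, optionally constrained to stars:<upper):

```python
def collect_repos(search_repos, min_stars=50, fudge=100):
    """Paraphrase of the chunked scan: take a page of up to 1000 repos
    sorted by stars descending, then requery below (last star count +
    fudge). If the ordering jitters by more than `fudge` stars, repos
    can fall through the cracks."""
    seen = {}
    upper = None                              # first query: no constraint
    while True:
        batch = search_repos(upper)
        if not batch:
            return list(seen.values())
        for repo in batch:
            seen[repo["name"]] = repo         # dedupe across overlapping windows
        last_stars = batch[-1]["stars"]
        if last_stars <= min_stars:
            return list(seen.values())
        upper = last_stars + fudge            # overlap window to tolerate jitter
```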

I'll keep digging a bit.

tgamblin avatar Jan 05 '21 08:01 tgamblin

> @inferno-chromium: OK, looking into it. I can replicate the issue (and yes, the search results are really weirdly ordered).

> While I'm doing that, I noticed that by ignoring the term "book", you're ignoring every OSS project from Facebook, as well as various Jupyter notebook projects, so maybe there's a better way to do the ignoring. Or maybe the Facebook thing is intentional 😬. Hope I'm not outing any nefarious plans at Google 😉.

That was unintentional; removed "book" in https://github.com/ossf/criticality_score/commit/3143a8b5d73e838e30ec6c65aecce7f5ff7dca76

inferno-chromium avatar Jan 05 '21 16:01 inferno-chromium

> So, looking at the search results from GitHub, I can't get Spack to appear in a search unless I specifically ask for `sp in:name` or `spack in:name`. It's consistently not in the results without criteria like that.

> Also, I think the algorithm used to handle chunks of query results may not handle the variability in GitHub's results. Right now the search looks for all repos, processes the first 1,000, looks at the last star count, adds 100, and searches for repos with stars less than that number. But if you look at the variability of the results, here's what happens.

> [plots: star counts of the first ~1250 search results, and with the constraint stars:<2000]

> The first plot shows the star counts for the first 1250 or so search results -- note that it doesn't decrease monotonically. The second plot shows star counts for search results with the constraint stars:<2000. It's pretty horrible. Note that the variability is higher than the 100-star fudge factor you're using, so that might end up inadvertently causing some results to be missed -- hard to tell. I can't get Spack to appear regardless of where I start the star count, though; it seems like I have to add at least `sp in:name` to get it to show up.

> I'll keep digging a bit.

If you have any other ideas to try, happy to accept patches.

inferno-chromium avatar Jan 05 '21 17:01 inferno-chromium

@inferno-chromium & @tgamblin I modified `generate.py` to search for repos using star ranges with upper & lower limits.

For instance, Spack has 1906 stars at the moment. When I search in that range, I can see that it appears in the list: https://github.com/search?o=desc&p=4&q=archived%3Afalse+stars%3A1905..1906&s=stars&type=Repositories

I think somehow giving a range like this returns consistent results.

I'm scanning from top to bottom and have reached 25K results (318K down to 1045 stars). I can see that Spack is already in the list: https://github.com/coni2k/criticality_score/blob/main/results/all_25000.csv

If you want to look at the code (I'm changing upper & lower manually at the moment): https://github.com/coni2k/criticality_score/blob/main/criticality_score/generate.py#L69
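In outline, the range-based scan looks something like this (a sketch of the idea rather than the linked code; `count_and_fetch` is a hypothetical wrapper that returns the match count and repos for stars:lower..upper):

```python
def scan_by_star_range(count_and_fetch, top_stars, floor=50, max_results=1000):
    """Walk from the most-starred repos downward using explicit
    stars:lower..upper ranges, narrowing any range that overflows."""
    repos = []
    upper = top_stars
    while upper >= floor:
        lower = max(floor, upper // 2)          # initial guess for the range
        count, batch = count_and_fetch(lower, upper)
        while count >= max_results and lower < upper:
            lower = (lower + upper + 1) // 2    # overflow: raise the lower bound
            count, batch = count_and_fetch(lower, upper)
        repos.extend(batch)
        upper = lower - 1                       # continue below this range
    return repos
```

Explicit ranges keep every query under the 1000-result cap, at the cost of more queries for densely populated star counts.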

If you think this is a better approach, I can improve the code and send you a PR.

coni2k avatar Jan 16 '21 18:01 coni2k

> @inferno-chromium & @tgamblin I modified `generate.py` to search for repos using star ranges with upper & lower limits.

> For instance, Spack has 1906 stars at the moment. When I search in that range, I can see that it appears in the list: https://github.com/search?o=desc&p=4&q=archived%3Afalse+stars%3A1905..1906&s=stars&type=Repositories

> I think somehow giving a range like this returns consistent results.

> I'm scanning from top to bottom and have reached 25K results (318K down to 1045 stars). I can see that Spack is already in the list: https://github.com/coni2k/criticality_score/blob/main/results/all_25000.csv

> If you want to look at the code (I'm changing upper & lower manually at the moment): https://github.com/coni2k/criticality_score/blob/main/criticality_score/generate.py#L69

> If you think this is a better approach, I can improve the code and send you a PR.

Sure @coni2k, thanks for looking into this. If the limits help to remove the variability, that is nice. Ideally we want to get rid of this repetition as well: https://github.com/ossf/criticality_score/blob/main/criticality_score/generate.py#L135. Will wait for your PR; I am curious how you solve this (https://github.com/coni2k/criticality_score/blob/main/criticality_score/generate.py#L58), since we can't ask for the upper bound. Also, check out this year-based suggestion: https://github.com/ossf/criticality_score/issues/33#issuecomment-744175831

inferno-chromium avatar Jan 16 '21 19:01 inferno-chromium

@inferno-chromium I was just playing with the queries; it's not only adding a lower & upper range, but giving a "bigger than" filter on stars also makes it consistent.

For instance, this keeps returning the same results: https://github.com/search?q=archived%3Afalse+stars%3A%3E50&type=Repositories&s=stars

The smaller the number, the more problematic it gets.

"Stars > 0" keeps giving different results, probably similar to not having stars filter: https://github.com/search?q=archived%3Afalse+stars%3A%3E0&type=Repositories&s=stars https://github.com/search?q=archived%3Afalse&type=Repositories&s=stars

"Stars > 1" is better but still flaky: https://github.com/search?q=archived%3Afalse+stars%3A%3E1&type=Repositories&s=stars

It's probably better to have a fixed lower limit for stars and to always look at the repos above that number. Something between 10 & 100?

I need to look into the implementation, but probably what we can do is make an initial query using "bigger than" (stars > 50) to find the upper limit (read the stars of the first repo), and then start making the actual queries using ranges (50 ~ 9000, 50 ~ 7500, 50 ~ 5000, ...).
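As a rough sketch of that flow (query strings only; `first_repo_stars` is a hypothetical helper that runs the probe query and reads the top repo's star count):

```python
def build_star_range_queries(first_repo_stars, floor=50, shrink=0.8):
    """Sketch of the proposed flow: one stars:>floor probe to find the
    upper limit, then descending stars:lower..upper range queries."""
    upper = first_repo_stars(f"archived:false stars:>{floor}")
    queries = []
    while upper > floor:
        lower = max(floor, int(upper * shrink))
        queries.append(f"archived:false stars:{lower}..{upper}")
        upper = lower - 1
    return queries
```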

What do you think?

coni2k avatar Jan 17 '21 19:01 coni2k