criticality_score Make github queries determinsitic

Fixes #33. Closes #76. (I think this supersedes it)

The current query method doesn't work well with GitHub's API because it makes queries that frequently return > 1,000 results. For GitHub to be deterministic, you have to fetch < 1,000 results at a time. This is not straightforward to do with stars, because stars obey a power law. The lower the star minimum in a query, the more likely you are to get lots of results, because there are many repos with not lower numbers of stars.

This PR does things:

Try to keep the total number of API calls made to GitHub small.
Dynamically adjust the min and max stars we request via the GitHub API to contra the size of result sets.

To do (1), we start by requesting repos with > 65,536 stars, and initially we just halve that number each time through the fetch loop. This rapidly ramps up the size of the query result sets.

To do (2), we also keep track of the size of result sets. In particular, we look at how much bigger the last result set was than the one before, and we extrapolate how big the next one will be based on that. It turns out that this is a good upper bound -- if you halve the minimum star count and the result set is k times larger than the one before, the following result set is typically < k times bigger than the most recent one. If the expected number of results is too big (i.e. greater than a threshold) we reduce the divisor so that the next window is half as big as it would've been.

With these changes I can:

Get repeatable results from GitHub
Eliminate the 3 redundant rounds the generate code used to do
Grab >5,000 results with ~14 API calls

I ran this a few times and got the same repositories in the result set each time. I do see star counts change from query to query. Because of that:

[ ] This may need to add a buffer (i.e. the star ranges in the queries should probably overlap by some fudge factor so we don't miss rapidly moving repos). I didn't see any missed repos when I tried it, though.

It's also possible that even with this steering, we can still get too many query results (e.g., if for some language, there are WAY more repos in some star range).

[ ] Right now, this just bails if it gets more than 1,000 results from GitHub. It should probably shrink the star window and retry if it gets close to 1,000, but that's not done yet.

Thanks to @mfridman and @coni2k for hints that led to this on #33.

@inferno-chromium, @coni2k, @mfridman what do you think?

Apr 10 '21 10:04 tgamblin

the reason for the repeated loops was also to cover race condition w.r.t star changes. while running the query, if stars change, you lose that repo altogether. for popular repo, it can be common, that is why i fixed a dozen where people found repos missing. i am fine with using more github api quota for more accurate results.

Apr 11 '21 00:04 inferno-chromium

@inferno-chromium: that can be fixed, as mentioned above, by making the queries overlap. I’ll add that.

Apr 11 '21 00:04 tgamblin

criticality_score criticality_score copied to clipboard

Make github queries determinsitic

criticality_score
criticality_score copied to clipboard