useful-forks.github.io

The queried repository has too many forks

Open stdedos opened this issue 4 years ago • 6 comments

I have hit this unfortunate message:

It's also possible that the queried repository has so many forks that it's impossible to scan it completely without running out of API calls. :(

while scanning https://github.com/dbeaver/dbeaver.

I would gladly give up some Local Storage in order to be able to fully scan a repo.

I will just lay out my whole thought process, even though it might be complicated to implement it all in one go:

  1. Keep the API-calls counter (#7), the number of API calls still available, and the API limit refresh time
  2. Store all inactive forks in a cache for some time (e.g. 1 week). Add a tooltip on the side of the page: "Skipping x forks because they are empty [reset cache]"
  3. When the number of API calls still available reaches zero, start the API limit refresh timer as a countdown
  4. Store the current results in the cache, with their fetch timestamp (e.g. keep for 1 day)
  5. Add/Enable a [continue fetching] button when the [API limit refresh time reaches zero | it is detected that there are API calls available again] (see the rough sketch after this list)
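
Very roughly, the kind of state I have in mind could be kept in Local Storage like this (a sketch only; every key and function name below is invented just to illustrate the idea):

```js
// Sketch only: persist scan state in Local Storage so a scan can resume later.
// All keys and field names here are invented for illustration.
const STATE_KEY = 'useful-forks/scan-state';
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function saveScanState(repo, emptyForks, partialResults, rateLimitResetEpoch) {
  const state = {
    repo,                   // e.g. "dbeaver/dbeaver"
    savedAt: Date.now(),
    emptyForks,             // forks known to be empty -> skip them for ~1 week
    partialResults,         // rows already fetched (kept for ~1 day)
    rateLimitResetEpoch,    // when GitHub says the quota refreshes (epoch seconds)
  };
  localStorage.setItem(`${STATE_KEY}:${repo}`, JSON.stringify(state));
}

function loadScanState(repo) {
  const raw = localStorage.getItem(`${STATE_KEY}:${repo}`);
  if (!raw) return null;
  const state = JSON.parse(raw);
  if (Date.now() - state.savedAt > ONE_WEEK_MS) {   // stale entry: drop it
    localStorage.removeItem(`${STATE_KEY}:${repo}`);
    return null;
  }
  return state;
}

// The [continue fetching] button would only be enabled once the reset time has passed.
function canResume(state) {
  return Date.now() / 1000 >= state.rateLimitResetEpoch;
}
```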

and lastly,

I don't know how feasible it is (it doesn't look like it), but server-side caching shouldn't be too heavy (I don't think that table would be too much to store as JSON), and it might help "fight" the scourge of empty forks.

stdedos avatar Feb 25 '21 22:02 stdedos

Yes, this is a very unfortunate limitation of the tool. I contacted GitHub to try to figure out what can be done, but so far it seems like there isn't an easy solution.

The problem is that I don't really have access to a persistence layer (the server-side statefulness you talk about): the website is just a static page generated by GitHub's servers. The only persistence we can do is locally.

The easiest solution I can think of would be to stall the API requests for a whole hour when the limit is reached. Then, when that countdown expires, the algorithm could keep running from where it paused.
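
As a rough sketch (assuming an authenticated Octokit instance; the helper name is made up), that could look like:

```js
// Sketch only: if the quota is exhausted, wait until GitHub's reset time before continuing.
// The `x-ratelimit-*` response headers are real; everything else is a placeholder.
async function requestAndMaybeStall(octokit, route, params) {
  const response = await octokit.request(route, params);
  const remaining = Number(response.headers['x-ratelimit-remaining']);
  if (remaining === 0) {
    const resetEpoch = Number(response.headers['x-ratelimit-reset']); // epoch seconds
    const waitMs = Math.max(0, resetEpoch * 1000 - Date.now());
    // In the UI, this would be shown as a visible countdown rather than a silent wait.
    await new Promise(resolve => setTimeout(resolve, waitMs));
  }
  return response;
}
```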

Another potential solution would be to include more optional URL parameters which would guide the algorithm in terms of where it begins its scan, but that's also error-prone and clumsy to implement.
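
For instance (the startPage parameter here is purely hypothetical):

```js
// Sketch only: a hypothetical `startPage` query parameter to skip ahead in the scan.
const urlParams = new URLSearchParams(window.location.search);
const startPage = Number(urlParams.get('startPage') ?? '1');
// The fork-listing loop would then begin at `startPage` instead of page 1, which is
// exactly why this is error-prone: the user has to guess a sensible value themselves.
```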

But I think the real question here is: are users really willing to learn a (rather frustrating and cumbersome) process to complete their forks-scan? Anything that involves a "please come back later" solution sounds like it isn't necessarily worth coding.

payne911 avatar Feb 25 '21 23:02 payne911

But I think the real question here is: are users really willing to learn a (rather frustrating and cumbersome) process to complete their forks-scan? Anything that involves a "please come back later" solution sounds like it isn't necessarily worth coding.

I never talked about "users learning", but yeah - maybe it's indeed too much.

Let's let it float until more people start discovering this.

stdedos avatar Feb 25 '21 23:02 stdedos

Alright, so I have been thinking a bit more about this: I think there is a way to integrate your idea in a sufficiently nice way.


UI proposal

We'd be talking about a new icon at the top-right, in the header nav-bar: a little bell icon with a red notification number indicating how many interrupted scans are stored in the local cache. Clicking on the icon should open either a small notification modal dialog just below the icon, or a sliding partial modal dialog coming in from the right side of the screen.

That modal dialog should display clickable regions containing information about the interrupted scans. Clicking a region would relaunch the scan with its saved search parameters.
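
As a rough sketch of the wiring (the element IDs, the storage key, and the repo query parameter are all assumptions here):

```js
// Sketch only: populate the notification badge and the modal from the local cache.
// '#notification-badge', '#interrupted-scans-list' and the storage key are invented names.
function refreshInterruptedScansUI() {
  const scans = JSON.parse(localStorage.getItem('useful-forks/interrupted-scans') || '[]');
  document.querySelector('#notification-badge').textContent = String(scans.length);

  const list = document.querySelector('#interrupted-scans-list');
  list.innerHTML = '';
  for (const scan of scans) {
    const entry = document.createElement('button');
    entry.textContent = `${scan.repo} (interrupted ${new Date(scan.savedAt).toLocaleString()})`;
    // Clicking a region relaunches the scan with the saved search parameters.
    entry.addEventListener('click', () => {
      window.location.search = `?repo=${encodeURIComponent(scan.repo)}`;
    });
    list.appendChild(entry);
  }
}
```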

The cache

  • I think the cache's code should probably be a new file in website/src (a rough sketch follows this list).
  • The add_fork_elements function would be modified to consult the stored cache.
  • Maybe it would be nice to provide an option for users to clear it on demand?
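
A minimal sketch of what that new file could contain (the module API and names are placeholders, not actual project code); add_fork_elements would then consult it and skip any fork already known to be empty before spending API calls on it:

```js
// Sketch only: a tiny empty-forks cache module that could live as a new file in website/src.
const EMPTY_FORKS_KEY = 'useful-forks/empty-forks';

export function getEmptyForks(repo) {
  const all = JSON.parse(localStorage.getItem(EMPTY_FORKS_KEY) || '{}');
  return new Set(all[repo] || []);
}

export function addEmptyFork(repo, forkFullName) {
  const all = JSON.parse(localStorage.getItem(EMPTY_FORKS_KEY) || '{}');
  (all[repo] = all[repo] || []).push(forkFullName);
  localStorage.setItem(EMPTY_FORKS_KEY, JSON.stringify(all));
}

// Wired to a "reset cache" button so users can clear it on demand.
export function clearEmptyForksCache() {
  localStorage.removeItem(EMPTY_FORKS_KEY);
}
```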

Rate-Limit considerations

I also did a bit of research to see how the Rate Limit could be increased. Here is what I found:

payne911 avatar Feb 26 '21 15:02 payne911

That all sounds awesome to me. Unfortunately, you will not be able to solve #7 cleanly (as GraphQL uses a "normalized score" and does not count "API requests"), but that's a very small problem.

stdedos avatar Feb 26 '21 16:02 stdedos

As was determined in another issue, this problem actually concerns the GitHub API's abuse detection mechanism, triggered by how fast the tool sends requests (indeed, this problem arises before even reaching 10% of the allowed rate limit).

I created a thread on the GitHub Forum: hopefully we can get more answers there.

payne911 avatar Feb 27 '21 05:02 payne911

while scanning https://github.com/dbeaver/dbeaver

PR #15 fixed the scan for that specific repository (since it was actually an "Abuse Detection Mechanism" that was at play there).

Nonetheless, I'll leave this issue open because it doesn't answer the real question at stake here: what to do when we try to scan a massive repository that will always hit the Rate Limit imposed by the GitHub API? An example would be libgdx/libgdx.

payne911 avatar Mar 01 '21 06:03 payne911

I tried both dbeaver/dbeaver and libgdx/libgdx and could complete a full scan without encountering this issue anymore. I'll close it for now, until someone reports back that they've actually seen this error and can provide a reproducible repo input. I think the current implementation of the Octokit requests handles the rate-limiting properly, in such a way as to prevent GitHub from refusing to reply.
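
For reference, the kind of throttling configuration I mean looks roughly like this (a sketch using @octokit/plugin-throttling, not necessarily this project's exact setup):

```js
// Sketch only: let Octokit's throttling plugin back off automatically instead of
// hammering the API. The token handling is simplified.
import { Octokit } from "@octokit/rest";
import { throttling } from "@octokit/plugin-throttling";

const ThrottledOctokit = Octokit.plugin(throttling);

const octokit = new ThrottledOctokit({
  auth: userProvidedToken, // placeholder: the token the user pastes into the page
  throttle: {
    onRateLimit: (retryAfter, options, octokit) => {
      octokit.log.warn(`Rate limit hit for ${options.method} ${options.url}`);
      return options.request.retryCount === 0; // retry once, after `retryAfter` seconds
    },
    onSecondaryRateLimit: (retryAfter, options, octokit) => {
      // This is the "abuse detection" path: slow down and retry once instead of failing.
      octokit.log.warn(`Secondary rate limit for ${options.method} ${options.url}`);
      return options.request.retryCount === 0;
    },
  },
});
```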

payne911 avatar Dec 08 '22 06:12 payne911