useful-forks.github.io
The queried repository has too many forks
I have hit this unfortunate message while scanning https://github.com/dbeaver/dbeaver:

> It's also possible that the queried repository has so many forks that it's impossible to scan it completely without running out of API calls. :(
I would gladly give up some Local Storage in order to be able to fully scan a repo.
I'll just lay out my whole thought process, even though it might be too complicated to implement in one go:
- Keep the API calls counter (#7), the number of API calls still available, and the API limit refresh time
- Store all inactive forks in a cache for some time (e.g. 1 week). Add a tooltip on the side of the page: "Skipping x forks because they are empty [reset cache]"
- When the number of API calls still available is zero, start the API limit refresh timer as a countdown
- Store the current results in cache, with their fetch timestamp (e.g. for 1 day)
- Add/Enable a button [continue fetching] when the [API limit refresh time reaches zero | it is detected that there are API calls available]
And lastly: I don't know how feasible it is (it doesn't look like it), but server-side caching shouldn't be too much (I don't think that table would be too much data to store as JSON), and it might help "fight" the scourge of empty forks.
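Something like the following localStorage helpers could cover the counter, refresh time, and partial-results pieces of this proposal. This is only a minimal sketch: all the key names and functions here are hypothetical, not existing code.

```js
// Hypothetical localStorage helpers for the counters and partial results described above.
const RATE_KEY = 'uf_rate_limit_state';
const RESULTS_KEY = 'uf_partial_results';
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Remember how many API calls remain and when the limit refreshes.
function saveRateLimitState(remaining, resetEpochSeconds) {
  localStorage.setItem(RATE_KEY, JSON.stringify({ remaining, resetEpochSeconds }));
}

// Persist the rows fetched so far, stamped so they can expire after ~1 day.
function savePartialResults(repo, rows) {
  localStorage.setItem(RESULTS_KEY, JSON.stringify({ repo, rows, fetchedAt: Date.now() }));
}

// Return the cached rows if they are still fresh, otherwise null.
function loadPartialResults(repo) {
  const saved = JSON.parse(localStorage.getItem(RESULTS_KEY) || 'null');
  if (!saved || saved.repo !== repo) return null;
  return (Date.now() - saved.fetchedAt) < ONE_DAY_MS ? saved.rows : null;
}
```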
Yes, this is a very unfortunate limitation of the tool. I contacted GitHub to try to figure out what can be done, but so far it seems like there isn't an easy solution.
The problem is that I don't really have access to a persistence layer (the server-side statefulness you talk about): the website is just a static page served by GitHub's servers. The only persistence we can do is local.
The easiest solution I can think of would be to stall the API requests for a whole hour when the limit is reached. Then, when that countdown expires, the algorithm could keep running from where it paused.
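A minimal sketch of that pause-and-resume idea, assuming the standard `x-ratelimit-reset` response header is available and that the scan's pagination state has been saved somewhere (both function names below are made up):

```js
// Hypothetical helper: waits until the rate limit resets, then the scan can resume.
async function waitForRateLimitReset(response) {
  // GitHub sends the reset time as a Unix timestamp, in seconds.
  const resetEpochSeconds = Number(response.headers['x-ratelimit-reset']);
  const waitMs = Math.max(0, resetEpochSeconds * 1000 - Date.now());
  await new Promise(resolve => setTimeout(resolve, waitMs));
}

// Usage sketch: resumeScan() is a hypothetical function that would pick up
// the pagination state saved just before the limit was hit.
// await waitForRateLimitReset(lastResponse);
// resumeScan(savedPaginationState);
```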
Another potential solution would be to include more optional URL parameters to guide the algorithm on where to begin its scan, but that would be error-prone and clumsy to implement.
But I think the real question here is: are users really willing to learn a (rather frustrating and cumbersome) process to complete their forks-scan? Anything that involves a "please come back later" solution sounds like it isn't necessarily worth coding.
> But I think the real question here is: are users really willing to learn a (rather frustrating and cumbersome) process to complete their forks-scan? Anything that involves a "please come back later" solution sounds like it isn't necessarily worth coding.
I never talked about "users learning", but yeah - maybe it's indeed too much.
Let's let it float until more people start discovering this.
Alright, so I have been thinking a bit more about this: I think there is a way to integrate your idea in a sufficiently nice way.
### UI proposal
We'd be talking about a new icon at the top-right, in the header nav-bar: a little bell icon with a red notification number indicating how many interrupted scans are stored in the local cache. Clicking the icon would open either a small notification modal dialog just below the icon, or a sliding partial modal dialog coming in from the right side of the screen.
That modal dialog should display clickable regions containing information about the interrupted scans. Clicking a region would relaunch the scan with its saved search parameters.
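A rough sketch of how the badge count and the clickable entries could be wired together, assuming interrupted scans are stored in localStorage. The storage key, element id, entry shape, and query-parameter name below are all assumptions for illustration:

```js
// Hypothetical shape of an interrupted-scan entry kept in localStorage:
// { repo: 'dbeaver/dbeaver', savedAt: 1620000000000, cursor: <pagination state> }
function getInterruptedScans() {
  return JSON.parse(localStorage.getItem('uf_interrupted_scans') || '[]');
}

function refreshBellBadge() {
  const scans = getInterruptedScans();
  // 'bell-badge' is an assumed element id; the real page markup may differ.
  document.getElementById('bell-badge').textContent = String(scans.length);
}

function onScanEntryClicked(scan) {
  // Relaunch the scan with the parameters saved when it was interrupted
  // (the query-parameter name is assumed here).
  window.location.search = `?repo=${encodeURIComponent(scan.repo)}`;
}
```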
### The cache
- I think the cache's code should probably be a new file in `website/src`.
- The `add_fork_elements` function would be modified to call the stored cache.
- Maybe it would be nice to provide an option for users to clear it on demand?
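As a sketch of what that new file under `website/src` could expose (the file name and function names are suggestions of mine, not existing code):

```js
// website/src/fork-cache.js  (hypothetical file and function names)
const CACHE_KEY = 'uf_empty_fork_cache';
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Called by `add_fork_elements` before requesting details for a fork:
// true means the fork was recently recorded as empty and can be skipped.
export function isCachedAsEmpty(forkFullName) {
  const cache = JSON.parse(localStorage.getItem(CACHE_KEY) || '{}');
  const seenAt = cache[forkFullName];
  return seenAt !== undefined && (Date.now() - seenAt) < ONE_WEEK_MS;
}

// Record a fork that turned out to be empty.
export function markAsEmpty(forkFullName) {
  const cache = JSON.parse(localStorage.getItem(CACHE_KEY) || '{}');
  cache[forkFullName] = Date.now();
  localStorage.setItem(CACHE_KEY, JSON.stringify(cache));
}

// Handler for the "clear it on demand" option.
export function clearCache() {
  localStorage.removeItem(CACHE_KEY);
}
```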
### Rate-Limit considerations
I also did a bit of research to see how the Rate-Limit could be increased. Here is what I found:
- Using REST was probably a mistake. I'm not very familiar with GraphQL, but from what I understand its Rate-Limit is calculated differently. Making the switch would probably be extremely beneficial (and I think it would even resolve this old issue).
- I'm currently exploring whether registering a GitHub App might actually increase the limit.
- The GitHub API provides conditional requests which could be a way to take advantage of their server-side caching.
- Calls to the `GET Rate-Limit` endpoint do not count against the REST API rate-limit.
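A rough sketch of the last two points with Octokit; the token, ETag value, and error handling below are assumptions, not code that exists in the repo:

```js
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: "YOUR_TOKEN" }); // placeholder token

// 1) Checking the remaining quota is "free": this endpoint does not count against the limit.
const { data: limits } = await octokit.rest.rateLimit.get();
console.log(limits.resources.core.remaining, limits.resources.core.reset);

// 2) Conditional request: replay an ETag saved from a previous scan. A 304 answer means
//    nothing changed and, per GitHub's docs, the request is not counted against the limit.
const previouslySavedEtag = '"example-etag"'; // placeholder: would be stored by the tool
try {
  const response = await octokit.request("GET /repos/{owner}/{repo}", {
    owner: "dbeaver",
    repo: "dbeaver",
    headers: { "if-none-match": previouslySavedEtag },
  });
  console.log(response.headers.etag); // fresh data: remember the new ETag for next time
} catch (error) {
  if (error.status === 304) {
    // Not modified (recent Octokit versions surface 304 as a thrown error):
    // reuse the locally cached data from the previous scan.
  } else {
    throw error;
  }
}
```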
All of this sounds awesome to me. Unfortunately, you will not be able to solve #7 cleanly (as GraphQL uses a "normalized score" and does not count "API requests"), but that's a very small problem.
As was determined in another issue, this problem actually concerns the GitHub API's abuse detection mechanism, triggered by how fast the tool sends requests (indeed, the problem arises before even 10% of the allowed rate-limit is reached).
I created a thread on the GitHub Forum: hopefully we can get more answers there.
> while scanning https://github.com/dbeaver/dbeaver
PR #15 fixed the scan for that specific repository (since it was actually an "Abuse Detection Mechanism" that was at play there).
Nonetheless, I'll leave this issue open because that doesn't answer the real question at stake here: what do we do when scanning a massive repository that will always hit the Rate Limit of the GitHub API? An example would be libgdx/libgdx.
I tried both dbeaver/dbeaver and libgdx/libgdx and could reach a full scan without encountering this issue anymore.
I'll close it for now, until someone reports back that they've actually seen this error, and can provide a reproducible repo input.
I think the current implementation of the Octokit requests properly handles rate-limiting, in such a way that GitHub doesn't refuse to reply.
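For reference, a minimal sketch of that kind of setup with `@octokit/plugin-throttling` (the handler bodies and token are illustrative; I'm not claiming this is the exact code used by the tool):

```js
import { Octokit } from "@octokit/core";
import { throttling } from "@octokit/plugin-throttling";

const ThrottledOctokit = Octokit.plugin(throttling);

const octokit = new ThrottledOctokit({
  auth: "YOUR_TOKEN", // placeholder token
  throttle: {
    onRateLimit: (retryAfter, options, octokit) => {
      octokit.log.warn(`Rate limit hit for ${options.method} ${options.url}`);
      return true; // returning true asks the plugin to retry after `retryAfter` seconds
    },
    onSecondaryRateLimit: (retryAfter, options, octokit) => {
      // This is the "abuse detection" case discussed above
      // (named onAbuseLimit in older versions of the plugin).
      octokit.log.warn(`Secondary rate limit hit for ${options.method} ${options.url}`);
      return true;
    },
  },
});
```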