taxize
taxize copied to clipboard
remote API call throttling is broken, at least for NCBI
At least this is the case for NCBI Entrez. Based on looking through the code, NCBI API calls are variously followed by a wait (why wait after the call's already been made??), have no wait preceding or following at all, and do not implement any wait whatsoever if an ENTREZ API key is present.
Perhaps NCBI changed within the last day or two how they measure API call rate, but as a matter of fact tests and vignette builds now fail in RNeXML, all with an NCBI Entrez API rate limit exceeded error. This is whether an API key is set or not; in the former case the error message reports 4 calls/s (with the limit being 3), in the latter case it reports 11 calls/s (with the limit being 10).
Arguably, the code in taxize should not be trying to implement call rate throttling, unless it has a single place through which all calls to a particular API provider run (which it doesn't, at least certainly not for the NCBI API calls). That's because if calls to the API are scattered through different functions, it's impossible to measure whether call rate is about to exceed the limits, and hence in the best case the code (with sleep(0.33)
etc) can hope for needlessly poor but not rate limit-violating performance. (For example, if a client's code is such that it triggers only 3 calls within a second, then no rate limiting or waiting time needs to be applied at all if 3 per second is the rate limit. Currently, such client code is slowed down to taking >=3x the time it would need to take.)
See build failures in ropensci/RNeXML#223.
Turns out the errors started on Dec 1, 2018, which is when NCBI started enforcing API limits. So it can probably be assumed that this always broken, and the difference is that it is showing only now.
To add to the original write-up, in actuality not even with a single point in the code through which all calls run would be able to prevent exceeding the rate limit. For example, tests might be run in parallel, and use the same API key.
It seems the proper algorithm to apply is not to wait prior to (or necessarily after) a remote API call, but instead to check the response headers in case of failure. Specifically, for the NCBI Entrez API, if the response has HTTP status 429, the header retry-after
should give a non-zero number as value (seconds after which to retry?). (Also the headers x-ratelimit-limit
and x-ratelimit-remaining
show what the limit is and how many requests remain, respectively.)
The code seems to use the function retry_until_it_works()
, which sounds like it would be the place to add this to. However, I can't find where this is defined, otherwise I'd send a patch.
I also notice that httr::RETRY()
recognizes out of the box the combination of the retry-after
header and HTTP status 429. So this could be made to work generically without additional code by simply using httr::RETRY()
, if the code hadn't been switched previously to use crul
instead of httr
.
Is there a reason for switching from httr
to crul
? #590 doesn't say why, and 07dea6b7c088fa77ba76d3a0a99fcd23f9d22c0b doesn't either. (It also lies, saying that the switch was only made for Tropicos, and there's no PR apparently that could explain more.)
BTW also found where retry_until_it_works()
is defined. (Why Github search doesn't find it is a mystery.)
waits were added because we attempted to not exceed limits on behalf of the user, which NCBI suggested years back.
it would have been useful if you told me what taxize function you are talking about?
I gave up on httr
because for many years they were unresponsive to bug reports/etc. - i make a lot of web request pkgs, so decided to make my own pkg (crul
) that I control
Does NCBI return these retry-after
headers? i'd be surprised
it would have been useful if you told me what taxize function you are talking about?
It is get_uid()
. This is what RNeXML::taxize_nexml()
calls.
I gave up on
httr
because for many years they were unresponsive to bug reports/etc
I see, I wasn't aware of that.
Does NCBI return these
retry-after
headers? i'd be surprised
Yes they do, actually. (Perhaps only since Dec 1, I did not test prior to that.) The retry method I propose in ropensci/crul#95 actually handles NCBI eutils automatically due to that.
So if that gets accepted, it would make things here pretty simple. Presumably one would just have change the eutils API calls to using retry()
, whether an API key has been set or not. I.e., with an API key the X-RateLimit-Remaining
header runs to zero a little later than without it, but the mechanism for catching that and retrying is exactly the same.
Okay, i'll have a look at the crul
PR