pyinaturalist
HTTP 429 Rate Limit error on reading observations
The problem
We have a one-off process pulling data from iNaturalist. After running for a minute or two, it fails with the following HTTP 429 error.
2024-03-26 18:22:47 INFO ----------------------
2024-03-26 18:22:47 INFO Request:
GET https://api.inaturalist.org/v1/observations?id=2869994&only_id=false
User-Agent: python-requests/2.31.0 pyinaturalist/0.18.0
Accept-Encoding: gzip, deflate
Accept: application/json
Connection: keep-alive
2024-03-26 18:22:48 INFO Rate limit exceeded for https://api.inaturalist.org/v1/observations?id=2869994&only_id=false; filling limiter bucket
Traceback (most recent call last):
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/mainMigrate.py", line 37, in <module>
    main(param1)
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/mainMigrate.py", line 28, in main
    copy_count = copier.copyiNatLocations_to_existing_CAMS_features(how_many_records_to_migrate)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/migration/migrate.py", line 58, in copyiNatLocations_to_existing_CAMS_features
    observation = self.get_observation_from_id(observationID)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/inaturalist-to-cams/inaturalist-to-cams/migration/migrate.py", line 36, in get_observation_from_id
    observation = pyinaturalist.get_observation(observation_id)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/v1/observations.py", line 583, in get_observation
    response = get_observations(id=observation_id, access_token=access_token, **params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/forge/_revision.py", line 328, in inner
    return callable(*mapped.args, **mapped.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/v1/observations.py", line 81, in get_observations
    observations = get(f'{API_V1}/observations', **params).json()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/session.py", line 358, in get
    return session.request('GET', url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/pyinaturalist/session.py", line 271, in request
    response.raise_for_status()
  File "/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://api.inaturalist.org/v1/observations?id=2869994&only_id=false
Error: Process completed with exit code 1.
Expected behavior
Our understanding was that pyinaturalist applies client-side rate limiting to stay within iNaturalist's limits, and we are only making about one request per second.
Steps to reproduce the behavior
Create a script that repeatedly calls pyinaturalist.get_observation(observation_id), as in the sketch below.
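A minimal reproduction sketch; the ID list here is hypothetical (ours holds about 4,000 observation IDs):

import pyinaturalist

# Hypothetical list of observation IDs (ours has ~4,000 entries)
observation_ids = [2869994, 2869995, 2869996]

# One request per ID; after running for a minute or two, the loop
# fails with the 429 error shown above.
for observation_id in observation_ids:
    observation = pyinaturalist.get_observation(observation_id)
    print(observation['id'])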
Workarounds
We could add a wait to our code.
Environment
- OS & version: Ubuntu 22.04.4
- Python version: CPython 3.11.8
- Pyinaturalist version or branch: 0.18.0
I will probably need some more info about how this process is running. Is it running from a CI system or a cloud provider with ephemeral storage? Does it use multiprocessing? Is it connecting to iNat from an IP address shared with other services that also connect to iNat?
The API has per-second, per-minute, and per-day rate limits, tracked per IP address (some more details here). To track these limits on the client side, a small persistent SQLite table records when recent requests were made (via requests-ratelimiter + pyrate-limiter). That's sufficient for a single process, and for multithreading with persistent storage, but some extra work is needed to handle other scenarios like the ones mentioned above.
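For reference, a minimal sketch of that client-side mechanism using requests-ratelimiter with a persistent SQLite bucket; the limits and file path here are illustrative assumptions, not pyinaturalist's exact settings:

from pyrate_limiter import SQLiteBucket
from requests_ratelimiter import LimiterSession

# Illustrative limits only. The persistent bucket file is what lets the
# limiter remember recent requests across runs; on ephemeral storage
# (e.g. a fresh CI runner) that history starts empty every time.
session = LimiterSession(
    per_second=1,
    per_minute=60,
    bucket_class=SQLiteBucket,
    bucket_kwargs={'path': 'ratelimit.db'},
)
response = session.get(
    'https://api.inaturalist.org/v1/observations',
    params={'id': 2869994},
)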
The process is running as a GitHub Action. It is single-threaded. It's possible that GitHub shares the IP address with other processes; however, we have another job running hourly on the same infrastructure that has never seen these issues.
@amazing-will - We could also test it from a local machine with the same parameters to see if we get the same results?
The thing that is different about this process is that it retrieves one iNaturalist record at a time, whereas our other process pages through 200 at a time. This is mostly because we are working from a list of observations and processing them one at a time. Since it is a one-off process that only needs to handle about 4,000 observations, this was a shortcut to get it working quickly. We could modify the code to read more records per request, as in the sketch below, but it would probably be easier to just add a delay, since execution speed isn't a concern.
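Something like this, where process() is a hypothetical stand-in for our per-record handling:

import pyinaturalist

# Fetch up to 200 observations per request instead of one at a time,
# cutting ~4,000 requests down to ~20.
def batches(ids, size=200):
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

for batch in batches(observation_ids):
    response = pyinaturalist.get_observations(id=batch, per_page=200)
    for observation in response['results']:
        process(observation)  # hypothetical per-record handler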
We implemented a workaround of adding a 1-second delay to the processing of each record. It's not ideal, but it is OK for a one-off job. We might look at reworking our code if we need to run something similar on an ongoing basis.
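Roughly, extending the reproduction sketch above:

import time
import pyinaturalist

for observation_id in observation_ids:
    observation = pyinaturalist.get_observation(observation_id)
    process(observation)  # hypothetical per-record handler
    time.sleep(1)  # extra 1-second delay keeps us under the per-minute limit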
Looking through the logs, it appears we were right on the cusp of what iNaturalist allows, processing 60 records in 60 seconds. I wonder if network latency of a few milliseconds could cause iNaturalist to receive 60 requests in slightly under 60 seconds?
Anyway, I'll close this for now, since we no longer see the issue with our workaround in place.