scrapy-zyte-api
Allow disabling AutoThrottle bypassing
The downloader middleware of scrapy-zyte-api was created to prevent AutoThrottle from affecting requests sent through Zyte API, and to let Zyte API itself control throttling on the server side instead, sending HTTP 429 responses when a spider is hitting a website too hard.
Relying on Zyte API to handle per-website throttling should usually be the best solution: Zyte API can have a better picture of the traffic that a website can support, and central throttling control allows running multiple spiders against the same domain in parallel without increasing the overall concurrency to the upstream website.
However, some users might want to let AutoThrottle do its thing anyway. We could implement a setting to let them do just that.
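A setting of that kind might look roughly like the sketch below. It is a simplified model, not the actual middleware: `Slot` and `DelayResetMiddleware` are stand-ins written for illustration, and `ZYTE_API_KEEP_SLOT_DELAY` is a hypothetical setting name. The only behavior taken from the real code is that the middleware resets the slot delay to 0.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Stand-in for a Scrapy downloader slot; only the delay matters here."""

    delay: float


class DelayResetMiddleware:
    """Simplified model of a middleware that zeroes the slot delay."""

    def __init__(self, settings: dict) -> None:
        # ZYTE_API_KEEP_SLOT_DELAY is a hypothetical setting name.
        # The default (False) keeps the current behavior: the delay is reset.
        self.keep_delay = bool(settings.get("ZYTE_API_KEEP_SLOT_DELAY", False))

    def process_request(self, slot: Slot) -> None:
        if self.keep_delay:
            # Leave whatever delay AutoThrottle (or DOWNLOAD_DELAY) has set.
            return
        slot.delay = 0


# Default behavior: the delay is wiped.
default_slot = Slot(delay=5.0)
DelayResetMiddleware({}).process_request(default_slot)

# Opt-in behavior: the delay survives.
kept_slot = Slot(delay=5.0)
DelayResetMiddleware({"ZYTE_API_KEEP_SLOT_DELAY": True}).process_request(kept_slot)

print(default_slot.delay, kept_slot.delay)
```

Defaulting the flag to off would keep existing projects unaffected, so only users who explicitly want AutoThrottle to apply would see a change.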
Removing slot.delay = 0 from the middleware's process_request:
https://github.com/scrapy-plugins/scrapy-zyte-api/blob/83df147098622ff96b60bd2b0921371fb469dfaf/scrapy_zyte_api/_downloader_middleware.py#L15-L25
On the one hand, this would make it possible to set a delay. On the other hand, it would not completely solve this issue, because of the specifics of the complicated (customisable) retry_policy functionality.
https://github.com/zytedata/python-zyte-api/blob/0.4.5/zyte_api/aio/retry.py is the default scrapy-zyte-api retry policy (4 attempts, with different backoff delays for different cases after each attempt).
Retry requests that originate on the zyte_api downloader_handler side (such requests do not pass through any of the Scrapy downloader middlewares) do not update the Scrapy downloader slots. This means that the delay for the next Scrapy request is calculated against the timestamp of the first scheduled attempt of the previous request.
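The effect on the delay can be illustrated with a small model. This is a sketch under the assumption that the slot delay is enforced relative to a single "last seen" timestamp, which simplifies how the Scrapy downloader works; `wait_before_next` and the timestamps are made up for illustration.

```python
def wait_before_next(delay: float, lastseen: float, now: float) -> float:
    """Seconds the next request must still wait, given when the slot was last used."""
    return max(0.0, delay - (now - lastseen))


DELAY = 5.0

# Hypothetical timeline of one Scrapy request (seconds):
# first attempt at t=0, then two retries made inside the download handler.
attempts = [0.0, 2.0, 6.0]
now = 6.5  # the next Scrapy request becomes ready shortly after the last retry

# The slot only knows about the first attempt, so the delay looks already served.
stale = wait_before_next(DELAY, lastseen=attempts[0], now=now)

# If retries updated the slot, the delay would count from the last attempt.
fresh = wait_before_next(DELAY, lastseen=attempts[-1], now=now)

print(f"slot not updated by retries: wait {stale}s")
print(f"slot updated by retries:     wait {fresh}s")
```

In this model the next request goes out immediately, half a second after the last retry hit the website, even though a 5-second delay is configured.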
Currently as a workaround users can use CONCURRENT_REQUESTS_... options; it should give similar results (see https://docs.scrapy.org/en/latest/topics/autothrottle.html#how-it-works), but not exactly the same.
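For reference, such a workaround could look like the settings fragment below. `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN` are standard Scrapy settings; the values are arbitrary illustrations, not recommendations, and would need tuning per target website.

```python
# settings.py (illustrative values only)

# Upper bound on requests in flight across all domains.
CONCURRENT_REQUESTS = 8

# Upper bound on requests in flight per domain; lowering this caps how hard
# a single website is hit, but it limits concurrency rather than request rate.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
```

Because these settings cap concurrency and not requests per second, the resulting rate still depends on how quickly responses come back.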
Do we still want this? I am not so sure anymore, however trivial it may be to implement.
Is there any update on this one?
I see the attached Scrapy PR, which allows disabling AutoThrottle per request / spider by setting download_slot and throttle to false.
Will there be a follow-up PR to make zyte-api respect AutoThrottle?
I think this would be a useful feature, as zyte-api is visiting the site at a rapid rate (close to 180 requests per minute) for a very long time without slowing down (~5k to 10k page requests). Visiting these sites daily at this rate to collect data is not ideal and might look like a DDoS attack!
I was told to use CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_, but as @kmike suggested, it's not the best way to handle it.