scrapy-rotating-proxies
Scrapy gets stuck when a page does not respond
Scrapy gets stuck when a page does not respond. Can I set a timeout per page?
...
2018-01-22 09:27:09 [scrapy.extensions.logstats] INFO: Crawled 183 pages (at 42 pages/min), scraped 183 items (at 42 items/min)
2018-01-22 09:27:09 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 2, unchecked: 0, reanimated: 3, mean backoff time: 76s)
2018-01-22 09:27:39 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 4, unchecked: 0, reanimated: 0, mean backoff time: 159s)
2018-01-22 09:28:09 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 13 pages/min), scraped 196 items (at 13 items/min)
2018-01-22 09:28:09 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 3, unchecked: 0, reanimated: 1, mean backoff time: 199s)
2018-01-22 09:28:39 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 3, unchecked: 0, reanimated: 1, mean backoff time: 199s)
2018-01-22 09:29:09 [scrapy.extensions.logstats] INFO: Crawled 196 pages (at 0 pages/min), scraped 196 items (at 0 items/min)
2018-01-22 09:29:09 [rotating_proxies.middlewares] INFO: Proxies(good: 1, dead: 2, unchecked: 0, reanimated: 2, mean backoff time: 242s)
It waits more than 5 minutes before attempting the first retry.
We need more information to help you; it could be a network problem on your end or on the server you are scraping.
I ran into the same problem. When using a proxy, the default download timeout of 180 seconds is used. You can adjust this with the `download_timeout` request meta key or via the `DOWNLOAD_TIMEOUT` setting.
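For example, a minimal sketch (the 30-second value is just an illustration, not a recommended number):

```python
# settings.py -- lower the project-wide download timeout
# (Scrapy's default DOWNLOAD_TIMEOUT is 180 seconds)
DOWNLOAD_TIMEOUT = 30

# Alternatively, override it for a single request in your spider
# via the download_timeout meta key:
#
#     yield scrapy.Request(url, callback=self.parse,
#                          meta={'download_timeout': 30})
```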
To explain this: you are running out of proxies. The middleware has a default backoff of 180 seconds, which means it will reuse proxy A only once every 3 minutes. In your case all of your proxies are still waiting to cool down, so the crawler has no available proxy slots and sits idle.
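A toy calculation makes the stall easy to see. This is not the library's exact formula, just an illustrative sketch of exponential backoff with a cap; with a 180-second base, a handful of failures parks a proxy for many minutes, matching the multi-minute "mean backoff time" in the log above:

```python
def backoff(attempt, base=180, cap=3600):
    """Seconds to wait before reusing a proxy after `attempt`
    consecutive failures: the delay doubles each time, capped at one hour."""
    return min(cap, base * 2 ** attempt)

# 0 failures -> 180 s, 1 -> 360 s, 2 -> 720 s, 3 -> 1440 s (24 minutes)
for attempt in range(4):
    print(attempt, backoff(attempt))
```

If the real middleware behaves along these lines, adding more proxies, or tuning the backoff settings the library exposes (`ROTATING_PROXY_BACKOFF_BASE` and `ROTATING_PROXY_BACKOFF_CAP`, if I recall the names correctly), should shorten these idle periods.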
I have this problem too, but in my case I saw that there were still unchecked proxies.
I guess the fix suggested in issue #33 fixed it for me. In line 123 of middlewares.py, I replaced

```python
if 'proxy' in request.meta and not request.meta.get('_rotating_proxy'):
```

with

```python
if 'proxy' in request.meta:
```

This seems to have worked for me, but I can't say so with absolute certainty.