
retrying blacklisted response

Open domingguss opened this issue 6 years ago • 0 comments

So I am running Scrapoxy on an EC2 instance on AWS, and crawling with Scrapy in Python. Now, some-website.com returns a 429 after roughly every 10 requests, so I have enabled the BlacklistDownloaderMiddleware (and the other middlewares) as in the example.
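For reference, my middleware setup is roughly the following sketch. `DOWNLOADER_MIDDLEWARES` is standard Scrapy; the `myproject.middlewares` module path is a placeholder for wherever the example's classes live in your project, and the priority number is just a plausible value:

```python
# settings.py (sketch) -- the module path and priority are placeholders,
# not scrapoxy's actual package layout; see the scrapoxy example for the
# real paths.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BlacklistDownloaderMiddleware': 950,
}
```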

These are the logs:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/190> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/191> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/192> (referer: None)
[spider] DEBUG: Ignoring Blacklisted response https://www.some-website.com/profile/193: HTTP status 429
[urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 13.33.33.37:8889
[urllib3.connectionpool] DEBUG: http://13.33.33.37:8889 "POST /api/instances/stop HTTP/1.1" 200 11
[spider] DEBUG: Remove: instance removed (1 instances remaining)
[spider] INFO: Sleeping 89 seconds
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/194> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/195> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.some-website.com/profile/196> (referer: None)

It indeed ignores the blacklisted response (/profile/193) and continues with /profile/194.

My question: how can I easily retry crawling /profile/193?
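One way I can think of is a small downloader middleware that re-schedules the 429'd request instead of letting it be dropped. This is only a sketch, not scrapoxy API: the class name, the `retry_429` meta key, and the retry cap are all made up. It relies on two real Scrapy facilities, `request.meta` and `request.replace(dont_filter=True)` (the latter returns a copy of the request and stops the dupefilter from discarding the already-seen URL):

```python
# Sketch of a retry-on-429 downloader middleware. The class name, meta
# key, and MAX_429_RETRIES are hypothetical, not part of scrapoxy.

MAX_429_RETRIES = 3  # assumed cap so a permanently blocked URL can't loop forever

class Retry429Middleware:
    """Re-schedule a request that came back 429 instead of dropping it."""

    def process_response(self, request, response, spider):
        if response.status == 429:
            retries = request.meta.get('retry_429', 0)
            if retries < MAX_429_RETRIES:
                # request.replace() returns a copy of the request;
                # dont_filter=True keeps the dupefilter from discarding
                # the already-crawled URL on the second attempt.
                retry_request = request.replace(dont_filter=True)
                retry_request.meta['retry_429'] = retries + 1
                return retry_request  # returning a Request re-queues it
        return response  # everything else passes through untouched
```

Alternatively, Scrapy's built-in RetryMiddleware with `RETRY_HTTP_CODES = [429]` and `RETRY_TIMES` set might achieve the same with less code, though its ordering relative to the blacklist middleware would matter, since the blacklist middleware stops the instance first.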

domingguss · Nov 23 '18 14:11