Scrapy seems to fail to load some sites when using proxy or user agent middleware?
Description
I am trying to use some HTTP proxies with Scrapy in order to reduce the delay between crawls, and I seem to be having issues no matter which middleware I use.
The most recent proxy middleware I have tried (and seemingly the most up to date) is: https://github.com/TeamHG-Memex/scrapy-rotating-proxies
The above works fine for most sites, but something problematic occurs with a few repeat offenders. I have noticed that these same websites show the same kind of connection issues when I try middlewares for user agent switching.
As soon as I remove all of these middlewares (proxy / user agent), the issues with these sites go away; with either one enabled I cannot access them.
I am raising this issue here as opposed to the individual middlewares' GitHub repositories because I seem to be experiencing this across the board, so I am not sure whether something under the hood is involved.
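For context, the middleware was wired up roughly as in the scrapy-rotating-proxies README; this is a sketch of my setup (the proxy credentials below are placeholders), not a verified minimal repro:

```python
# settings.py -- enabling scrapy-rotating-proxies, per that project's README.
# The proxy entry is a placeholder, not a real endpoint.
ROTATING_PROXY_LIST = [
    "http://myuser:mypass@myproxyip:80",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```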
A recent example of this is as follows:
https://very.co.uk https://www.very.co.uk https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100
I can successfully hit very.co.uk with scrapy fetch
(passing my user agent); however, as soon as I get the 301 redirect, something goes wrong and the connection to the redirected URL fails. I cannot successfully fetch/request https://www.very.co.uk during a crawl or a fetch when using a proxy.
At first I suspected that I might have an issue with the proxies I'm using (i.e. access denied due to being blocked), so I tried to access both pages with cURL. I successfully received the 301 response and a subsequent HTTP 200 (with response data) for the second URL, which I'm unable to access with Scrapy when using the same proxy.
curl --proxy http://myuser:mypass@myproxyip:80 -v -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "https://www.very.co.uk"
Steps to Reproduce
scrapy fetch "https://very.co.uk" -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
(with the proxy middleware installed / enabled)
Expected behavior: 301 --> https://www.very.co.uk --> HTTP 200
Actual behavior: 301 --> https://www.very.co.uk --> dead / timeouts / connection failures
Reproduces how often: 100%
Versions
Scrapy       : 2.4.1
lxml         : 4.6.1.0
libxml2      : 2.9.5
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020)
cryptography : 3.2.1
Platform     : Windows-10-10.0.19041-SP0
Is the issue reproducible setting any proxy using request.meta['proxy']?
Thanks for your reply!
So long as the following is the correct way to do it, then yes, I get the same result:
yield scrapy.Request(url=url, callback=self.parse, meta={'item': k, 'match_term': match_term, 'tables': tables, 'proxy': 'http://user:pass@ip:80'})
Crawl of https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100
2020-12-23 10:21:03 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xsx>
Traceback (most recent call last):
File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\scrapy\core\downloader\middleware.py", line 45, in process_request
return (yield download_func(request=request, spider=spider))
File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 375, in _cb_timeout
raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xsx took longer than 120.0 seconds..
Crawl of the same URL/spider without proxy
2020-12-23 10:29:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xsx> (referer: None)
2020-12-23 10:29:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xss> (referer: None)
2020-12-23 10:29:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?p5> (referer: None)
2020-12-23 10:29:14 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-23 10:29:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1031,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 172171,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'elapsed_time_seconds': 3.391417,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 12, 23, 10, 29, 14, 793605),
'log_count/DEBUG': 33,
'log_count/INFO': 8,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 12, 23, 10, 29, 11, 402188)}
2020-12-23 10:29:14 [scrapy.core.engine] INFO: Spider closed (finished)
Curl of the same URL with the same proxy settings
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js desktop ie6 oldie" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 7]> <html class="no-js desktop ie7 oldie" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 8]> <html class="no-js desktop ie8 oldie" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 9]> <html class="no-js desktop ie9" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if gt IE 9]><!--> <html class="no-js desktop" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <!--<![endif]-->
<head>
<meta http-equiv="Content-Language" content="en" />
<title>Shop All Consoles | www.very.co.uk</title>
<link rel="canonical" href="https://www.very.co.uk/e/promo/shop-all-consoles.end" />
<meta name='description' content="Shop All Consoles at very.co.uk. Discover our huge range and get outstanding deals in the latest Shop All Consoles from very.co.uk."/>
<meta name='keywords' content="Shop All Consoles"/>
<meta name="robots" content="noindex,follow" />
<link rel="preconnect" href="http://css.very.co.uk/">
<link rel="preconnect" href="http://js.very.co.uk/">
<link rel="preconnect" href="http://content.very.co.uk/">
<link rel="preconnect" href="http://media.very.co.uk/">
<link rel="preconnect" href="http://speedtrap.shopdirect.com/">
<script>(function(){if(sessionStorage.sdFontsLoaded){document.documentElement.className+=" fonts-loaded";}}());</script>
<style>#headerWrap {position: relative;}</style>
<!-- ---------- All | All | D | Console Polyfill JS Slot Start ---------- -->
<script type="text/javascript">
(function(con) {
'use strict';
var prop, method;
var empty = {};
var dummy = function() {};
var properties = 'memory'.split(',');
var methods = ('assert,clear,count,debug,dir,dirxml,error,exception,group,' +
'groupCollapsed,groupEnd,info,log,markTimeline,profile,profileEnd,' +
'table,time,timeEnd,timeStamp,trace,warn').split(',');
while (prop = properties.pop()) {con[prop] = con[prop] || empty;}
while (method = methods.pop()) {con[method] = con[method] || dummy;}
})(window.console = window.console || {});
</script>
etc...
Could you include the cURL command? I also wonder if using HTTPS in the proxy URL would make a difference, but I may be talking nonsense here; it's been a while since I've used a proxy this way.
The cURL command is in the original post. If there's a way I can privately message you, I am happy to give you these proxy details temporarily so you can see what I see if you are unable to replicate.
Just to reiterate: I experience the same kind of behaviour with simple user agent switcher middlewares too, so the proxy may be a bit of a red herring. Really, really odd.
The proxy in question supports both HTTP and HTTPS, though.
Does enabling/disabling cookies make any difference?
Cookies had been left enabled (the default); however, I have just tried after disabling them and sadly the result is the same.
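For reference, "disabling them" was just the standard Scrapy setting, nothing exotic:

```python
# settings.py -- the only change made for the cookies test.
# The default is True; the result was the same either way.
COOKIES_ENABLED = False
```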
Might this only be reproducible when sending multiple requests in a short period (e.g. during the same crawl session) to the same website? Or is it reproducible when sending a single request with Scrapy?
It's been a little while since I looked at this, but I believe it failed from request #1. I couldn't make a successful request at all to the specific sites unless I removed the proxy / user agent add-ons or removed the manual proxy settings.
OK, I guess the only work left is for someone to reproduce the issue with their own proxy and one of the offending URLs (e.g. https://www.very.co.uk), and see if it’s reproducible with any proxy.
I am working with proxies. I can replicate the issue and will try to understand it; if I can, I will try to contribute the code.
Have you had any luck resolving this? I am stuck on the exact same issue with a different website: cURL works, and I can even browse the website with the proxy enabled in Firefox's settings, but Scrapy fails for some reason.
Have you had any luck resolving this? I wasn't able to fix this. Haven't done any scraping for a while now, though.
I'm facing the same issue. Any update?
I may have a lead: it seems to work when the proxy scheme is https, but does not work when it's http. Also, it does not work with an authenticated proxy address.
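To A/B test this lead without changing anything else, one can swap only the scheme of the proxy URL that goes into request.meta['proxy']. The helper and the proxy value below are just an illustration, not part of Scrapy:

```python
from urllib.parse import urlsplit, urlunsplit

def with_scheme(proxy_url: str, scheme: str) -> str:
    """Return the same proxy URL with only the scheme swapped."""
    parts = urlsplit(proxy_url)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))

# Placeholder proxy. Reported behaviour: the https:// form works,
# the http:// form times out, and authenticated proxies fail either way.
proxy = "http://myuser:mypass@myproxyip:80"
print(with_scheme(proxy, "https"))
```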