
Scrapy seems to fail to load some sites when using proxy or user agent middleware?

Open mmotti opened this issue 4 years ago • 14 comments

Description

I am trying to use some HTTP proxies with Scrapy in order to reduce the delay between crawls, but I seem to be having issues no matter which middleware I use.

The most recent proxy middleware I have tried (it seems to be the most up to date) is https://github.com/TeamHG-Memex/scrapy-rotating-proxies

The above works fine for most sites, but something goes wrong for a few repeat offenders. I have noticed that these same websites show the same kind of connection issues when I try to use user agent switching middlewares.
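For reference, my setup for that middleware is essentially the standard one from its README, roughly like this in settings.py (the proxy entry below is a placeholder):

DOWNLOADER_MIDDLEWARES = {
    # Priorities as documented in the scrapy-rotating-proxies README.
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = [
    'http://myuser:mypass@myproxyip:80',  # placeholder proxy
]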

As soon as I remove all of these middlewares (proxy / user agent), the issues with these sites go away; I cannot access them with either one of them enabled, let alone both.

I am raising this issue here rather than on the individual middlewares' GitHub repositories because I seem to be experiencing this across the board, so I am not sure whether this is something under the hood in Scrapy itself.

A recent example of this is as follows:

https://very.co.uk
https://www.very.co.uk
https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100

I can successfully hit very.co.uk with scrapy fetch (passing my user agent); however, as soon as I get the 301 redirect, something goes wrong and the connection to the redirected URL fails. I cannot successfully fetch/request https://www.very.co.uk during a crawl or a fetch when using a proxy.

At first I suspected that there might be an issue with the proxies I'm using (i.e. access denied because they are blocked), so I tried to access both pages with curl: I received the 301 response and a subsequent HTTP 200 (with response data) for the second URL, the one I'm unable to access with Scrapy through the same proxy.

curl --proxy http://myuser:mypass@myproxyip:80 -v -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" "https://www.very.co.uk"

Steps to Reproduce

  1. scrapy fetch "https://very.co.uk" -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" (with the proxy middleware installed / enabled)

Expected behavior: 301 --> https://www.very.co.uk --> HTTP 200

Actual behavior: 301 --> https://www.very.co.uk --> dead / timeouts / connection failures

Reproduces how often: 100%

Versions

Scrapy       : 2.4.1
lxml         : 4.6.1.0
libxml2      : 2.9.5
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 20.3.0
Python       : 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020)
cryptography : 3.2.1
Platform     : Windows-10-10.0.19041-SP0

mmotti avatar Dec 22 '20 20:12 mmotti

Is the issue reproducible when setting any proxy using request.meta['proxy']?

Gallaecio avatar Dec 23 '20 10:12 Gallaecio

Thanks for your reply!

As long as the following is correct, I get the same result.

yield scrapy.Request(url=url, callback=self.parse, meta={'item': k, 'match_term': match_term, 'tables': tables, 'proxy': 'http://user:pass@ip:80'})

Crawl of https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100

2020-12-23 10:21:03 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xsx>
Traceback (most recent call last):
  File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\scrapy\core\downloader\middleware.py", line 45, in process_request
    return (yield download_func(request=request, spider=spider))
  File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\users\matt\appdata\local\programs\python\python38\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 375, in _cb_timeout
    raise TimeoutError(f"Getting {url} took longer than {timeout} seconds.")
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xsx took longer than 120.0 seconds..

Crawl of the same URL/spider without proxy

2020-12-23 10:29:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xsx> (referer: None)
2020-12-23 10:29:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?xss> (referer: None)
2020-12-23 10:29:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.very.co.uk/e/promo/shop-all-consoles.end?numProducts=100&?p5> (referer: None)
2020-12-23 10:29:14 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-23 10:29:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1031,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 172171,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'elapsed_time_seconds': 3.391417,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 12, 23, 10, 29, 14, 793605),
 'log_count/DEBUG': 33,
 'log_count/INFO': 8,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 12, 23, 10, 29, 11, 402188)}
2020-12-23 10:29:14 [scrapy.core.engine] INFO: Spider closed (finished)

curl of the same URL with the same proxy settings

<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js desktop ie6 oldie" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 7]> <html class="no-js desktop ie7 oldie" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 8]> <html class="no-js desktop ie8 oldie" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if IE 9]> <html class="no-js desktop ie9" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <![endif]-->
<!--[if gt IE 9]><!--> <html class="no-js desktop" lang="en" xmlns:fb="http://ogp.me/ns/fb#"> <!--<![endif]-->
<head>
<meta http-equiv="Content-Language" content="en" />
<title>Shop All Consoles | www.very.co.uk</title>
<link rel="canonical" href="https://www.very.co.uk/e/promo/shop-all-consoles.end" />
<meta name='description' content="Shop All Consoles at very.co.uk. Discover our huge range and get outstanding deals in the latest Shop All Consoles from very.co.uk."/>
<meta name='keywords' content="Shop All Consoles"/>
<meta name="robots" content="noindex,follow" />
<link rel="preconnect" href="http://css.very.co.uk/">
<link rel="preconnect" href="http://js.very.co.uk/">
<link rel="preconnect" href="http://content.very.co.uk/">
<link rel="preconnect" href="http://media.very.co.uk/">
<link rel="preconnect" href="http://speedtrap.shopdirect.com/">
<script>(function(){if(sessionStorage.sdFontsLoaded){document.documentElement.className+=" fonts-loaded";}}());</script>
<style>#headerWrap {position: relative;}</style>
<!-- ---------- All | All | D | Console Polyfill JS Slot Start ---------- -->
<script type="text/javascript">
(function(con) {
'use strict';
var prop, method;
var empty = {};
var dummy = function() {};
var properties = 'memory'.split(',');
var methods = ('assert,clear,count,debug,dir,dirxml,error,exception,group,' +
'groupCollapsed,groupEnd,info,log,markTimeline,profile,profileEnd,' +
'table,time,timeEnd,timeStamp,trace,warn').split(',');
while (prop = properties.pop()) {con[prop] = con[prop] || empty;}
while (method = methods.pop()) {con[method] = con[method] || dummy;}
})(window.console = window.console || {});
</script>

etc...

mmotti avatar Dec 23 '20 10:12 mmotti

Could you include the cURL command? I also wonder if using HTTPS in the proxy URL would make a difference, but I may be talking nonsense here, it’s been a while since I’ve used a proxy this way.

Gallaecio avatar Dec 23 '20 15:12 Gallaecio

Could you include the cURL command? I also wonder if using HTTPS in the proxy URL would make a difference, but I may be talking nonsense here, it’s been a while since I’ve used a proxy this way.

The curl command is in the original post, mate. If there's a way I can message you privately, I am happy to share these proxy details temporarily so you can see what I see if you are unable to replicate.

Just to reiterate: I experience the same kind of behaviour with simple user agent switcher middlewares too, so the proxy may be a bit of a red herring. Really, really odd.
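By "user agent switcher" I mean middleware roughly along these lines (a minimal sketch, not the exact package I have installed; the user agent list is a placeholder):

import random

class RandomUserAgentMiddleware:
    # Enabled via DOWNLOADER_MIDDLEWARES in settings.py; the list below is a placeholder.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is downloaded.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None  # continue through the rest of the middleware chain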

The proxy in question supports both HTTP and HTTPS, though.

mmotti avatar Dec 23 '20 15:12 mmotti

Does enabling/disabling cookies make any difference?

Gallaecio avatar Dec 23 '20 16:12 Gallaecio

Does enabling/disabling cookies make any difference?

Cookies had been left enabled by default; however, I have just tried again after disabling them and sadly the result is the same.
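For reference, I disabled them via the standard setting in settings.py, assuming that's what you meant:

COOKIES_ENABLED = False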

mmotti avatar Dec 23 '20 17:12 mmotti

Just to reiterate: I experience the same kind of behaviour with simple user agent switcher middlewares too, so the proxy may be a bit of a red herring. Really, really odd.

Might this only be reproducible when sending multiple requests to the same website in a short period (e.g. during the same crawl session)? Or is it reproducible when sending a single request with Scrapy?

Gallaecio avatar Feb 21 '21 15:02 Gallaecio

It's been a little while since I looked at this, but I believe it failed from request #1: I couldn't make a single successful request to the affected sites unless I removed the proxy / user agent add-ons or removed the manual proxy settings.

mmotti avatar Feb 21 '21 16:02 mmotti

OK, I guess the only work left is for someone to reproduce the issue with their own proxy and one of the offending URLs (e.g. https://www.very.co.uk), and see if it’s reproducible with any proxy.
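A self-contained script along these lines should be enough to test with (the proxy URL is a placeholder; the user agent is the one from the original report):

import scrapy
from scrapy.crawler import CrawlerProcess

PROXY = 'http://user:pass@proxyhost:80'  # placeholder; substitute a real proxy
UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')

class VeryProbeSpider(scrapy.Spider):
    name = 'very_probe'
    custom_settings = {'USER_AGENT': UA}

    def start_requests(self):
        # Single request through the proxy; the built-in HttpProxyMiddleware
        # picks up the 'proxy' meta key.
        yield scrapy.Request('https://www.very.co.uk',
                             meta={'proxy': PROXY},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('Got %s (status %s)', response.url, response.status)

process = CrawlerProcess()
process.crawl(VeryProbeSpider)
process.start()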

Gallaecio avatar Feb 21 '21 17:02 Gallaecio

I am working with proxies. I can replicate the issue and will try to understand it; if I can, I will try to contribute a fix.

Nahid93 avatar Mar 07 '21 18:03 Nahid93

Have you had any luck resolving this? I am stuck on the exact same issue with a different website: curl works, and I can even browse the website with the proxy enabled in Firefox settings, but Scrapy fails for some reason.

devfox-se avatar Dec 02 '22 14:12 devfox-se

Have you had any luck resolving this? I wasn't able to fix this. I haven't done any scraping for a while now, though.

mmotti avatar Dec 02 '22 15:12 mmotti

I'm facing the same issue. Any update?

Elias-SLH avatar Dec 08 '22 12:12 Elias-SLH

I may have a lead: it seems to work when the proxy scheme is https, but not when it's http. It also does not work with an authenticated proxy address.
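Concretely, with a placeholder host/port, the pattern I see is:

# Works for me:
yield scrapy.Request(url, meta={'proxy': 'https://proxyhost:3128'})

# Times out for me:
yield scrapy.Request(url, meta={'proxy': 'http://proxyhost:3128'})
yield scrapy.Request(url, meta={'proxy': 'http://user:pass@proxyhost:3128'})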

Elias-SLH avatar Dec 08 '22 14:12 Elias-SLH