Scrapy_IPProxyPool

Ran into a problem

Open · 101142TS opened this issue 5 years ago · 0 comments

The proxy IPs the pool returns cannot crawl the site. I am trying to crawl wandoujia, but every proxy IP I get times out when accessing the target.
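For reference, one of the returned proxies (the pool output is crawled from kuaidaili, see the log below) can be tested directly against the target page, independent of Scrapy. A minimal sketch, assuming the standalone requests library; the proxy address and URL are taken from the log that follows:

```python
import requests

# One of the proxies returned by the pool (taken from the log below) and the
# page the spider is trying to fetch.
proxy = "http://123.149.136.121:9999"
url = "https://www.wandoujia.com/apps/665777"

try:
    resp = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,  # fail fast instead of waiting 180 s like the Scrapy run below
    )
    print(resp.status_code, len(resp.text))
except requests.RequestException as exc:
    print("proxy failed:", exc)
```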

/Users/icst/Desktop/test_proxy/wandoujia/proxyPool/ProxyPoolWorker.py:81: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if proxy is not '':
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymysql/cursors.py:170: Warning: (1681, b'Integer display width is deprecated and will be removed in a future release.')
  result = self._query(query)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pymysql/cursors.py:170: Warning: (3719, b"'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.")
  result = self._query(query)
正在爬取快代理……
115.216.56.92 | 9999 | 高匿名 | HTTP | 浙江省杭州市 电信 | 3秒
123.149.136.127 | 9999 | 高匿名 | HTTP | 河南省洛阳市 电信 | 1秒
111.72.25.153 | 9999 | 高匿名 | HTTP | 江西省抚州市 电信 | 0.5秒
183.166.111.11 | 9999 | 高匿名 | HTTP | 安徽省淮南市 电信 | 2秒
171.35.211.234 | 9999 | 高匿名 | HTTP | 江西省新余市 联通 | 3秒
114.239.110.93 | 9999 | 高匿名 | HTTP | 江苏省宿迁市 电信 | 2秒
110.243.2.58 | 9999 | 高匿名 | HTTP | 河北省唐山市 联通 | 2秒
114.99.22.104 | 9999 | 高匿名 | HTTP | 安徽省铜陵市 电信 | 2秒
124.113.250.171 | 9999 | 高匿名 | HTTP | 安徽省宿州市 电信 | 3秒
123.149.141.209 | 9999 | 高匿名 | HTTP | 河南省洛阳市 电信 | 1秒
183.146.156.254 | 9999 | 高匿名 | HTTP | 浙江省金华市 电信 | 0.7秒
123.149.136.121 | 9999 | 高匿名 | HTTP | 河南省洛阳市 电信 | 3秒
163.204.247.139 | 9999 | 高匿名 | HTTP | 广东省汕尾市 联通 | 1秒
123.163.27.220 | 9999 | 高匿名 | HTTP | 河南省洛阳市 电信 | 0.8秒
1.196.177.218 | 9999 | 高匿名 | HTTP | 河南省洛阳市 电信 | 0.7秒
2020-02-09 23:15:11 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: wandoujia)
2020-02-09 23:15:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.1 (v3.8.1:1b293b6006, Dec 18 2019, 14:08:53) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.14.1-x86_64-i386-64bit
2020-02-09 23:15:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'wandoujia', 'COOKIES_ENABLED': False, 'NEWSPIDER_MODULE': 'wandoujia.spiders', 'SPIDER_MODULES': ['wandoujia.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2020-02-09 23:15:11 [scrapy.extensions.telnet] INFO: Telnet Password: 79f3a3cb43e725d1
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['proxyPool.scrapy.middlewares.RetryMiddleware',
 'proxyPool.scrapy.middlewares.ProxyMiddleware',
 'proxyPool.scrapy.middlewares.CatchExceptionMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'proxyPool.scrapy.RandomUserAgentMiddleware.RandomUserAgentMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'wandoujia.middlewares.WandoujiaDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-09 23:15:11 [scrapy.middleware] INFO: Enabled item pipelines:
['wandoujia.pipelines.MyFilesPipeline']
2020-02-09 23:15:11 [scrapy.core.engine] INFO: Spider opened
2020-02-09 23:15:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:15:11 [main] INFO: Spider opened: main
2020-02-09 23:15:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-09 23:15:11 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://123.149.136.121:9999 】 =====
2020-02-09 23:16:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:17:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:18:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:18:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wandoujia.com/apps/665777> (failed 1 times): User timeout caused connection failure: Getting https://www.wandoujia.com/apps/665777 took longer than 180.0 seconds..
2020-02-09 23:18:11 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://110.243.2.58:9999 】 =====
2020-02-09 23:19:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-09 23:19:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.wandoujia.com/apps/665777> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2020-02-09 23:19:27 [root] DEBUG: ===== ProxyMiddleware get a random_proxy:【 http://1.196.177.218:9999 】 =====
2020-02-09 23:19:27 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.wandoujia.com/apps/665777> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2020-02-09 23:19:27 [root] DEBUG: === success to update 1.196.177.218 proxy ===
2020-02-09 23:19:27 [root] DEBUG: === success to update 1.196.177.218 proxy ===
2020-02-09 23:19:27 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.wandoujia.com/apps/665777>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 61: Connection refused.
2020-02-09 23:19:27 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-09 23:19:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 1,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 1,
 'downloader/request_bytes': 918,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'elapsed_time_seconds': 256.041098,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 9, 15, 19, 27, 373921),
 'log_count/DEBUG': 8,
 'log_count/ERROR': 1,
 'log_count/INFO': 15,
 'memusage/max': 67170304,
 'memusage/startup': 66805760,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/twisted.internet.error.TCPTimedOutError': 1,
 'retry/reason_count/twisted.internet.error.TimeoutError': 1,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2020, 2, 9, 15, 15, 11, 332823)}
2020-02-09 23:19:27 [scrapy.core.engine] INFO: Spider closed (finished)
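Side note on the SyntaxWarning at the top of the log: proxyPool/ProxyPoolWorker.py line 81 compares a string against the literal '' with "is not", which checks object identity rather than value, and that is what Python 3.8 is warning about. A minimal sketch of the change (the is_usable wrapper is a hypothetical stand-in for whatever the surrounding code at line 81 actually does):

```python
def is_usable(proxy: str) -> bool:
    # was: return proxy is not ''  -> identity comparison, triggers the SyntaxWarning
    return proxy != ''             # value comparison; "return bool(proxy)" also works

print(is_usable("http://1.196.177.218:9999"))  # True
print(is_usable(""))                           # False
```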

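Separately from the proxy quality itself, each dead proxy currently stalls the spider for up to 180 seconds before RetryMiddleware moves on to the next one, which is why this run spent more than four minutes on three requests. Lowering the download timeout and allowing a few more retries makes bad proxies fail fast. A sketch of the relevant Scrapy settings, assuming the standard wandoujia/settings.py layout; the values are suggestions, not taken from the project:

```python
# wandoujia/settings.py (assumed path)
DOWNLOAD_TIMEOUT = 15   # Scrapy default is 180 s; a dead proxy should fail quickly
RETRY_ENABLED = True
RETRY_TIMES = 5         # a few more attempts so a working proxy gets picked eventually
```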
101142TS · Feb 09 '20 15:02