WarcMiddleware

failed to crawl

Open · sergeospb opened this issue 11 years ago · 1 comment

python crawler.py --mirror --url "http://google.com/"
2013-03-23 09:26:46+0400 [scrapy] INFO: Scrapy 0.17.0 started (bot: crawltest)
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'crawltest.spiders', 'SPIDER_MODULES': ['crawltest.spiders'], 'DOWNLOADER_HTTPCLIENTFACTORY': 'warcclientfactory.WarcHTTPClientFactory', 'BOT_NAME': 'crawltest'}
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-23 09:26:46+0400 [simplespider] DEBUG: Using accept_netlocs: ['google.com']
2013-03-23 09:26:46+0400 [simplespider] DEBUG: Crawling start_urls: ['http://google.com/']
2013-03-23 09:26:46+0400 [simplespider] INFO: Spider opened
2013-03-23 09:26:46+0400 [simplespider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-23 09:26:46+0400 [Uninitialized] Unhandled Error
    Traceback (most recent call last):
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/log.py", line 88, in callWithLogger
        return callWithContext({"system": lp}, func, *args, **kw)
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/log.py", line 73, in callWithContext
        return context.call({ILogContext: newCtx}, func, *args, **kw)
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/context.py", line 118, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/context.py", line 81, in callWithContext
        return func(*args,**kw)
    --- <exception caught here> ---
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
        why = selectable.doWrite()
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/tcp.py", line 593, in doConnect
        self._connectDone()
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/tcp.py", line 612, in _connectDone
        self.protocol.makeConnection(self)
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/protocol.py", line 460, in makeConnection
        self.connectionMade()
      File "/home/user/dist/WarcMiddleware/warcclientfactory.py", line 70, in connectionMade
        ScrapyHTTPPageGetter.connectionMade(self)
      File "/home/user/ENV/local/lib/python2.7/site-packages/Scrapy-0.17.0-py2.7.egg/scrapy/core/downloader/webclient.py", line 39, in connectionMade
        self.sendCommand(self.factory.method, self.factory.path)
      File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/web/http.py", line 401, in sendCommand
        self.transport.writeSequence([command, b' ', path, b' HTTP/1.0\r\n'])
    exceptions.AttributeError: StringIO instance has no attribute 'writeSequence'

2013-03-23 09:26:46+0400 [simplespider] ERROR: Error downloading <GET http://google.com/>: StringIO instance has no attribute 'writeSequence'
2013-03-23 09:26:46+0400 [simplespider] INFO: Closing spider (finished)
2013-03-23 09:26:46+0400 [simplespider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/exceptions.AttributeError': 1,
     'downloader/request_bytes': 216,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 3, 23, 5, 26, 46, 379307),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 3, 23, 5, 26, 46, 318817)}
2013-03-23 09:26:46+0400 [simplespider] INFO: Spider closed (finished)

sergeospb, Mar 23 '13 05:03

Both Twisted and Scrapy have changed their APIs, and the changes broke WarcMiddleware. Until WarcMiddleware is updated, please use Twisted 12.2.0 and Scrapy 0.16.3. I added installation instructions to the repository root as INSTALL.md.
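For anyone wondering what the AttributeError in the log means: the traceback shows Twisted's `sendCommand` calling `writeSequence` on the transport, and the object standing in for the transport was a `StringIO`, which has no such method. The sketch below reproduces that failure in isolation. It is only an illustration, not WarcMiddleware's code or its eventual fix: `SequenceWritableBuffer` and `send_command` are hypothetical names, and it uses Python 3's `io.BytesIO` in place of the Python 2 `StringIO` from the report.

```python
import io


class SequenceWritableBuffer(io.BytesIO):
    """Hypothetical stand-in transport: an in-memory buffer with writeSequence()."""

    def writeSequence(self, chunks):
        # Write each chunk of the iterable to the buffer, in order.
        for chunk in chunks:
            self.write(chunk)


def send_command(transport, command, path):
    # Same call shape as the last frame of the traceback (http.py, sendCommand).
    transport.writeSequence([command, b" ", path, b" HTTP/1.0\r\n"])


# A plain in-memory buffer fails the same way the crawl did.
plain_buffer_error = None
try:
    send_command(io.BytesIO(), b"GET", b"/")
except AttributeError as exc:
    plain_buffer_error = str(exc)

# A buffer that provides writeSequence() accepts the call and records the request line.
buf = SequenceWritableBuffer()
send_command(buf, b"GET", b"/")
request_line = buf.getvalue()
```

Pinning the older versions avoids this code path entirely, which is why the downgrade is the suggested workaround for now.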

Since you are on Linux/OS X, try running:

pip install twisted==12.2.0
pip install scrapy==0.16.3

This should switch your packages to versions that work with WarcMiddleware.
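If you want to confirm the downgrade took effect, a small check along these lines may help. It is a rough sketch, not something shipped with WarcMiddleware: `PINNED`, `installed_version`, and `find_mismatches` are names made up for this example, and the lookup falls back to `pkg_resources` (from setuptools) on the assumption that the environment is Python 2.7, where `importlib.metadata` does not exist.

```python
# Versions recommended in this thread.
PINNED = {"Twisted": "12.2.0", "Scrapy": "0.16.3"}


def installed_version(name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        import pkg_resources  # setuptools fallback, e.g. on Python 2.7
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (found, wanted)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (found, wanted) in find_mismatches(PINNED).items():
        print("%s: found %s, expected %s" % (name, found, wanted))
```

No output means both packages match the pinned versions; a `found None` line means the package is not installed in the active environment.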

odie5533, Oct 23 '13 20:10