WarcMiddleware
failed to crawl
python crawler.py --mirror --url "http://google.com/"
2013-03-23 09:26:46+0400 [scrapy] INFO: Scrapy 0.17.0 started (bot: crawltest)
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'crawltest.spiders', 'SPIDER_MODULES': ['crawltest.spiders'], 'DOWNLOADER_HTTPCLIENTFACTORY': 'warcclientfactory.WarcHTTPClientFactory', 'BOT_NAME': 'crawltest'}
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-23 09:26:46+0400 [simplespider] DEBUG: Using accept_netlocs: ['google.com']
2013-03-23 09:26:46+0400 [simplespider] DEBUG: Crawling start_urls: ['http://google.com/']
2013-03-23 09:26:46+0400 [simplespider] INFO: Spider opened
2013-03-23 09:26:46+0400 [simplespider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-23 09:26:46+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-23 09:26:46+0400 [Uninitialized] Unhandled Error
Traceback (most recent call last):
  File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/home/user/ENV/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
---
2013-03-23 09:26:46+0400 [simplespider] ERROR: Error downloading <GET http://google.com/>: StringIO instance has no attribute 'writeSequence'
2013-03-23 09:26:46+0400 [simplespider] INFO: Closing spider (finished)
2013-03-23 09:26:46+0400 [simplespider] INFO: Dumping Scrapy stats: {'downloader/exception_count': 1, 'downloader/exception_type_count/exceptions.AttributeError': 1, 'downloader/request_bytes': 216, 'downloader/request_count': 1, 'downloader/request_method_count/GET': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2013, 3, 23, 5, 26, 46, 379307), 'log_count/DEBUG': 9, 'log_count/ERROR': 2, 'log_count/INFO': 4, 'scheduler/dequeued': 1, 'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1, 'scheduler/enqueued/memory': 1, 'start_time': datetime.datetime(2013, 3, 23, 5, 26, 46, 318817)}
2013-03-23 09:26:46+0400 [simplespider] INFO: Spider closed (finished)
Both Twisted and Scrapy have changed their APIs, and the changes broke WarcMiddleware. Until it is updated, please use Twisted-12.2.0 and Scrapy-0.16.3. I added installation instructions to the root directory as INSTALL.md.
Since you are on Linux/OS X, try running:
pip install twisted==12.2.0
pip install scrapy==0.16.3
This should switch your packages to versions that work with WarcMiddleware.
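To confirm the downgrade took effect, a small check like the one below may help. It is only a sketch: the pinned versions come from the reply above, while the helper names `installed_version` and `find_mismatches` are made up for illustration and are not part of WarcMiddleware, Scrapy, or Twisted. It tries `importlib.metadata` first and falls back to `pkg_resources`, which is assumed to be available on older interpreters such as the Python 2.7 environment in the log.

```python
# Pins taken from the reply above; the helper names are hypothetical.
PINS = {"Twisted": "12.2.0", "Scrapy": "0.16.3"}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        # Standard library on Python 3.8 and later.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for older interpreters (assumes setuptools is installed).
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(installed, pins=PINS):
    """Return {name: (installed, wanted)} for every pin that is not met."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    current = {name: installed_version(name) for name in PINS}
    for name, (have, want) in sorted(find_mismatches(current).items()):
        print("%s: installed %s, expected %s" % (name, have, want))
```

If the script prints nothing, both packages match the pinned versions; any line it prints names a package that still needs the `pip install` command above.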