silently aborts before scraping all posts
This is a great tool, but it appears to abort silently long before scraping all posts. I'm attempting to scrape a site with over 20,000 posts, but every run gives up after roughly 2,000 posts with no errors:
$ scrapy crawl phpBB -L INFO -t json -o data0.json
2020-09-12 20:27:40 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: phpBB_scraper)
2020-09-12 20:27:40 [scrapy.utils.log] INFO: Versions: lxml 4.4.0.0, libxml2 2.9.10, cssselect 0.9.2, parsel 1.5.0, w3lib 1.17.0, Twisted 19.2.1, Python 3.7.9 (default, Aug 19 2020, 17:05:11) - [GCC 9.3.1 20200408 (Red Hat 9.3.1-2)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1g FIPS 21 Apr 2020), cryptography 2.6.1, Platform Linux-5.7.9-100.fc31.x86_64-x86_64-with-fedora-31-Thirty_One
2020-09-12 20:27:40 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'phpBB_scraper', 'DOWNLOAD_DELAY': 3.0, 'FEED_FORMAT': 'json', 'FEED_URI': 'data0.json', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'phpBB_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['phpBB_scraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.888'}
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-12 20:27:40 [scrapy.core.engine] INFO: Spider opened
2020-09-12 20:27:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-12 20:28:40 [scrapy.extensions.logstats] INFO: Crawled 16 pages (at 16 pages/min), scraped 0 items (at 0 items/min)
2020-09-12 20:29:40 [scrapy.extensions.logstats] INFO: Crawled 32 pages (at 16 pages/min), scraped 128 items (at 128 items/min)
2020-09-12 20:30:40 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 16 pages/min), scraped 266 items (at 138 items/min)
2020-09-12 20:31:40 [scrapy.extensions.logstats] INFO: Crawled 64 pages (at 16 pages/min), scraped 379 items (at 113 items/min)
2020-09-12 20:32:40 [scrapy.extensions.logstats] INFO: Crawled 80 pages (at 16 pages/min), scraped 500 items (at 121 items/min)
2020-09-12 20:33:40 [scrapy.extensions.logstats] INFO: Crawled 96 pages (at 16 pages/min), scraped 613 items (at 113 items/min)
2020-09-12 20:34:40 [scrapy.extensions.logstats] INFO: Crawled 114 pages (at 18 pages/min), scraped 700 items (at 87 items/min)
2020-09-12 20:35:40 [scrapy.extensions.logstats] INFO: Crawled 130 pages (at 16 pages/min), scraped 766 items (at 66 items/min)
2020-09-12 20:36:40 [scrapy.extensions.logstats] INFO: Crawled 146 pages (at 16 pages/min), scraped 825 items (at 59 items/min)
2020-09-12 20:37:40 [scrapy.extensions.logstats] INFO: Crawled 163 pages (at 17 pages/min), scraped 898 items (at 73 items/min)
2020-09-12 20:38:40 [scrapy.extensions.logstats] INFO: Crawled 178 pages (at 15 pages/min), scraped 945 items (at 47 items/min)
2020-09-12 20:39:40 [scrapy.extensions.logstats] INFO: Crawled 195 pages (at 17 pages/min), scraped 1021 items (at 76 items/min)
2020-09-12 20:40:40 [scrapy.extensions.logstats] INFO: Crawled 212 pages (at 17 pages/min), scraped 1157 items (at 136 items/min)
2020-09-12 20:41:40 [scrapy.extensions.logstats] INFO: Crawled 230 pages (at 18 pages/min), scraped 1332 items (at 175 items/min)
2020-09-12 20:42:40 [scrapy.extensions.logstats] INFO: Crawled 246 pages (at 16 pages/min), scraped 1446 items (at 114 items/min)
2020-09-12 20:43:40 [scrapy.extensions.logstats] INFO: Crawled 262 pages (at 16 pages/min), scraped 1552 items (at 106 items/min)
2020-09-12 20:44:40 [scrapy.extensions.logstats] INFO: Crawled 279 pages (at 17 pages/min), scraped 1657 items (at 105 items/min)
2020-09-12 20:45:40 [scrapy.extensions.logstats] INFO: Crawled 296 pages (at 17 pages/min), scraped 1768 items (at 111 items/min)
2020-09-12 20:46:40 [scrapy.extensions.logstats] INFO: Crawled 312 pages (at 16 pages/min), scraped 1863 items (at 95 items/min)
2020-09-12 20:47:40 [scrapy.extensions.logstats] INFO: Crawled 331 pages (at 19 pages/min), scraped 1947 items (at 84 items/min)
2020-09-12 20:48:40 [scrapy.extensions.logstats] INFO: Crawled 347 pages (at 16 pages/min), scraped 2015 items (at 68 items/min)
2020-09-12 20:49:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-12 20:49:30 [scrapy.extensions.feedexport] INFO: Stored json feed (2095 items) in: data0.json
2020-09-12 20:49:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 191249,
'downloader/request_count': 361,
'downloader/request_method_count/GET': 361,
'downloader/response_bytes': 2658616,
'downloader/response_count': 361,
'downloader/response_status_count/200': 360,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 13, 3, 49, 30, 639237),
'item_scraped_count': 2095,
'log_count/INFO': 29,
'memusage/max': 76652544,
'memusage/startup': 60194816,
'request_depth_max': 2,
'response_received_count': 360,
'scheduler/dequeued': 360,
'scheduler/dequeued/memory': 360,
'scheduler/enqueued': 360,
'scheduler/enqueued/memory': 360,
'start_time': datetime.datetime(2020, 9, 13, 3, 27, 40, 228263)}
2020-09-12 20:49:30 [scrapy.core.engine] INFO: Spider closed (finished)
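For what it's worth, the stats above show finish_reason: 'finished' and request_depth_max: 2, i.e. Scrapy believes it simply ran out of scheduled requests rather than being shut down by an extension such as CloseSpider or MemoryUsage. One thing worth checking is whether the pagination requests beyond a certain point are being silently dropped, for example by the duplicate-request filter or the offsite middleware; those drops are only visible at DEBUG log level. A minimal sketch of settings for a diagnostic run, assuming the project's standard settings.py (LOG_LEVEL, DUPEFILTER_DEBUG and DEPTH_LIMIT are stock Scrapy settings; the values below are only a suggestion, not part of this project):

# settings.py -- temporary additions for a diagnostic run (suggestion, not project code)
LOG_LEVEL = 'DEBUG'        # surface "Filtered duplicate request" / "Filtered offsite request"
                           # messages that are hidden at INFO level
DUPEFILTER_DEBUG = True    # log every dropped duplicate request, not just the first one
DEPTH_LIMIT = 0            # 0 = no depth limit (the Scrapy default); set explicitly to rule it out

Note that the -L INFO flag on the command line overrides LOG_LEVEL from settings.py, so either drop it or pass -L DEBUG when re-running, then grep the output for "Filtered" to see which requests never reached the scheduler.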
@netllama Are you running this on a local machine or from a remote server? I ask because I ran into this issue running a crawl on an AWS EC2 instance, but the issue resolved itself when I ran the crawl locally and I was able to finish scraping the full result set.
I was running locally against a remote server (which was not in AWS).
Did you have any success with this? I can't get any items at all.