
silently aborts before scraping all posts

Open netllama opened this issue 5 years ago • 3 comments

This is a great tool, but it appears to silently abort long before scraping all posts. I'm attempting to scrape a site with over 20,000 posts, but every time I run the tool it gives up after roughly 2,000 posts, without reporting any errors:

$ scrapy crawl phpBB -L INFO -t json -o data0.json
2020-09-12 20:27:40 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: phpBB_scraper)
2020-09-12 20:27:40 [scrapy.utils.log] INFO: Versions: lxml 4.4.0.0, libxml2 2.9.10, cssselect 0.9.2, parsel 1.5.0, w3lib 1.17.0, Twisted 19.2.1, Python 3.7.9 (default, Aug 19 2020, 17:05:11) - [GCC 9.3.1 20200408 (Red Hat 9.3.1-2)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1g FIPS  21 Apr 2020), cryptography 2.6.1, Platform Linux-5.7.9-100.fc31.x86_64-x86_64-with-fedora-31-Thirty_One
2020-09-12 20:27:40 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'phpBB_scraper', 'DOWNLOAD_DELAY': 3.0, 'FEED_FORMAT': 'json', 'FEED_URI': 'data0.json', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'phpBB_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['phpBB_scraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.888'}
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-12 20:27:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-12 20:27:40 [scrapy.core.engine] INFO: Spider opened
2020-09-12 20:27:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-12 20:28:40 [scrapy.extensions.logstats] INFO: Crawled 16 pages (at 16 pages/min), scraped 0 items (at 0 items/min)
2020-09-12 20:29:40 [scrapy.extensions.logstats] INFO: Crawled 32 pages (at 16 pages/min), scraped 128 items (at 128 items/min)
2020-09-12 20:30:40 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 16 pages/min), scraped 266 items (at 138 items/min)
2020-09-12 20:31:40 [scrapy.extensions.logstats] INFO: Crawled 64 pages (at 16 pages/min), scraped 379 items (at 113 items/min)
2020-09-12 20:32:40 [scrapy.extensions.logstats] INFO: Crawled 80 pages (at 16 pages/min), scraped 500 items (at 121 items/min)
2020-09-12 20:33:40 [scrapy.extensions.logstats] INFO: Crawled 96 pages (at 16 pages/min), scraped 613 items (at 113 items/min)
2020-09-12 20:34:40 [scrapy.extensions.logstats] INFO: Crawled 114 pages (at 18 pages/min), scraped 700 items (at 87 items/min)
2020-09-12 20:35:40 [scrapy.extensions.logstats] INFO: Crawled 130 pages (at 16 pages/min), scraped 766 items (at 66 items/min)
2020-09-12 20:36:40 [scrapy.extensions.logstats] INFO: Crawled 146 pages (at 16 pages/min), scraped 825 items (at 59 items/min)
2020-09-12 20:37:40 [scrapy.extensions.logstats] INFO: Crawled 163 pages (at 17 pages/min), scraped 898 items (at 73 items/min)
2020-09-12 20:38:40 [scrapy.extensions.logstats] INFO: Crawled 178 pages (at 15 pages/min), scraped 945 items (at 47 items/min)
2020-09-12 20:39:40 [scrapy.extensions.logstats] INFO: Crawled 195 pages (at 17 pages/min), scraped 1021 items (at 76 items/min)
2020-09-12 20:40:40 [scrapy.extensions.logstats] INFO: Crawled 212 pages (at 17 pages/min), scraped 1157 items (at 136 items/min)
2020-09-12 20:41:40 [scrapy.extensions.logstats] INFO: Crawled 230 pages (at 18 pages/min), scraped 1332 items (at 175 items/min)
2020-09-12 20:42:40 [scrapy.extensions.logstats] INFO: Crawled 246 pages (at 16 pages/min), scraped 1446 items (at 114 items/min)
2020-09-12 20:43:40 [scrapy.extensions.logstats] INFO: Crawled 262 pages (at 16 pages/min), scraped 1552 items (at 106 items/min)
2020-09-12 20:44:40 [scrapy.extensions.logstats] INFO: Crawled 279 pages (at 17 pages/min), scraped 1657 items (at 105 items/min)
2020-09-12 20:45:40 [scrapy.extensions.logstats] INFO: Crawled 296 pages (at 17 pages/min), scraped 1768 items (at 111 items/min)
2020-09-12 20:46:40 [scrapy.extensions.logstats] INFO: Crawled 312 pages (at 16 pages/min), scraped 1863 items (at 95 items/min)
2020-09-12 20:47:40 [scrapy.extensions.logstats] INFO: Crawled 331 pages (at 19 pages/min), scraped 1947 items (at 84 items/min)
2020-09-12 20:48:40 [scrapy.extensions.logstats] INFO: Crawled 347 pages (at 16 pages/min), scraped 2015 items (at 68 items/min)
2020-09-12 20:49:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-12 20:49:30 [scrapy.extensions.feedexport] INFO: Stored json feed (2095 items) in: data0.json
2020-09-12 20:49:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 191249,
 'downloader/request_count': 361,
 'downloader/request_method_count/GET': 361,
 'downloader/response_bytes': 2658616,
 'downloader/response_count': 361,
 'downloader/response_status_count/200': 360,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 9, 13, 3, 49, 30, 639237),
 'item_scraped_count': 2095,
 'log_count/INFO': 29,
 'memusage/max': 76652544,
 'memusage/startup': 60194816,
 'request_depth_max': 2,
 'response_received_count': 360,
 'scheduler/dequeued': 360,
 'scheduler/dequeued/memory': 360,
 'scheduler/enqueued': 360,
 'scheduler/enqueued/memory': 360,
 'start_time': datetime.datetime(2020, 9, 13, 3, 27, 40, 228263)}
2020-09-12 20:49:30 [scrapy.core.engine] INFO: Spider closed (finished)
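One detail worth noting in the stats above: `finish_reason: 'finished'` and `request_depth_max: 2` suggest the spider didn't crash or get killed, it simply ran out of queued requests, which is consistent with pagination links not being discovered past a certain point. phpBB paginates with a `start` offset query parameter, so one way to rule out missed "next" links is to generate every page URL up front. This is only a sketch; the URL pattern and posts-per-page value are assumptions about the target board, not part of this project's spider:

```python
# phpBB paginates topic/forum views with a "start" offset parameter
# (e.g. viewtopic.php?t=123&start=25). If a spider only follows the
# visible "next" links, one missed link silently truncates the crawl.
# Seeding every page URL explicitly avoids that failure mode.

def phpbb_page_urls(base_url, total_posts, posts_per_page=25):
    """Yield one URL per page of a topic, using phpBB's start offset."""
    sep = "&" if "?" in base_url else "?"
    for start in range(0, total_posts, posts_per_page):
        yield f"{base_url}{sep}start={start}"

# hypothetical forum URL and counts, for illustration only
pages = list(phpbb_page_urls("https://forum.example/viewtopic.php?t=123",
                             total_posts=60, posts_per_page=25))
# 60 posts at 25 per page -> offsets 0, 25, 50
```

These URLs could then be fed to the spider as `start_urls` (or yielded as requests) so coverage no longer depends on link discovery.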

netllama · Sep 13 '20

@netllama Are you running this on a local machine or from a remote server? I ask because I hit this issue when running a crawl on an AWS EC2 instance, but it went away when I ran the crawl locally, and I was able to scrape the full result set.

Dascienz · Oct 01 '20

I was running locally against a remote server (which was not in AWS).
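Since the behavior differs between local and remote runs, rate limiting or silent truncation by the target server is worth ruling out. One way is to slow the crawl down and retry more aggressively via Scrapy's built-in throttle and retry settings. The setting names below are standard Scrapy settings; the values are illustrative, not this project's actual configuration:

```python
# settings.py -- illustrative values for diagnosing a silently
# truncated crawl; tune to the target server's tolerance.
AUTOTHROTTLE_ENABLED = True      # back off based on observed server latency
AUTOTHROTTLE_START_DELAY = 3.0
AUTOTHROTTLE_MAX_DELAY = 30.0
RETRY_ENABLED = True
RETRY_TIMES = 5                  # retry transient failures a few extra times
DOWNLOAD_DELAY = 3.0             # matches the delay visible in the log above
```

Re-running with `-L DEBUG` alongside these settings would also surface any filtered or dropped requests that the INFO-level log hides.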


netllama · Oct 01 '20

Did you ever find a solution? I can't get any items at all.

NoobBugHunter · Dec 10 '20