browsertrix-crawler
Created Invalid WARC record
The following run completed, reaching the size threshold.
Running browsertrix-crawler crawl: crawl --waitUntil load --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --sizeLimit 4294967296 --diskUtilization 90 --timeLimit 7200 --url https://www.wikidata.org/wiki/Q4414 --userAgentSuffix [email protected] --cwd /output/.tmp0lehhxlr --statsFilename /output/crawl.json
{"logLevel":"info","timestamp":"2023-05-22T06:50:10.260Z","context":"general","message":"Browsertrix-Crawler 0.10.0-beta.0 (with warcio.js 1.6.2 pywb 2.7.3)","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T06:50:10.264Z","context":"general","message":"Seeds","details":[{"url":"https://www.wikidata.org/wiki/Q4414","include":["/^https?:\\/\\/www\\.wikidata\\.org\\/wiki\\//"],"exclude":[],"scopeType":"prefix","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
...
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.817Z","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.wikidata.org/wiki/Q63144794","workerid":0}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.821Z","context":"general","message":"Size threshold reached 4302013563 >= 4294967296, stopping","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.844Z","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:00.844Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.484Z","context":"general","message":"Saving crawl state to: /output/.tmp0lehhxlr/collections/crawl-20230522065007735/crawls/crawl-20230522081200-177caf49d5fa.yaml","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.710Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":977,"total":22667,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.712Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-05-22T08:12:01.754Z","context":"general","message":"Crawl status: interrupted","details":{}}
For some reason, some of the produced WARC files are invalid (not readable via warcio):
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/warcio/recordloader.py", line 224, in _detect_type_load_headers
rec_headers = self.warc_parser.parse(stream, statusline)
File "/usr/local/lib/python3.10/dist-packages/warcio/statusandheaders.py", line 270, in parse
raise StatusAndHeadersParserException(msg, full_statusline)
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: ÄiPQFNÕ¾À¨|%MÑ5"¥Ø%ÍKÏä
This occurred with 0.10.0-beta.0.
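To triage which files are affected before re-running warcio, a quick stdlib-only probe can check whether each file (or its gzip-decompressed start) begins with a WARC version line — the same check that raised the StatusAndHeadersParserException above when it found binary garbage instead. This is a minimal sketch, not a full WARC validator; the function name and sample data are illustrative, and a complete per-record check should still use warcio's ArchiveIterator.

```python
import gzip


def warc_header_ok(path):
    """Cheap sanity probe: a WARC file (optionally gzipped) must start
    with a version line such as b"WARC/1.0" or b"WARC/1.1".

    This only inspects the first record's first bytes; a file can pass
    this probe and still contain a corrupt record later on.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
        f.seek(0)
        if magic == b"\x1f\x8b":  # gzip magic number
            head = gzip.open(f).read(9)
        else:
            head = f.read(9)
    return head.startswith(b"WARC/")
```

Running this over the collection directory narrows down which WARCs to inspect more closely (e.g. with `warcio`'s record iteration) or to set aside.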
This should no longer happen in 1.1.x and later: we added checks to ensure that all WARC records are fully written before the crawler shuts down, as long as the shutdown is graceful. Feel free to reopen if you encounter this again!