
Unable to resume crawl: not valid JSON

edsu opened this issue on May 10, 2024 · 2 comments

When resuming a crawl I noticed that passing --url seemed necessary, which seemed counterintuitive. Once I did that, however, I got a JSON parse error:

docker run -v $PWD/crawls:/crawls/  -p 9037:9037 webrecorder/browsertrix-crawler  crawl \
  --url http://www.buffon.cnrs.fr/  \
  --config /crawls/collections/buffon/crawls/crawl-20240510011214-be3552cdb6da.yaml

{"timestamp":"2024-05-10T01:25:52.694Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.2 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-05-10T01:25:52.696Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"http://www.buffon.cnrs.fr/","scopeType":"prefix","include":["/^https?:\\/\\/www\\.buffon\\.cnrs\\.fr\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":1000000}]}
{"timestamp":"2024-05-10T01:25:53.334Z","logLevel":"error","context":"general","message":"Crawl failed","details":{"type":"exception","message":"\"[object Object]\" is not valid JSON","stack":"SyntaxError: \"[object Object]\" is not valid JSON\n    at JSON.parse (<anonymous>)\n    at RedisCrawlState.load (file:///app/dist/util/state.js:451:31)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Crawler.initCrawlState (file:///app/dist/crawler.js:218:13)\n    at async Crawler.crawl (file:///app/dist/crawler.js:797:9)\n    at async Crawler.run (file:///app/dist/crawler.js:342:13)"}}
{"timestamp":"2024-05-10T01:25:53.335Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: failing","details":{}}

I've attached my crawl state YAML here, in case it's helpful:

crawl-20240510011214-be3552cdb6da.yaml.gz

edsu · May 10, 2024

The JSON parse error is indeed a bug, to be fixed with #576.
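For reference, this is the general failure mode JSON.parse produces when an object reaches it through string coercion instead of JSON.stringify. A minimal sketch of the error class, not the actual browsertrix-crawler code:

// Correct round-trip: serialize first, then parse.
const state = { url: "http://www.buffon.cnrs.fr/" };
JSON.parse(JSON.stringify(state)); // ok

// Buggy path: string-coercing an object yields "[object Object]",
// which is not parseable. On recent Node versions the error is exactly:
// SyntaxError: "[object Object]" is not valid JSON
JSON.parse(String(state)); // throws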

Passing the --url flag is needed because it was passed on the command line originally. The way this works is that CLI arguments are not included in the saved state; it is assumed they will be passed again on the command line when resuming.

If the original call provides a --config file that sets the URL/seeds and other settings, then the saved state also includes those settings. It was done this way to make it simpler to add a saved state file without changing other arguments, but perhaps we should reconsider this, since CLI arguments override the config anyway.
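To illustrate that precedence, here is a minimal sketch using yargs and js-yaml; this is an assumption about the general pattern, not browsertrix-crawler's actual argument handling. Values are loaded from the YAML config first, then any flags given explicitly on the command line override them:

import yargs from "yargs";
import { hideBin } from "yargs/helpers";
import { readFileSync } from "fs";
import { load } from "js-yaml";

const argv = yargs(hideBin(process.argv))
  .option("url", { type: "string" })
  // Load options from a YAML file passed via --config; yargs gives
  // explicitly-passed CLI flags precedence over config-file values.
  .config("config", (path) =>
    load(readFileSync(path, "utf8")) as Record<string, unknown>
  )
  .parseSync();

// If both the config file and the command line set url, the CLI wins.
console.log(argv.url);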

ikreymer · May 17, 2024

Thanks, this makes more sense now!

edsu · May 17, 2024