Unable to resume crawl: not valid JSON
When resuming a crawl, I noticed that passing --url seemed to be necessary, which seemed counterintuitive. Once I did this, however, I got a JSON parse error:
docker run -v $PWD/crawls:/crawls/ -p 9037:9037 webrecorder/browsertrix-crawler crawl \
--url http://www.buffon.cnrs.fr/ \
--config /crawls/collections/buffon/crawls/crawl-20240510011214-be3552cdb6da.yaml
{"timestamp":"2024-05-10T01:25:52.694Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.1.2 (with warcio.js 2.2.1)","details":{}}
{"timestamp":"2024-05-10T01:25:52.696Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"http://www.buffon.cnrs.fr/","scopeType":"prefix","include":["/^https?:\\/\\/www\\.buffon\\.cnrs\\.fr\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"maxExtraHops":0,"maxDepth":1000000}]}
{"timestamp":"2024-05-10T01:25:53.334Z","logLevel":"error","context":"general","message":"Crawl failed","details":{"type":"exception","message":"\"[object Object]\" is not valid JSON","stack":"SyntaxError: \"[object Object]\" is not valid JSON\n at JSON.parse (<anonymous>)\n at RedisCrawlState.load (file:///app/dist/util/state.js:451:31)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async Crawler.initCrawlState (file:///app/dist/crawler.js:218:13)\n at async Crawler.crawl (file:///app/dist/crawler.js:797:9)\n at async Crawler.run (file:///app/dist/crawler.js:342:13)"}}
{"timestamp":"2024-05-10T01:25:53.335Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: failing","details":{}}
I've attached my crawl state YAML here, in case it's helpful:
The JSON parse error is indeed a bug, to be fixed with #576.
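For anyone hitting the same message: the string "[object Object]" is what JavaScript produces when a plain object is coerced to a string, so this error usually means a value that was already an object got handed to JSON.parse. Below is a minimal sketch that reproduces that class of error. The `loadState` and `reproduceError` helpers and the state shape are hypothetical, written only for illustration; they are not the crawler's actual code, and the real fix may look different.

```typescript
// Hypothetical helper, not the crawler's real implementation: accepts saved
// state that may be either a JSON string or an already-parsed object.
function loadState(raw: unknown): unknown {
  return typeof raw === "string" ? JSON.parse(raw) : raw;
}

// Reproduce the class of error seen in the log: JSON.parse coerces a
// non-string argument to a string, so a plain object becomes
// "[object Object]" before parsing and a SyntaxError is thrown.
function reproduceError(): string {
  const alreadyParsed = { queued: [], done: 0 }; // assumed state shape
  try {
    JSON.parse(alreadyParsed as unknown as string);
    return "no error";
  } catch (e) {
    return (e as Error).message;
  }
}
```

The exact wording of the error message depends on the Node.js version, so the sketch only relies on an error being thrown.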
Passing the --url flag is needed because it was originally passed on the command line. CLI arguments are not included in the saved state, as it is assumed they will be passed in again on the command line when resuming.
If the original call provides a --config that sets url/seeds and other settings, then the saved config also includes those settings.
It was done this way to make it simpler to add a saved file without changing other arguments, but perhaps we should reconsider this, since CLI args override the config anyway.
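The behaviour described above can be sketched roughly as follows. This is only an illustration of the precedence and save rules, under the assumption that arguments are simple key/value maps; `CrawlArgs`, `resolveArgs`, and `savedConfig` are hypothetical names, not functions from the crawler.

```typescript
// Assumed shape: a flat map of option names to values.
type CrawlArgs = Record<string, unknown>;

// Later sources win: values given on the CLI override values from --config.
function resolveArgs(config: CrawlArgs, cli: CrawlArgs): CrawlArgs {
  return { ...config, ...cli };
}

// Hypothetical save step: persist only what came from the config file,
// on the assumption that CLI arguments will be supplied again on resume.
function savedConfig(config: CrawlArgs, _cli: CrawlArgs): CrawlArgs {
  return { ...config };
}

// Original run: url given only on the CLI, workers given only in the config.
const originalConfig: CrawlArgs = { workers: 2 };
const originalCli: CrawlArgs = { url: "http://www.buffon.cnrs.fr/" };

// The saved file has no url, so a resume needs --url passed again.
const saved = savedConfig(originalConfig, originalCli);
const resumed = resolveArgs(saved, { url: "http://www.buffon.cnrs.fr/" });
```

In this sketch, `saved` contains `workers` but no `url`, which is why the resume command in this issue had to repeat --url.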
Thanks, this makes more sense now!