browsertrix-crawler
Overwrite flag results in no such files error
When adding the --overwrite flag to the command, you get the following error:
crawl --url https://ipa2-f.kbc.be/particulieren/nl.html --limit 1 --generateWACZ --text --headless true --collection AEM --overwrite
Set netIdleWait to 2 seconds
wb-manager init failed, collection likely already exists
Clearing /crawls/collections/AEM before starting
Storing state in memory
pages/pages.jsonl creation failed [Error: ENOENT: no such file or directory, mkdir '/crawls/collections/AEM/pages'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'mkdir',
  path: '/crawls/collections/AEM/pages'
}
pages/pages.jsonl append failed TypeError: Cannot read properties of null (reading 'writeFile')
    at Crawler.writePage (/app/crawler.js:977:26)
    at Crawler.crawlPage (/app/crawler.js:392:18)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async /app/node_modules/puppeteer-cluster/dist/util.js:63:24
    at async Object.timeoutExecute (/app/node_modules/puppeteer-cluster/dist/util.js:54:20)
    at async Worker.handle (/app/node_modules/puppeteer-cluster/dist/Worker.js:48:22)
    at async Cluster.doWork (/app/node_modules/puppeteer-cluster/dist/Cluster.js:250:24)
I think this happens because the collection directory is not created when the --overwrite flag is set, but I'm not sure.
Thanks @DriesVanbilloen! It looks like the issue is that --overwrite deletes the collection after wb-manager init has already run, so the collection is never re-created. I will submit a PR with a fix shortly.
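For anyone following along, here is a minimal sketch of the ordering problem described above. It is not the crawler's actual code: initCollection and clearCollection are hypothetical stand-ins for the wb-manager init step and the --overwrite cleanup, and the whole thing runs in a temporary directory. It only illustrates why clearing after init leaves no parent directory for pages/, while clearing before init does.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical stand-in for `wb-manager init`: creates the collection directory.
function initCollection(collDir) {
  fs.mkdirSync(collDir, { recursive: true });
}

// Hypothetical stand-in for the --overwrite cleanup: removes the collection directory.
function clearCollection(collDir) {
  fs.rmSync(collDir, { recursive: true, force: true });
}

// Non-recursive mkdir, so a missing parent directory raises ENOENT as in the log.
function createPagesDir(collDir) {
  try {
    fs.mkdirSync(path.join(collDir, "pages"));
    return "ok";
  } catch (e) {
    return e.code;
  }
}

// Runs the given steps in order against a collection left over from an earlier crawl.
function run(steps) {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "overwrite-demo-"));
  const collDir = path.join(root, "collections", "AEM");
  initCollection(collDir); // simulate the pre-existing collection
  for (const step of steps) step(collDir);
  const result = createPagesDir(collDir);
  fs.rmSync(root, { recursive: true, force: true }); // clean up the temp dir
  return result;
}

const buggy = run([initCollection, clearCollection]); // init first, then clear
const fixed = run([clearCollection, initCollection]); // clear first, then init
console.log({ buggy, fixed });
```

With the init-then-clear order the pages directory cannot be created because its parent is gone; with clear-then-init it can. Whether the eventual fix reorders the steps or re-runs the init after clearing is up to the PR.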