Overwrite flag results in no such files error

Open • DriesVanbilloen opened this issue 3 years ago • 1 comment

When adding the --overwrite flag to the command, you get the following error:

crawl --url https://ipa2-f.kbc.be/particulieren/nl.html --limit 1 --generateWACZ --text --headless true --collection AEM --overwrite
Set netIdleWait to 2 seconds
wb-manager init failed, collection likely already exists
Clearing /crawls/collections/AEM before starting
Storing state in memory
pages/pages.jsonl creation failed [Error: ENOENT: no such file or directory, mkdir '/crawls/collections/AEM/pages'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'mkdir',
  path: '/crawls/collections/AEM/pages'
}
pages/pages.jsonl append failed TypeError: Cannot read properties of null (reading 'writeFile')
    at Crawler.writePage (/app/crawler.js:977:26)
    at Crawler.crawlPage (/app/crawler.js:392:18)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async /app/node_modules/puppeteer-cluster/dist/util.js:63:24
    at async Object.timeoutExecute (/app/node_modules/puppeteer-cluster/dist/util.js:54:20)
    at async Worker.handle (/app/node_modules/puppeteer-cluster/dist/Worker.js:48:22)
    at async Cluster.doWork (/app/node_modules/puppeteer-cluster/dist/Cluster.js:250:24)

I think it's because the collection directory is not re-created after it is cleared when the --overwrite flag is used, but I'm not sure.

DriesVanbilloen, Jan 24 '23 11:01

Thanks @DriesVanbilloen! It looks like the issue is that --overwrite deletes the collection after wb-manager init has already run, so the collection is never re-created. I will submit a PR with a fix shortly.

tw4l, Feb 02 '23 23:02
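
For readers hitting the same error, the ordering problem described above can be illustrated with a minimal Node.js sketch. This is not the crawler's actual code or the eventual fix: `prepareCollectionDir` is a hypothetical helper, and the temporary directory stands in for /crawls. It only shows the general idea of clearing the collection first and then re-creating the directories, using a recursive mkdir so that missing parents do not raise ENOENT.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper (an assumption for illustration, not part of browsertrix-crawler).
// Clears the collection directory when overwrite is set, then re-creates
// the pages/ subdirectory so later writes have somewhere to go.
function prepareCollectionDir(collDir, overwrite) {
  if (overwrite) {
    // Delete stale data BEFORE any initialization, not after it.
    fs.rmSync(collDir, { recursive: true, force: true });
  }
  const pagesDir = path.join(collDir, "pages");
  // recursive: true also creates the missing parent (the collection dir itself),
  // which is the case that failed with ENOENT in the log above.
  fs.mkdirSync(pagesDir, { recursive: true });
  return pagesDir;
}

// Usage: a temp directory stands in for /crawls so the sketch is safe to run.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "crawls-"));
const collDir = path.join(root, "collections", "AEM");

// First run creates the collection; leave a stale file behind.
const pagesDir = prepareCollectionDir(collDir, false);
const staleFile = path.join(pagesDir, "pages.jsonl");
fs.writeFileSync(staleFile, "old data\n");

// Second run with overwrite: stale data is gone, directory exists again.
prepareCollectionDir(collDir, true);
console.log(fs.existsSync(pagesDir), fs.existsSync(staleFile));
```

Until a fix is released, a possible workaround (also an assumption, not confirmed in this thread) is to omit --overwrite and remove the collection directory manually before starting the crawl.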