browsertrix-crawler
make browsertrix-crawler runnable in serverless environments
Hi all,
I've been experimenting with making an AWS Lambda function for browsertrix-crawler. I've gone some distance but hit a snag that the maintainers are probably better equipped to help with.
The problem is: an AWS Lambda function (I'm guessing other serverless options are similar) runs in a controlled environment where the only writable location is the /tmp directory. For browsertrix-crawler outputs the --cwd option should solve it, but something is still trying to write to .local (maybe that's playwright/redis or some other dependency?).
So the error I currently get is:
mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory
{
"logLevel": "warn",
"context": "redis",
"message": "ioredis error",
"details": {
"error": "[ioredis] Unhandled error event:"
}
}
{
"logLevel": "warn",
"context": "state",
"message": "Waiting for redis at redis://localhost:6379/0",
"details": {}
}
{
"logLevel": "error",
"context": "general",
"message": "Crawl failed",
"details": {
"type": "exception",
"message": "Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!",
"stack": "TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!\n at ChromeLauncher.launch (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:123:23)\n at async Browser._init (file:///app/util/browser.js:236:20)\n at async Browser.launch (file:///app/util/browser.js:61:5)\n at async Crawler.crawl (file:///app/crawler.js:821:5)\n at async Crawler.run (file:///app/crawler.js:311:7)"
}
}
And this is the version info:
{
"logLevel": "info",
"context": "general",
"message": "Browsertrix-Crawler 0.11.2 (with warcio.js 1.6.2 pywb 2.7.4)",
"details": {}
}
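One thing that might be worth trying for the .local writes is pointing HOME and the XDG directories at /tmp before launching the crawler, since a path like /.local suggests HOME resolves to / in this environment. The following is only a minimal, untested sketch of that idea: the helper names build_writable_env and run_crawl are hypothetical, and the `crawl` command with a `--url` flag is my assumption about how the image is invoked. It would only target the .local errors; the /dev/fd/63 and redis messages may well be separate problems.

```python
import os
import subprocess

# Assumption: /tmp is the only writable path, as described above.
WRITABLE_ROOT = "/tmp"


def build_writable_env(base_env=None, root=WRITABLE_ROOT):
    """Return a copy of the environment with HOME and XDG dirs under root."""
    env = dict(os.environ if base_env is None else base_env)
    env["HOME"] = root
    env["XDG_CONFIG_HOME"] = f"{root}/.config"
    env["XDG_DATA_HOME"] = f"{root}/.local/share"
    env["XDG_CACHE_HOME"] = f"{root}/.cache"
    return env


def run_crawl(url, root=WRITABLE_ROOT):
    """Hypothetical wrapper: run the crawler with outputs and HOME under root."""
    env = build_writable_env(root=root)
    # Assumption: the image exposes a `crawl` command accepting --url;
    # --cwd is the option mentioned above for redirecting outputs.
    cmd = ["crawl", "--url", url, "--cwd", root]
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)
```

In lambda_function.py the handler could then call run_crawl(url) and return the captured stdout/stderr, which would make it easier to see whether the .local errors disappear.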
I've put the Dockerfile and lambda_function.py in this gist; you can use them if you want to replicate the issue.
For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html
And I'm using the API gateway to make testing quick: