
make browsertrix-crawler runnable in serverless environments

msramalho opened this issue 1 year ago • 10 comments

Hi all,

I've been experimenting with running browsertrix-crawler as an AWS Lambda function. I've gotten some distance, but I've hit a snag that the maintainers are probably better equipped to help with.

The problem is that the AWS Lambda environment (I'm guessing other serverless options are similar) is read-only except for the /tmp directory. For browsertrix-crawler's own outputs the --cwd option should solve this, but something is still trying to write to .local (maybe puppeteer/redis or some other dependency?).
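A workaround that has helped in other read-only container setups (an assumption here, not something I've verified on Lambda) is to point HOME and the XDG directories at /tmp before launching the crawler, so the ~/.local write lands on the one writable path:

```shell
# Sketch: redirect HOME and the XDG base directories to /tmp, the only
# writable path in a Lambda container, so Chrome's attempt to create
# ~/.local/share/applications/mimeapps.list hits a writable filesystem.
# Whether this alone fixes the crawl on Lambda is untested.
export HOME=/tmp
export XDG_DATA_HOME=/tmp/.local/share
export XDG_CONFIG_HOME=/tmp/.config
mkdir -p "$XDG_DATA_HOME/applications" "$XDG_CONFIG_HOME"
```

In a Dockerfile the same thing could be done with ENV lines before the entrypoint.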

So the error I currently get is:

mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory
{
    "logLevel": "warn",
    "context": "redis",
    "message": "ioredis error",
    "details": {
        "error": "[ioredis] Unhandled error event:"
    }
}
{
    "logLevel": "warn",
    "context": "state",
    "message": "Waiting for redis at redis://localhost:6379/0",
    "details": {}
}
{
    "logLevel": "error",
    "context": "general",
    "message": "Crawl failed",
    "details": {
        "type": "exception",
        "message": "Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!",
        "stack": "TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!\n    at ChromeLauncher.launch (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:123:23)\n    at async Browser._init (file:///app/util/browser.js:236:20)\n    at async Browser.launch (file:///app/util/browser.js:61:5)\n    at async Crawler.crawl (file:///app/crawler.js:821:5)\n    at async Crawler.run (file:///app/crawler.js:311:7)"
    }
}

And this is the version info:

{
    "logLevel": "info",
    "context": "general",
    "message": "Browsertrix-Crawler 0.11.2 (with warcio.js 1.6.2 pywb 2.7.4)",
    "details": {}
}

I've put the Dockerfile and lambda_function.py in this gist; you can use it to replicate the issue.
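For context, a handler along these lines is roughly what I mean (a hypothetical sketch, not the actual gist contents; the `crawl` entrypoint and `--url`/`--cwd` flags are from the browsertrix-crawler image, everything else is illustrative):

```python
# Hypothetical Lambda handler sketch: shell out to the browsertrix-crawler
# `crawl` command with HOME and the working directory pointed at /tmp,
# the only writable path in the Lambda environment.
import os
import subprocess

def handler(event, context):
    # Redirect HOME so anything writing to ~/.local uses /tmp instead.
    env = dict(os.environ, HOME="/tmp")
    result = subprocess.run(
        ["crawl",
         "--url", event.get("url", "https://example.com"),
         "--cwd", "/tmp"],
        env=env, capture_output=True, text=True, timeout=600,
    )
    # Return the exit code and a tail of stdout for quick inspection.
    return {"returncode": result.returncode, "stdout": result.stdout[-1000:]}
```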

For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html, and I'm using API Gateway to make testing quick.

msramalho avatar Dec 11 '23 12:12 msramalho