Improve performance of AWS script
This works now! Would be good to make it faster though.
Originally posted by @simonw in https://github.com/simonw/help-scraper/issues/2#issuecomment-1024982726
It takes 2hr45m right now. That's a long time, especially if I want to run it every day! Feels like a poor use of GitHub Actions resources.
Some options:
- Use threads or processes to run some of the tasks in parallel - I'm not sure how many vCPUs GitHub Actions gives me though, so this may not make much of a difference
- Check the version first and only run the crawl if it has changed since last time. This would definitely be worthwhile.
- Dig into the Python implementation of awscli and see if I can call help while avoiding the overhead of starting up a fresh process for every single page
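The thread-pool idea in the first bullet could look something like this - a minimal sketch, not tested against the real crawl, where `run_one`, `fetch_all` and the worker count are my own names rather than anything in the repo. Threads should be enough here even on a couple of vCPUs, because each worker spends its time blocked waiting on a subprocess rather than running Python bytecode:

```python
# Sketch: fan the per-page "aws <service> help" subprocess calls out
# across a thread pool instead of running them one at a time.
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_one(argv):
    # argv is a full command line, e.g. ["aws", "s3", "help"]
    return subprocess.run(argv, capture_output=True, text=True).stdout

def fetch_all(commands, workers=8):
    # map() preserves input order, which keeps the output pages stable
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

Something like `fetch_all([["aws", "s3", "help"], ["aws", "ec2", "help"]])` would then return the rendered help text for each command, in order.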
Worth considering: I'm currently using the aws CLI that ships with the GitHub Actions worker.
According to https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md that's currently AWS CLI 2.4.13. That version gets bumped pretty often - the commit history at https://github.com/actions/virtual-environments/commits/main/images/linux/Ubuntu2004-Readme.md seems to update it every few days.
But the release history on https://pypi.org/project/awscli/#history shows daily releases of the AWS CLI - so actually I should update to the latest version with pip install -U awscli rather than relying on the built-in one.
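The "check the version first" idea could be as simple as comparing the `aws --version` output against a state file written on the previous run. A sketch, assuming a hypothetical aws-version.txt committed alongside the scraped docs:

```python
# Sketch: skip the crawl entirely when the installed CLI version
# hasn't changed since last time. The state-file name is hypothetical.
import pathlib
import subprocess

def installed_version():
    # "aws --version" prints e.g. "aws-cli/2.4.13 Python/3.9.11 Linux/..."
    out = subprocess.run(["aws", "--version"], capture_output=True, text=True)
    # CLI v2 writes this to stdout; some v1 builds used stderr
    return (out.stdout or out.stderr).split()[0]

def crawl_needed(version, state_file=pathlib.Path("aws-version.txt")):
    previous = state_file.read_text().strip() if state_file.exists() else ""
    if version == previous:
        return False
    state_file.write_text(version + "\n")
    return True
```

The workflow would call `crawl_needed(installed_version())` and exit early on False, then commit the updated state file along with the scraped pages.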
https://superfastpython.com/threadpoolexecutor-in-python/ looks useful.
Might also be interesting to try doing this with asyncio and https://docs.python.org/3/library/asyncio-eventloop.html#running-subprocesses
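For the asyncio route, the event loop's subprocess support would look roughly like this - again a sketch with made-up names, where a semaphore caps how many help processes run at once:

```python
# Sketch: start the help subprocesses without blocking, using
# asyncio.create_subprocess_exec; the semaphore bounds concurrency.
import asyncio

async def run_one(argv, sem):
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            *argv, stdout=asyncio.subprocess.PIPE
        )
        out, _ = await proc.communicate()
        return out.decode()

async def fetch_all(commands, limit=8):
    # gather() returns results in the same order as the input commands
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(run_one(argv, sem) for argv in commands))
```

Entry point would be something like `asyncio.run(fetch_all(commands))` with each command an argv list such as `["aws", "s3", "help"]`.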