
Improve performance of AWS script

Open simonw opened this issue 4 years ago • 4 comments

This works now! Would be good to make it faster though.

Originally posted by @simonw in https://github.com/simonw/help-scraper/issues/2#issuecomment-1024982726

simonw · Jan 29 '22 20:01

It takes 2hr45m right now. That's a long time, especially if I want to run it every day! Feels like a poor use of GitHub Actions resources.

simonw · Jan 29 '22 20:01

Some options:

  • Use threads or processes to run some of the tasks in parallel - not sure how many vCPUs GitHub Actions gives me though, so this may not make much of a difference
  • Check the version first and only run the crawl if it has changed since last time. This would definitely be worthwhile.
  • Dig into the Python implementation of awscli and see if I can call help while avoiding the overhead of starting up a fresh process for every single page
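The version-check option could look something like this minimal sketch. The `aws-version.txt` state file name is an assumption for illustration, not something that exists in the repo:

```python
import subprocess
from pathlib import Path

def current_version(command=("aws", "--version")):
    # Returns something like "aws-cli/2.4.13 Python/3.9.10 ...".
    return subprocess.run(command, capture_output=True, text=True).stdout.strip()

def crawl_needed(version, path=Path("aws-version.txt")):
    # Compare against the version recorded on the previous run;
    # skip the (long) crawl entirely if nothing changed.
    previous = path.read_text().strip() if path.exists() else None
    if previous == version:
        return False
    path.write_text(version + "\n")  # record for the next run
    return True
```

The version file would need to be committed back to the repo (or cached) between Actions runs for the comparison to work across days.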

simonw · Jan 29 '22 20:01

Worth considering: I'm currently using the aws CLI that ships with the GitHub Actions worker.

According to https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-Readme.md that's currently AWS CLI 2.4.13, and it gets bumped pretty often: the commit history at https://github.com/actions/virtual-environments/commits/main/images/linux/Ubuntu2004-Readme.md shows it being updated every few days.

But the release history on https://pypi.org/project/awscli/#history shows daily releases of the AWS CLI - so I should update to the latest version with pip install -U awscli rather than relying on the built-in one.

simonw · Jan 29 '22 20:01

https://superfastpython.com/threadpoolexecutor-in-python/ looks useful.
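A minimal sketch of the ThreadPoolExecutor approach, assuming each help page is fetched by running one CLI subprocess (the command lists and worker count here are placeholders, not the repo's actual code):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def fetch_help(args):
    # Run one CLI invocation (e.g. ["aws", "s3", "help"]) and return its stdout.
    return subprocess.run(args, capture_output=True, text=True).stdout

def fetch_all(commands, max_workers=8):
    # Threads work fine here: each task just waits on a subprocess,
    # so the vCPU count matters less than it would for CPU-bound work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_help, commands))
```

pool.map preserves input order, so the results line up with the list of commands even though they finish out of order.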

Might also be interesting to try doing this with asyncio and https://docs.python.org/3/library/asyncio-eventloop.html#running-subprocesses
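A rough sketch of the asyncio variant, using create_subprocess_exec with a semaphore to cap concurrency (the limit of 8 is an arbitrary assumption):

```python
import asyncio

async def run_one(semaphore, args):
    # The semaphore caps concurrency so we don't spawn hundreds of processes at once.
    async with semaphore:
        proc = await asyncio.create_subprocess_exec(
            *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, _ = await proc.communicate()
        return stdout.decode()

async def run_all(commands, limit=8):
    semaphore = asyncio.Semaphore(limit)
    return await asyncio.gather(*(run_one(semaphore, c) for c in commands))
```

asyncio.gather returns results in the same order as the input commands, and asyncio.run(run_all(commands)) drives the whole thing from synchronous code.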

simonw · Feb 13 '22 12:02