osm-analytics-cruncher
Strategy for optimizing computing resources
Research and compare options for optimizing resources needed to run the cruncher. This may include using spot instances, using a dedicated server, parallelizing processes, starting/stopping EC2 as needed.
A first step could be to define the infrastructure required by the project as a CloudFormation template. This would give us visibility into the infrastructure currently deployed, let us assess the budget, and help us find opportunities for optimization. Related to #5.
@tyrasd - would you have time to help with this? I think just creating the overall system diagram could be enough for @rodowi to translate into a CloudFormation template.
@tyrasd besides EC2 and S3, are there any other types of AWS resources that this project depends on?
Another benefit of defining a CloudFormation template is that we get "replicability" of the whole deployment process. I'd be happy to help on this.
We could save lots of resources by serving the tiles from Mapbox infrastructure instead of spinning up a local tile server. The frontend is using Mapbox as a basemap, but still loads vector tiles from the cruncher server.
If we removed the requirement for serving tiles, this EC2 instance could run batch jobs on spot instances once a day, saving a lot of money, i.e. it would run only during UTC nights on cheap computers.
Thanks @rodowi - can we talk about how we can implement your suggestions? Is this something you can set up in parallel, so that we switch over to the new source once it is ready and tested?
cc @dodobas
@cgiovando yes, def we should work this in parallel
This is what I see as next steps for testing this out:
- [x] I will add an optional script that uploads tilesets to the HOT account on Mapbox API (I can use my account for now) and skips running the server.
- [x] Will fork the frontend and work on loading the tiles from this Mapbox account.
- [ ] Send PRs to both the cruncher and the frontend.
- [ ] Save money.
Notes:
- We should def keep the current tile server as it works for deployments outside of Mapbox. So we'll provide an optional flag to run on this new setup.
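To make the optional flag concrete, here is a minimal sketch of how the switch could look. Everything named here is an assumption for illustration: the `--upload-to-mapbox` flag and the step names (`crunch`, `upload_tilesets`, `start_tile_server`) are hypothetical and not taken from the cruncher's actual scripts.

```python
import argparse

def parse_args(argv):
    """Parse the (hypothetical) command-line options for the post-crunch step."""
    parser = argparse.ArgumentParser(description="Post-crunch step (sketch)")
    parser.add_argument(
        "--upload-to-mapbox",
        action="store_true",
        help="upload tilesets to Mapbox and skip running the local tile server",
    )
    return parser.parse_args(argv)

def plan_steps(argv):
    """Return the ordered list of step names for the given arguments.

    The step names are placeholders; the real work would live in the
    existing scripts.
    """
    args = parse_args(argv)
    if args.upload_to_mapbox:
        # New setup: crunch, push tiles to Mapbox, no server to keep alive.
        return ["crunch", "upload_tilesets"]
    # Default: unchanged behaviour for deployments outside of Mapbox.
    return ["crunch", "start_tile_server"]

print(plan_steps([]))
print(plan_steps(["--upload-to-mapbox"]))
```

Keeping the tile server as the default means existing deployments are unaffected unless the flag is passed explicitly.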
> Save money.

OK, don't get me wrong, but how does this save us any money overall? Please correct me if I'm wrong, but as far as I know, Mapbox hosts uploaded mbtiles as individual tiles on S3, right? And as S3 charges per request, our daily ~4 million tile uploads would cost us about 20 USD per day in PUT-request charges alone, while leaving the r3.2xlarge EC2 instance running 24/7 currently costs us only 9 USD per day.
related issue: #7
> but how does this save us any money overall?

By uploading tiles directly to the Mapbox API we skip S3, saving those 20 USD (4M * 0.005 / 1000 = ~20 USD).
By serving tiles from Mapbox we can decrease the uptime of the EC2 instance to one hour a day. This means that the instance will only be used for data crunching (lowering the cost to 4% of the current bill) <- assuming we improve the script performance.
We are talking about ~800 USD a month, which IMO is not a negligible cost for a non-profit.
Besides saving some bucks, we will also benefit from having fewer moving parts to maintain.
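The figures quoted in this thread can be sanity-checked with a few lines of Python. This is only a rough sketch: it uses the numbers as stated above (0.005 USD per 1,000 PUT requests, ~4 million tiles a day, 9 USD a day for the r3.2xlarge), and the one-hour uptime and 30-day month are assumptions, not measured values.

```python
# Rough cost check using only the figures quoted in this thread.
PUT_PRICE_PER_1000 = 0.005    # USD per 1,000 S3 PUT requests (rate used above)
TILES_PER_DAY = 4_000_000     # ~4 million tile uploads per day
EC2_FULL_DAY = 9.0            # USD per day for the r3.2xlarge running 24/7
HOURS_UP_PER_DAY = 1          # proposed uptime (assumes the crunch fits in 1 h)
DAYS_PER_MONTH = 30           # assumption for the monthly figures

# Daily PUT-request charge if every tile were written to our own S3 bucket.
put_cost_per_day = TILES_PER_DAY / 1000 * PUT_PRICE_PER_1000

# Share of the day the instance would be up, and the resulting daily EC2 cost
# (assumes cost scales linearly with uptime at the same hourly rate).
uptime_share = HOURS_UP_PER_DAY / 24
ec2_batch_per_day = EC2_FULL_DAY * uptime_share

# Monthly figures.
ec2_monthly_now = EC2_FULL_DAY * DAYS_PER_MONTH
ec2_monthly_batch = ec2_batch_per_day * DAYS_PER_MONTH
put_monthly = put_cost_per_day * DAYS_PER_MONTH

print(f"PUT charges per day:        {put_cost_per_day:.2f} USD")
print(f"EC2 uptime share:           {uptime_share:.1%}")
print(f"EC2 per month, 24/7:        {ec2_monthly_now:.2f} USD")
print(f"EC2 per month, 1 h per day: {ec2_monthly_batch:.2f} USD")
print(f"PUT charges per month:      {put_monthly:.2f} USD")
```

The linear scaling with uptime ignores spot pricing, storage, and data transfer, so the real numbers will differ somewhat.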
OK, let's do this then! :)
It would probably make sense to start by uploading the static "historic" snapshots first.
> uploading the static "historic" snapshots first.

what are those?
I was thinking of the following steps to get started:
1. run a global script on another EC2 instance (or just copy the mbtiles files from the server, if anyone has ssh access)
2. upload buildings, hotprojects, and highways to Mapbox
3. make adjustments to the Map component in the frontend
4. prepare a demo on a forked GitHub page
5. once it works, add the upload script to run.sh (so it runs on a daily basis)
6. submit the changes from step 3 to the frontend repo
7. deploy map styles loaded from the Mapbox API
does it make sense?
@tyrasd, I'm also really curious about the 'historic' snapshots. Are you talking about the annual snapshots?
Yes, I meant the annual data that's used in the "compare time periods" slider-map.