osm-analytics-cruncher Strategy for optimizing computing resources

Research and compare options for optimizing resources needed to run the cruncher. This may include using spot instances, using a dedicated server, parallelizing processes, starting/stopping EC2 as needed.

May 11 '16 19:05 cgiovando

A first step could be to define the infrastructure required by the project as a CloudFormation template. This would give us visibility into the infrastructure currently deployed, let us assess the budget, and help us find opportunities for optimization. Related to #5.

Jun 16 '16 18:06 defvol

@tyrasd - would you have time to help with this? I think just creating the overall system diagram could be enough for @rodowi to translate into a CloudFormation template?

Jun 16 '16 23:06 cgiovando

@tyrasd besides EC2 and S3, are there any other types of AWS resources that this project depends on?

Jun 17 '16 15:06 defvol

Another benefit of defining a CloudFormation template is that we get "replicability" of the whole deployment process. I'd be happy to help with this.
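As a rough starting point, a template covering only the two resource types mentioned so far (EC2 and S3) could be generated like this. It is a sketch under assumptions: the logical names `CruncherInstance` and `TileBucket` and the `InstanceType` and `ImageId` parameters are hypothetical placeholders, not a description of the actual deployment.

```python
import json


def build_template():
    """Return a minimal CloudFormation template as a plain dict.

    Assumption: the project needs one EC2 instance and one S3 bucket.
    All logical names and parameters below are hypothetical placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch of the osm-analytics-cruncher infrastructure",
        "Parameters": {
            # Left without defaults so nothing about the real setup is implied.
            "InstanceType": {"Type": "String"},
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            "CruncherInstance": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": {"Ref": "InstanceType"},
                    "ImageId": {"Ref": "ImageId"},
                },
            },
            "TileBucket": {"Type": "AWS::S3::Bucket"},
        },
    }


if __name__ == "__main__":
    # Print the template as JSON so it can be reviewed or saved to a file.
    print(json.dumps(build_template(), indent=2))
```

Keeping the template in version control would make it easier to review what is deployed and to estimate its cost.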

Jun 17 '16 22:06 defvol

We could save lots of resources by serving the tiles from Mapbox infrastructure instead of spinning up a local tile server. The frontend is using Mapbox as a basemap, but still loads vector tiles from the cruncher server.

Jun 24 '16 16:06 defvol

If we removed the requirement for serving tiles, this EC2 instance could run batch jobs on spot instances once a day, saving a lot of money; i.e., it would run only during UTC nights on cheap machines.

Jun 24 '16 16:06 defvol

Thanks @rodowi - can we talk about how we can implement your suggestions? Is this something you can set up in parallel, and then we switch over to the new source when it is ready and tested?

cc @dodobas

Aug 01 '16 18:08 cgiovando

@cgiovando yes, def we should work on this in parallel

This is what I see as next steps for testing this out:

[x] I will add an optional script that uploads tilesets to the HOT account on Mapbox API (I can use my account for now) and skips running the server.
[x] Will fork the frontend and work on loading the tiles from this Mapbox account.
[ ] Send PRs to both the cruncher and the frontend.
[ ] Save money.

Notes:

We should def keep the current tile server as it works for deployments outside of Mapbox. So we'll provide an optional flag to run on this new setup.
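A minimal sketch of how such an optional flag could look, assuming a small Python entry point. The flag name `--upload-to-mapbox` and the two action names are made up for illustration; the default keeps the current tile-server behaviour.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a hypothetical post-crunch step."""
    parser = argparse.ArgumentParser(
        description="Cruncher post-processing (sketch)"
    )
    # Hypothetical flag: off by default, so existing deployments that rely
    # on the local tile server keep working unchanged.
    parser.add_argument(
        "--upload-to-mapbox",
        action="store_true",
        help="upload tilesets to Mapbox and skip running the tile server",
    )
    return parser.parse_args(argv)


def choose_action(args):
    """Return 'upload' when the flag is set, otherwise 'serve'."""
    return "upload" if args.upload_to_mapbox else "serve"


if __name__ == "__main__":
    print(choose_action(parse_args()))
```

Making the new path opt-in means deployments outside of Mapbox need no changes.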

Sep 07 '16 20:09 defvol

> Save money.

OK, don't get me wrong, but how does this save us any money overall? Please correct me if I'm wrong, but as far as I know, Mapbox hosts uploaded mbtiles as individual tiles on S3, right? And since S3 charges per request, our daily ~4 million tile uploads would cost us about 20 USD per day in PUT-request charges alone, while leaving the r3.2xlarge EC2 instance running 24/7 currently costs us only 9 USD per day.

related issue: #7

Sep 08 '16 08:09 tyrasd

> but how does this save us any money overall?

By uploading tiles directly to the Mapbox API we skip S3, saving those 20 USD (4M requests * 0.005 USD / 1000 requests = ~20 USD per day).

By serving tiles from Mapbox we can decrease the uptime of the EC2 instance to one hour a day. The instance would then be used only for data crunching, lowering its cost to about 4% of the current bill (1 hour out of 24), assuming we improve the script performance.

We are talking about ~800 USD a month, which IMO is not a negligible cost for a non-profit.

Besides saving some bucks, we will also benefit from having fewer moving parts to maintain.
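The arithmetic behind these estimates can be checked with a few lines. This is a sketch that only restates the figures quoted in this thread (4 million tiles per day, 0.005 USD per 1,000 PUT requests, 9 USD per day for the instance); they are estimates, not billing data.

```python
# Figures quoted in this thread; treat them as estimates, not billing data.
TILES_PER_DAY = 4_000_000
PUT_PRICE_PER_1000 = 0.005  # USD per 1,000 S3 PUT requests
EC2_COST_PER_DAY = 9.0      # USD, r3.2xlarge running 24/7


def s3_put_cost_per_day(tiles=TILES_PER_DAY, price=PUT_PRICE_PER_1000):
    """Daily PUT-request cost if every tile were written to S3 individually."""
    return tiles / 1000 * price


def ec2_cost_per_day(hours_up, full_day_cost=EC2_COST_PER_DAY):
    """Daily instance cost when it is only up for `hours_up` hours."""
    return full_day_cost * hours_up / 24


if __name__ == "__main__":
    print("S3 PUT cost per day (USD):", round(s3_put_cost_per_day(), 2))
    print("EC2 cost per day at 1 h uptime (USD):", ec2_cost_per_day(1))
    print("Share of the 24/7 bill:", round(1 / 24, 3))
```

One hour out of 24 is roughly 4.2% of the day, which is where the "4% of the current bill" figure comes from.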

Sep 08 '16 18:09 defvol

OK, let's do this then! :)

It would probably make sense to start by uploading the static "historic" snapshots first.

Sep 12 '16 07:09 tyrasd

> uploading the static "historic" snapshots first.

what are those?

I was thinking of the following steps to get started:

1. Run a global script in another EC2 instance (or just copy the mbtiles files from the server, if anyone has SSH access).
2. Upload buildings, hotprojects, and highways to Mapbox.
3. Make adjustments to the Map component in the frontend.
4. Prepare a demo on a forked GitHub page.
5. Once it works, add the upload script to run.sh (so it runs on a daily basis).
6. Submit the changes from step 3 to the frontend repo.
7. Deploy map styles loaded from the Mapbox API.
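The upload step above could start from a small plan that pairs each layer with a tileset ID. This is only a sketch: Mapbox tileset IDs take the form `username.id`, but the `<layer>.mbtiles` file naming, the `directory` argument, and the account name are assumptions for illustration.

```python
# Layer names taken from the steps above.
LAYERS = ["buildings", "hotprojects", "highways"]


def upload_plan(username, layers=LAYERS, directory="."):
    """Pair each layer's mbtiles file with the tileset ID it would upload to.

    Assumption: each layer is stored as `<directory>/<layer>.mbtiles`.
    """
    return [
        {
            "file": f"{directory}/{layer}.mbtiles",
            "tileset": f"{username}.{layer}",
        }
        for layer in layers
    ]


if __name__ == "__main__":
    # "example-account" is a placeholder, not a real Mapbox username.
    for item in upload_plan("example-account"):
        print(item["file"], "->", item["tileset"])
```

An upload script added to run.sh could iterate over such a plan once per day.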

does it make sense?

Sep 12 '16 16:09 defvol

@tyrasd, I'm also really curious about the 'historic' snapshots. Are you talking about the annual snapshots?

Sep 12 '16 20:09 jenningsanderson

Yes, I meant the annual data that's used in the "compare time periods" slider-map.

Sep 13 '16 07:09 tyrasd
