Set up US rendering server on AWS

Open pnorman opened this issue 1 year ago • 7 comments

Ref #637

  • [x] Create new AWS account, linked to others
  • [ ] Finalize cost estimate, based on specs similar to new Europe render servers and traffic equivalent to Pyrene when it handled the US + y/y growth
  • [ ] Get credits from AWS
  • [ ] Get Elastic IP address
  • [ ] Create EC2 instance with Ubuntu 22.04, EBS storage, using elastic IP
  • [ ] Set up Chef on the new instance and assign roles
  • [ ] Import DB, load test with z0-z12 background render
  • [ ] Set up an endpoint in Fastly and slowly move traffic to it
  • [ ] When we're happy, get an EC2 instance savings plan to reduce costs

Outstanding questions

  • Do we bother with CFn for one always-on instance?
  • We're looking at m6g.16xlarge. We haven't run an ARM rendering server before, so we might need to go for m6a instead
  • We're assuming EBS will have the performance characteristics we need. If not, a d-type instance with ephemeral storage might be required

pnorman, Jul 15 '22 07:07

EBS, if using GP3, is generally great. If you're looking for inexpensive high IOPS, the cheapest approach is to create several small GP3 volumes, since each has a 3,000 IOPS baseline, and stripe them with RAID 0 or LVM. ST1 gives consistent hard drive like performance but latency is a bit higher than locally attached hard drives. If mod_tile uses blocking I/O for reads, and I imagine it does, you may find you need fewer Apache threads/processes to get the same request throughput with GP3.
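As a rough illustration of the sizing arithmetic behind striping small GP3 volumes, the sketch below estimates how many volumes are needed to reach a target IOPS figure. It assumes the 3,000 IOPS per-volume baseline mentioned above and ideal linear scaling across a RAID 0 or LVM stripe, which real workloads may not achieve; the function name and the example targets are hypothetical.

```python
import math

GP3_BASELINE_IOPS = 3000  # per-volume baseline quoted in the comment above


def volumes_for_target(target_iops: int, per_volume_iops: int = GP3_BASELINE_IOPS) -> int:
    """Smallest number of striped volumes whose combined baseline meets target_iops.

    Assumes IOPS add up linearly across the stripe (an idealisation).
    """
    if target_iops <= 0:
        return 0
    return math.ceil(target_iops / per_volume_iops)


# Hypothetical targets, for illustration only.
for target in (10_000, 20_000, 50_000):
    print(f"{target} IOPS -> {volumes_for_target(target)} volumes")
```

Under these assumptions, a 20,000 IOPS target works out to 7 baseline volumes.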

m6a/c6a/r6a instances aren't always cheaper than m6i/c6i/r6i once you account for differences in memory performance. I've seen the Xeon chips work out significantly cheaper than Epyc in some situations. I'd benchmark both if the Gravitons don't work out.

Be prepared for your instance to fail. It just happens. Most instances will stay up for years; others will have hardware issues. Sometimes you'll get a warning in the Events section of the EC2 console (also sent by email), giving you a few weeks to stop and start the instance. In other cases the recovery process will start the instance on new hardware. Any data on ephemeral storage would of course be gone, so that's a big negative for using locally attached storage as anything beyond a cache.

Just some thoughts from someone who has been using EC2 for over a decade.

MarkRose, Jul 27 '22 08:07

@MarkRose Thank you for the helpful insights.

Firefishy, Jul 27 '22 08:07

> ST1 gives consistent hard drive like performance but latency is a bit higher than locally attached hard drives.

Our sustained IOPS is 10k-20k, with peaks of 50k, so st1 isn't an option. My inclination is to start with a single maxed-out gp3 volume and, if necessary, split the tiles onto their own volume.
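To make the comparison concrete, here is a minimal sketch that checks the IOPS figures above against a single gp3 volume. The 16,000 IOPS per-volume ceiling is an assumption based on AWS's published gp3 limits around the time of this discussion, not a number from this thread, and should be checked against current documentation; the function name is hypothetical.

```python
GP3_MAX_IOPS = 16_000  # assumed per-volume gp3 ceiling; verify against current AWS docs


def iops_shortfall(required_iops: int, volume_cap: int = GP3_MAX_IOPS) -> int:
    """IOPS demand a single volume with the given cap could not serve (0 if covered)."""
    return max(0, required_iops - volume_cap)


# Figures quoted in the comment above: 10k-20k sustained, 50k peak.
workload = {"sustained_low": 10_000, "sustained_high": 20_000, "peak": 50_000}
for label, iops in workload.items():
    print(f"{label}: need {iops}, shortfall {iops_shortfall(iops)}")
```

Any nonzero shortfall is demand that would have to be served from elsewhere, for example a second volume holding the tiles.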

The big unknown to me is latency, not IOPS. I don't know how that's going to impact performance.

pnorman, Jul 27 '22 08:07

Solving this will also solve #637

grischard, Jul 29 '22 16:07

Depends on #660?

grischard, Jul 29 '22 16:07

> Solving this will also solve https://github.com/openstreetmap/operations/issues/637

We're looking at replacing pyrene independently of this.

> Depends on #660?

No, although they have some common parts for changing our account management

pnorman, Aug 02 '22 06:08

Account has been created. Accessible via assumed role from master account.

Firefishy, Sep 08 '22 11:09