operations
Set up US rendering server on AWS
Ref #637
- [x] Create new AWS account, linked to others
- [ ] Finalize cost estimate, based on specs similar to new Europe render servers and traffic equivalent to Pyrene when it handled the US + y/y growth
- [ ] Get credits from AWS
- [ ] Get Elastic IP address
- [ ] Create EC2 instance with Ubuntu 22.04, EBS storage, using elastic IP
- [ ] Set up Chef on new instance and assign roles
- [ ] Import DB, load test with z0-z12 background render
- [ ] Set up endpoint in Fastly and slowly move traffic to it
- [ ] When we're happy, get an EC2 instance savings plan to reduce costs
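For the Elastic IP and EC2 steps in the checklist, a scripted version might look roughly like the sketch below. It is only an illustration under assumptions, not the chosen provisioning method: the AMI ID, region, volume size, IOPS and throughput values are placeholders, `/dev/sda1` is assumed to be the root device name of the Ubuntu 22.04 AMI, and `build_run_instances_params` and `provision` are hypothetical helper names. The boto3 calls used are `allocate_address`, `run_instances`, the `instance_running` waiter and `associate_address`.

```python
def build_run_instances_params(ami_id, instance_type="m6g.16xlarge",
                               volume_gib=1000, iops=3000, throughput_mibps=125):
    """Build the keyword arguments for EC2 run_instances.

    volume_gib, iops and throughput_mibps are placeholder values, not a
    sizing recommendation. The root device name is an assumption.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1",  # assumed root device of the AMI
                "Ebs": {
                    "VolumeSize": volume_gib,
                    "VolumeType": "gp3",
                    "Iops": iops,
                    "Throughput": throughput_mibps,
                    # Keep the volume if the instance is terminated.
                    "DeleteOnTermination": False,
                },
            }
        ],
    }


def provision(ami_id, region):
    """Allocate an Elastic IP, launch one instance, and attach the address."""
    import boto3  # imported here so the parameter builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    address = ec2.allocate_address(Domain="vpc")
    reservation = ec2.run_instances(**build_run_instances_params(ami_id))
    instance_id = reservation["Instances"][0]["InstanceId"]
    # The address can only be associated once the instance is running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.associate_address(InstanceId=instance_id,
                          AllocationId=address["AllocationId"])
    return instance_id
```

Keeping the parameters in a plain function makes it easy to review the instance type and volume settings before anything is launched, and swapping in `m6a.16xlarge` is a one-argument change.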
Outstanding questions
- Do we bother with CFn for one always-on instance?
- We're looking at m6g.16xlarge. We haven't run an ARM rendering server before, so we might need to go for m6a instead
- We're assuming EBS will have the performance characteristics we need. If not, a d-type instance with ephemeral storage might be required
EBS, if using GP3, is generally great. If you're looking for inexpensive high IOPS, the cheapest approach is to make a bunch of small GP3 volumes, as each has a 3000 IOPS baseline (and RAID 0/LVM them or whatever). ST1 gives consistent hard-drive-like performance, but latency is a bit higher than with locally attached hard drives. If mod_tile is using blocking IO for reads, and I imagine mod_tile is, you may find you need fewer Apache threads/processes to get the same request throughput with GP3.
m6a/c6a/r6a isn't always a savings over m6i/c6i/r6i due to performance differences in memory. I've seen the Xeon chips work out significantly cheaper than Epyc in some situations. I'd benchmark both if the Gravitons don't work out.
Be prepared for your instance to fail. It just happens. Most instances will stay up for years; others will have hardware issues. Sometimes you'll get a warning in the Events of the EC2 console (and sent by email) giving you a few weeks to stop and start the instance. In other cases the recovery process will start the instance on new hardware. Any data on ephemeral storage would of course be gone, so that's a big negative for using locally attached storage beyond a cache.
Just some thoughts from someone who has been using EC2 for over a decade.
@MarkRose Thank you for the helpful insights.
> ST1 gives consistent hard drive like performance but latency is a bit higher than locally attached hard drives.
Our sustained IOPS is 10k-20k, with peaks of 50k, so st1 isn't an option. My inclination is to start with a single maxed out gp3 and if necessary, split the tiles into their own volume.
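As a rough check of the striping suggestion against these numbers, the arithmetic can be sketched as below. It assumes IOPS add up linearly across striped volumes (an idealisation) and uses only the 3000 IOPS per-volume baseline mentioned earlier in the thread; `volumes_needed` is a hypothetical helper name.

```python
import math

GP3_BASELINE_IOPS = 3000  # per-volume baseline quoted earlier in this thread


def volumes_needed(target_iops, per_volume_iops=GP3_BASELINE_IOPS):
    """Volumes required to reach target_iops, assuming ideal linear striping."""
    if target_iops <= 0 or per_volume_iops <= 0:
        raise ValueError("IOPS values must be positive")
    return math.ceil(target_iops / per_volume_iops)


if __name__ == "__main__":
    for label, iops in [("sustained (low)", 10_000),
                        ("sustained (high)", 20_000),
                        ("peak", 50_000)]:
        print(f"{label}: {iops} IOPS -> {volumes_needed(iops)} baseline volumes")
```

Under those assumptions, the sustained range works out to 4 to 7 baseline volumes and the peak to 17, before paying for any provisioned IOPS on a single volume.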
The big unknown to me is latency, not iops. I don't know how that's going to impact performance.
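One way to reason about the latency question, if reads are blocking as suggested above, is Little's law: the number of reads in flight equals the read rate times the time each read takes. With one outstanding read per thread, that gives a lower bound on how many threads must be waiting on disk at once. The sketch below is only that arithmetic; the latencies passed in are placeholders rather than measurements, and `blocking_readers_needed` is a hypothetical helper name.

```python
def blocking_readers_needed(target_iops, read_latency_us):
    """Minimum threads blocked on reads to sustain target_iops.

    Little's law: in-flight reads = read rate * time per read.
    Latency is given in whole microseconds so the maths stays in integers.
    """
    if target_iops <= 0 or read_latency_us <= 0:
        raise ValueError("inputs must be positive")
    in_flight_scaled = target_iops * read_latency_us  # reads/s * microseconds
    return (in_flight_scaled + 999_999) // 1_000_000  # ceiling division


if __name__ == "__main__":
    # Placeholder latencies to compare, not measured values.
    for latency_us in (100, 500, 1_000):
        for iops in (10_000, 20_000, 50_000):
            n = blocking_readers_needed(iops, latency_us)
            print(f"{iops} IOPS at {latency_us} us/read -> at least {n} blocked readers")
```

This only covers reads that actually reach the volume; anything served from the page cache does not count toward the IOPS figure, so real thread requirements depend on the cache hit rate as well.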
Solving this will also solve #637
Depends on #660?
> Solving this will also solve https://github.com/openstreetmap/operations/issues/637
We're looking at replacing pyrene independent of this.
> Depends on #660?
No, although they have some common parts for changing our account management
Account has been created. Accessible via assumed role from master account.