Multiple PURL servers
Currently the new PURL system is running on a single server on Amazon Web Services under my account.
We can make our PURL system more robust by running multiple servers at multiple sites, using round-robin DNS or a fancier load-balancing system to distribute reads and manage failure. Two or three servers would be sufficient, and even a small server should be able to handle the load.
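For illustration, here is a minimal Python sketch of what round-robin DNS looks like from the client side: one name publishes several A records, and clients simply pick one of the returned addresses. The hostname is used as an example only; the actual records would depend on the setup we choose.

```python
# Round-robin DNS from the client's point of view: resolve one name,
# get several A records, pick any of them. Purely illustrative.
import random
import socket

def resolve_all(host, port=80):
    """Return every IPv4 address currently published for a hostname."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in infos})

if __name__ == "__main__":
    addresses = resolve_all("purl.obolibrary.org")
    # With two or three servers behind the name, any one of them can serve
    # the request; many resolvers rotate the order of the answers, which is
    # what spreads the read load around.
    print("published addresses:", addresses)
    print("picked:", random.choice(addresses))
```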
@cmungall and @alanruttenberg have both mentioned the possibility of running a server.
Running multiple servers does add some complexity:
- we need to ensure that all the servers are kept in sync
- debugging and recovery can be a bit harder, e.g. if only one of three machines shows a problem
- logs will have to be amalgamated before we can do analysis, see #63
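To make the log point concrete, here is a rough Python sketch of the amalgamation step, assuming each server writes Apache-style access logs with the usual `[day/month/year:time zone]` timestamps: tag each line with its server of origin and merge everything into one time-ordered stream before analysis. The file names and timestamp format are assumptions, not a description of the current setup.

```python
# Hypothetical sketch: merge access logs from several PURL servers into
# one time-ordered stream, tagging each line with its source server.
from datetime import datetime
from pathlib import Path
import re

TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

def parse_time(line):
    """Extract the request timestamp from an Apache-style log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")

def amalgamate(log_paths):
    """Return (timestamp, server, line) tuples across all servers, oldest first."""
    entries = []
    for path in log_paths:
        server = Path(path).stem
        for line in Path(path).read_text().splitlines():
            when = parse_time(line)
            if when is not None:
                entries.append((when, server, line))
    return sorted(entries)

for when, server, line in amalgamate(["us-east.log", "eu-west.log"]):
    print(when.isoformat(), server, line)
```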
Since we switched to our own PURL solution in late 2015, we've been running the whole system on a single Amazon Web Services (AWS) Elastic Compute Cloud (EC2) micro instance. I've migrated to new instances a few times to stay ahead of maintenance reboots. Performance has never been an issue. Stability has been excellent. Cost is very reasonable. We have a handful of people with access to the server and the DNS records in case of emergency. The main problems we've had have been with our DNS registration.
But I'm always worried that the server will die in the middle of the night and all our PURLs will fail until I wake up and see a hundred angry emails. So for years we've talked about a more resilient architecture. I want to finally move forward on that front. Here are my current thoughts.
I've been happy with AWS. Although there are ways in which AWS is a single-point-of-failure, a large-scale AWS outage means that nobody on the Internet is getting much done that day. Most of the resources we redirect to are directly or indirectly dependent on AWS. And there are enough good alternatives to the parts of AWS that we use (and that I'm proposing to use) that we aren't really locked in.
- I think we should add a second PURL server (EC2 micro instance) in a geographically distinct region. The current server is in US East (N. Virginia). I propose that the second one be somewhere in Europe. Both will run the same code and pull from GitHub on the same schedule (currently 10 minutes).
- Each server should have a Health Check for HTTP service. (We currently have "reachability checks" with alarms to notify me and @cmungall, but we specifically care about HTTP.)
- We should switch to using AWS Route53 for DNS. It should be pointed at the two servers, with a latency routing policy (or geolocation or geoproximity?), and watch their Health Checks. The upshot is an active-active failover configuration across distinct AWS regions: users get routed to the "closest" server; both servers have enough capacity to handle all requests; if one server fails, all traffic is sent to the other one until we fix the problem.
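To make the DNS piece concrete, here is a hedged boto3 sketch of roughly what that Route53 configuration could look like: one HTTP health check per server, and latency-routed A records that each carry a health check, so traffic normally goes to the nearest server and fails over to the other when a check fails. The hosted zone ID, IP addresses, regions, and health-check path below are placeholders, not our real values.

```python
# Sketch of the proposed active-active setup in Route53, not a finished
# deployment: per-server HTTP health checks plus latency-routed A records.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "ZXXXXXXXXXXXXX"   # placeholder for the obolibrary.org zone
SERVERS = {
    "us-east-1": "203.0.113.10",    # current N. Virginia instance (example IP)
    "eu-west-1": "203.0.113.20",    # proposed European instance (example IP)
}

for region, ip in SERVERS.items():
    # Health check: Route53 probes the server's HTTP service directly.
    check = route53.create_health_check(
        CallerReference=f"purl-{region}",
        HealthCheckConfig={
            "IPAddress": ip,
            "FullyQualifiedDomainName": "purl.obolibrary.org",  # Host header
            "Port": 80,
            "Type": "HTTP",
            "ResourcePath": "/obo",   # assumed path; any reliable URL works
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )
    # Latency routing policy: one record per server, tied to its health check,
    # so users reach the "closest" healthy server and traffic fails over
    # automatically when a check goes red.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "purl.obolibrary.org.",
                    "Type": "A",
                    "SetIdentifier": f"purl-{region}",
                    "Region": region,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                    "HealthCheckId": check["HealthCheck"]["Id"],
                },
            }],
        },
    )
```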
Technically this is pretty simple. Let me know if you see any mistakes, or have suggestions or differing opinions.
Administratively it's a bit trickier. I've been running the EC2 instance from my own AWS account -- the cost is low and I'm happy to continue to do so. Our domain names are in a shared Google Domains account. I believe that we would have to move the obolibrary.org domain name to AWS Route53, and it would have to belong to someone's account, but we really want shared responsibility. If I get hit by a bus, that can't cripple OBO. Researching this today I see that AWS has their Resource Access Manager (RAM, https://docs.aws.amazon.com/ram/) that lets you share control of resources across AWS accounts, and AWS Organizations (https://aws.amazon.com/organizations/) that I don't really understand. They also have Identity & Access Management (IAM) stuff that I've only used in the simplest ways. My main complaint about AWS is that they have so many overlapping services.
Given this administrative complexity, we could stick with Google Domains, use an Elastic Load Balancer (ELB) instead of Route53, point DNS at the ELB, and get pretty much the same behaviour. I could do all that under my AWS account, and we'd have the current degree of shared control over DNS with Google Domains.
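For comparison, here is a rough boto3 sketch of the ELB variant. One caveat: a load balancer lives in a single region, so under this alternative both instances would sit behind it in that region rather than in two geographically distinct ones. The names, VPC, subnets, instance IDs, and health-check path are all placeholders.

```python
# Rough sketch of the ELB alternative: an Application Load Balancer with its
# own HTTP health checks, forwarding to the registered PURL instances.
import boto3

elbv2 = boto3.client("elbv2")

VPC_ID = "vpc-placeholder"
SUBNET_IDS = ["subnet-placeholder-a", "subnet-placeholder-b"]
INSTANCE_IDS = ["i-placeholder-1", "i-placeholder-2"]

lb = elbv2.create_load_balancer(
    Name="purl-lb",
    Subnets=SUBNET_IDS,
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]

group = elbv2.create_target_group(
    Name="purl-servers",
    Protocol="HTTP",
    Port=80,
    VpcId=VPC_ID,
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/obo",   # assumed health-check path
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=group["TargetGroupArn"],
    Targets=[{"Id": instance_id} for instance_id in INSTANCE_IDS],
)

elbv2.create_listener(
    LoadBalancerArn=lb["LoadBalancerArn"],
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": group["TargetGroupArn"]}],
)
# DNS in Google Domains would then point purl.obolibrary.org at the load
# balancer's DNS name (lb["DNSName"]) via a CNAME.
```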
Overall, I think it would be better to solve the administrative problems properly, because that opens the door to other shared resources such as a community-wide integration testing system.
@OBOFoundry/purl-admin I'd appreciate feedback!
@jamesaoverton This looks like a very well thought out plan. I have a few observations that you may find more or less useful.
As far as administration goes, an AWS Organization plus IAM is what we use in other places, and it seems to work fairly well as long as enough people have the right privileges to make sure a bus doesn't create problems (e.g. someone can still transfer DNS out of Route53).
A second server could be used as a fallback and would make things more robust; it is also a slightly more "difficult" solution that other people would have to be fairly familiar with to keep things running smoothly. Documentation helps for things like this, but it may be a good idea to simulate actual failures and walk people through them. As part of a more robust system, a testing environment would be very useful for training and for testing new code/deployments.