dialogue Prefer routing traffic within an AWS availability zone to save $$$

Most of our services have nodes in different availability zones. Given that there's nothing constraining traffic in any AWS-specific way, every time a node of email-service wants to talk to a node of MP, it might pick a node in any availability zone. This means we're probably paying $$$ in cross-AZ traffic when we don't need to.

Pricing diagram from this blog post

It seems like if we can slightly bias connections towards staying in their availability zone (e.g. eu-west-1a <-> eu-west-1a) then we'd be able to cut down on our spend a bit.

Proposal

When a server is running on AWS, there's a magic IP address we can call to find out which availability zone it's currently in, e.g.

$ curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone
eu-west-1a

(AtlasDB currently uses this).
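As a rough illustration of the lookup, here is a minimal Java sketch that makes the same request as the curl command and returns an empty result whenever the endpoint can't be reached (for example when running locally or on another cloud). The class name `AvailabilityZones`, the `lookup` method and the timeout parameter are made up for this example and are not existing Dialogue or AtlasDB code. It uses the same unauthenticated request as the curl call, so instances that enforce IMDSv2 session tokens would need an extra token step that isn't shown.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper, not part of any existing library.
final class AvailabilityZones {
    // Same link-local metadata URL as the curl example.
    static final URI AWS_METADATA_URI =
            URI.create("http://169.254.169.254/latest/meta-data/placement/availability-zone");

    private AvailabilityZones() {}

    /** Returns the zone reported by the endpoint, or empty if it can't be determined. */
    static Optional<String> lookup(URI endpoint, Duration timeout) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(timeout).build();
        HttpRequest request = HttpRequest.newBuilder(endpoint).timeout(timeout).GET().build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            String body = response.body().trim();
            if (response.statusCode() != 200 || body.isEmpty()) {
                return Optional.empty();
            }
            return Optional.of(body);
        } catch (IOException e) {
            // Connection refused, timeout, no route: treat the zone as unknown.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
```

A caller would probably do this once at startup with a short timeout, e.g. `AvailabilityZones.lookup(AvailabilityZones.AWS_METADATA_URI, Duration.ofMillis(200))`, and treat an empty result as "no zone information".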

Servers could then advertise this information somehow: using a header, a dedicated metadata endpoint, or perhaps even plumbed through YAML. We might even be able to DNS-resolve the hosts we're given and match them against Amazon's published IP ranges: https://ip-ranges.amazonaws.com/ip-ranges.json.

With this information, I'd suggest that we add a tiny constant bias to the Balanced Channel's scores, so rather than starting everything off at 0, we'd say hosts that are in other availability zones get a minimum score of 1. This would mean that under zero utilization, the first request would always go intra AZ.
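To make the bias concrete, here is a small Java sketch of the scoring idea. Everything in it is hypothetical: `Host`, `score`, `pick` and `CROSS_ZONE_PENALTY` are names invented for the example, and using the in-flight request count as the base score is an assumption, not a description of how the Balanced Channel actually computes scores. Hosts whose zone is unknown get no penalty, so the behaviour is unchanged when zone information is missing.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the proposed bias, not the Balanced Channel's real code.
final class ZoneBiasedSelector {
    // Hosts known to be in another zone start at 1 instead of 0.
    static final int CROSS_ZONE_PENALTY = 1;

    /** Hypothetical view of one upstream host. */
    static final class Host {
        final String name;
        final Optional<String> zone; // empty when the zone is unknown
        final int inflight;          // assumed stand-in for the channel's load score

        Host(String name, Optional<String> zone, int inflight) {
            this.name = name;
            this.zone = zone;
            this.inflight = inflight;
        }
    }

    private ZoneBiasedSelector() {}

    /** Lower is better. The penalty applies only when both zones are known and differ. */
    static int score(Host host, Optional<String> localZone) {
        boolean knownCrossZone = localZone.isPresent()
                && host.zone.isPresent()
                && !localZone.equals(host.zone);
        return host.inflight + (knownCrossZone ? CROSS_ZONE_PENALTY : 0);
    }

    /** Picks the lowest-scoring host; ties keep list order in this sketch. */
    static Host pick(List<Host> hosts, Optional<String> localZone) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no hosts to choose from");
        }
        Host best = hosts.get(0);
        for (Host candidate : hosts) {
            if (score(candidate, localZone) < score(best, localZone)) {
                best = candidate;
            }
        }
        return best;
    }
}
```

Because the penalty is a small constant, it only decides the outcome when hosts are otherwise roughly equally loaded: a same-zone host with two requests in flight would still lose to an idle host in another zone.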

Possible downsides?

Obviously this would need to fail gracefully when running locally, in docker or on Azure.

Apr 28 '20 17:04 iamdanfox

How do client-perceived latencies differ between nodes in different AZs? I'd rather use that data to rank targets than to target specific cloud vendors in an RPC library. Another option is for deployment infrastructure to provide a quality-factor based on availability zones along with URIs, centralizing that discovery.

Apr 28 '20 17:04 carterkozak

So the idea here is more about $ savings than latencies tbh

Apr 28 '20 17:04 iamdanfox

Right, we can solve the problem without vendor-specific implementation.

Apr 28 '20 17:04 carterkozak

Latencies are the same, ±0.1-0.2 ms.

Apr 28 '20 20:04 j-baker

basically - this isn't a perf thing - it's a spend thing. And just to be clear it's not $0.01 as the doc implies - Amazon are sneaky and charge you on the way in and on the way out for $0.02 per GB.

Apr 28 '20 20:04 j-baker

and with latencies esp when transitives are involved you also start taking into account their good or bad decisions - because with latency you can't help but care about all the hops, whereas you really want to care about only the one you'd like to make. But nice try :)

Apr 28 '20 20:04 j-baker

Again, my point is that this is the wrong place to approach that type of problem.

Apr 28 '20 20:04 carterkozak