bugfix(etcd) - use count.index for etcd instance subnets

Open lachie83 opened this issue 9 years ago • 9 comments

etcd instances are currently pinned to a single subnet which means a single AZ.

This change uses count.index to select each instance's subnet, so the etcd instances are spread across AZs.
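As a rough illustration of the idea (not the actual Terraform in this PR), indexing a subnet list by each instance's count.index spreads the instances round-robin across subnets. The Python below is only a sketch: the subnet IDs and the function name are made up, and the wrap-around mirrors how Terraform's `element()` behaves when the index exceeds the list length.

```python
def subnet_for_instance(subnet_ids, index):
    """Pick a subnet for the instance at `index`, wrapping around the list."""
    return subnet_ids[index % len(subnet_ids)]


# Hypothetical subnets, one per AZ.
subnet_ids = ["subnet-a", "subnet-b", "subnet-c"]

# Five instances land on a, b, c, a, b instead of all on one subnet.
placement = [subnet_for_instance(subnet_ids, i) for i in range(5)]
print(placement)
```

With three etcd instances and three subnets, each instance ends up in a different subnet, and therefore a different AZ if the subnets map one-to-one to AZs.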

lachie83 avatar Oct 10 '16 18:10 lachie83

Tested with make all and confirmed I got a clean k8s cluster.

lachie83 avatar Oct 10 '16 18:10 lachie83

Thanks Lachlan. I'm hesitant to spread the etcd nodes across AZs until I have had a chance to thoroughly research the implications. I have read that when it comes to Raft it is best to keep nodes as close to each other as possible. It also seems that the guidance from Kubernetes is to keep clusters within an AZ. Federation seems to be the recommended approach to multi-AZ and multi-region support. I will be introducing examples to tack in the coming weeks.

Having said that, I think that if people like yourself are happily running etcd across AZs, it would be worthwhile to include that as a readily available option in tack. I'll keep this PR open whilst investigating.

wellsie avatar Oct 11 '16 16:10 wellsie

Thanks @wellsie. What's your stance on failure domains with regard to this project? If it's a production-ready k8s cluster then we should document the fact that it's a vertically stacked cluster confined to a single AZ (and update the Terraform to reflect that). Given your statement above, we should probably bind the worker ASG to the subnet in the same AZ as the etcd/master nodes, use only a single master/etcd node, and then stamp out clusters horizontally across AZs within the region.

lachie83 avatar Oct 11 '16 17:10 lachie83

So far I have been less concerned about spreading worker nodes across AZs - they will continue to run without the control plane. If one did not want to federate clusters then I think the current configuration would be a good starting point.

I'm not sure that running a single etcd node per cluster, even with a federated solution, would be prudent.
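To make the failure-domain trade-off in this thread concrete, here is a small sketch of the majority-quorum arithmetic that Raft-based stores such as etcd rely on. The function names and the node layouts are hypothetical, and the sketch only checks whether a quorum survives the loss of one whole AZ; it says nothing about cross-AZ latency.

```python
def quorum(cluster_size):
    """Majority needed for a cluster of this size to keep accepting writes."""
    return cluster_size // 2 + 1


def survives_az_loss(nodes_per_az):
    """True if quorum is kept no matter which single AZ goes down."""
    total = sum(nodes_per_az)
    needed = quorum(total)
    return all(total - lost >= needed for lost in nodes_per_az)


print(survives_az_loss([3]))        # three nodes in one AZ -> False
print(survives_az_loss([1, 1, 1]))  # one node in each of three AZs -> True
print(survives_az_loss([1]))        # single etcd node -> False
```

Under these assumptions, a three-node cluster pinned to one AZ and a single-node cluster are both taken out by one AZ failure, while three nodes spread over three AZs keep a majority.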

wellsie avatar Oct 12 '16 16:10 wellsie

Thanks @wellsie - LMK if I can be of any assistance with the investigation.

lachie83 avatar Oct 12 '16 19:10 lachie83

We've been running clusters with both etcd and masters split across AZs and I haven't noticed any problems so far. Then again our clusters are pretty small and we haven't had any problems with AZs themselves either.

mirthy avatar Oct 17 '16 17:10 mirthy

etcd v2 is very robust compared with etcd v1; there are no issues spreading etcd nodes across AZs :)

rimusz avatar Oct 27 '16 14:10 rimusz

@rimusz @lachie83 Hey y'all, do you know of any documentation, case studies, etc. on etcd v2 stability across data centers/AZs?

seanknox avatar Oct 30 '16 19:10 seanknox

Hi @seanknox. Unfortunately I do not. It might be worth looking into writing one up. LMK if you come across anything in your travels.

lachie83 avatar Oct 31 '16 05:10 lachie83