
Availability and Resilience

Open • puncsky opened this issue 5 years ago • 1 comment

Syed X, [Nov 18, 2019 at 3:10:51 AM]:

Hello All,

Has anyone worked on data center consolidation or upgrade projects with an emphasis on availability and resilience requirements?

I have a potential interview coming up and need to prepare the items below.

  1. Resilience strategies
  2. HA strategies
  3. DR strategies
  4. Most important - challenges faced and how they were addressed.

Thanks in advance

puncsky, Nov 19 '19 09:11

Fault tolerance, high availability, and disaster recovery all improve availability, but they are slightly different concepts: http://www.pbenson.net/2014/02/the-difference-between-fault-tolerance-high-availability-disaster-recovery/

Failover: https://tianpan.co/notes/85-improving-availability-with-failover
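To make failover concrete, here is a minimal sketch of client-side failover: try replicas in priority order and return the first success. The names `call_with_failover`, `primary`, and `standby` are hypothetical and are not taken from the linked note; a real setup would also need health checks and a way to fail back.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in priority order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    # Every replica failed, so surface all collected errors to the caller.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def primary() -> str:
    # Hypothetical primary that is currently unreachable.
    raise ConnectionError("primary is down")


def standby() -> str:
    # Hypothetical standby that takes over.
    return "served by standby"


if __name__ == "__main__":
    print(call_with_failover([primary, standby]))  # served by standby
```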

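Resilience strategies: I am not an expert on this, but I guess the book "Antifragile" covers the principles. Netflix's Chaos Monkey, which randomly terminates production instances to verify that services tolerate failures, is a practical example.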

HA: https://en.wikipedia.org/wiki/High-availability_cluster
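For interview prep, it may help to put numbers on HA. The sketch below uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and the availability of `n` redundant nodes where any one node is enough to serve. It assumes node failures are independent, which correlated failures (shared power, shared network, bad deploys) often break in practice. The function names and the 99% figure are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, n: int) -> float:
    """Availability of n redundant nodes, ASSUMING independent failures.

    The cluster is down only when all n nodes are down at the same time.
    """
    return 1.0 - (1.0 - single) ** n


def downtime_minutes_per_year(a: float) -> float:
    """Expected downtime per 365-day year for availability a."""
    return (1.0 - a) * 365 * 24 * 60


if __name__ == "__main__":
    one_node = 0.99  # illustrative: a single node that is up 99% of the time
    two_nodes = redundant_availability(one_node, 2)
    print(f"{one_node:.4f} -> {downtime_minutes_per_year(one_node):.0f} min/year")
    print(f"{two_nodes:.4f} -> {downtime_minutes_per_year(two_nodes):.0f} min/year")
```

Under the independence assumption, going from one node to two moves from "two nines" to "four nines", which is why eliminating single points of failure is usually the first HA step.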

DR: Facebook's TAO handles replication impressively (https://tianpan.co/notes/49-facebook-tao), and I assume some Google papers describe even better solutions.

Challenges: failure is always an option. I guess the most challenging part is building a system (people + machines) that handles failures automatically and escalates properly, because it is not just an engineering problem but also a management problem.
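As a rough illustration of "handle automatically, then escalate", here is a sketch of a bounded remediation loop that hands off to a human once automation runs out of attempts. Everything here is hypothetical: `remediate` stands in for whatever automatic fix applies (restart, failover, rollback) and `escalate` for a paging or ticketing hook. The management half of the problem, meaning who gets paged and what they are empowered to do, is not something code can settle.

```python
import time
from typing import Callable


def handle_failure(
    remediate: Callable[[], bool],
    escalate: Callable[[str], None],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> bool:
    """Attempt automatic remediation a bounded number of times, then escalate.

    Returns True if remediation succeeded, False if the failure was escalated.
    """
    for attempt in range(1, max_attempts + 1):
        if remediate():
            return True
        # Linear backoff between attempts; zero by default so the sketch runs instantly.
        time.sleep(backoff_seconds * attempt)
    escalate(f"automatic remediation failed after {max_attempts} attempts")
    return False


if __name__ == "__main__":
    pages = []  # stands in for a real paging system
    recovered = handle_failure(lambda: False, pages.append, max_attempts=2)
    print(recovered, pages)
```

Bounding the retries matters: unbounded automatic recovery can hide a persistent fault from the people who need to know about it.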

puncsky, Nov 19 '19 09:11