[TECH]:: Failing over without falling over / When failovers fail / Failover without collapsing in practice / ...
*Good article with numerous references on failover failings, and how to
(try to) avoid them.*
*This post was written by Adrian Cockcroft, VP of Cloud Architecture
Strategy at AWS. If you want to learn more, he will be speaking at the
AWS Chaos Engineering and Resiliency Series
<https://aws-amazon-event-chaosengineering.splashthat.com/> online event,
taking place Oct 27-28th from 11am-2pm in the AEDT timezone.*
"I’ve been working on resilient systems for many years. In the 1990s, as a
Distinguished Engineer at Sun Microsystems, I helped some of the first
Internet sites through their early growth pains and joined eBay in 2004
with the title Distinguished Availability Engineer. In 2007 Netflix hired
me to help them build scalable and resilient video streaming services and
in 2010 I led their transition to a cloud-based architecture on AWS. For
the last ten years, I’ve been building and talking about Chaos Engineering,
multi-zone and multi-region cloud architectures, and modernizing ...
As applications move online and digital automation extends to control more
of the physical world around us, software failures have an increasing
impact on business outcomes and safety. We need to develop more resilient
systems, and that can’t be left as an operational concern. Engineers need
to architect resilience into the application code, and operability is one
of the most important attributes of a resilient system. The operator
experience needs to be clear and responsive, especially during a failure.
We’ve seen many examples of small initial problems escalating, as poorly
designed and tested error-handling code and procedures fail in ways that
magnify the problem, and take out the whole system.
What can we do about this? To start with, it’s a shared responsibility
across your technical teams to build and operate systems that are
observable, controllable, and resilient. With the integration of roles from
DevOps practices and the automation provided by cloud providers, we need to
adapt common concepts and terminology that already exist in resilient
systems design for cloud-native architectures.
What should your system do when something you depend on fails? There are
three common outcomes. It could stop until whatever failed is restored; it
could work around the failure and continue with reduced functionality; or
it could fall over and cause an even bigger failure! Unfortunately, for
many systems today, the third outcome is the default, either because they
were not architected to survive or, more critically, they were not tested
regularly to prove that the architected solutions worked as intended. ..."
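The first two outcomes in the quoted passage can be sketched in a few lines of Python. This is only an illustration, not anything taken from the article: the recommendation service, `DependencyError`, `fetch_recommendations`, and the default list are all hypothetical names invented here, and the dependency is hard-coded to fail so the behavior is easy to see. The third outcome (falling over) is what tends to happen when neither path has been written and tested; it is noted in a comment rather than modeled.

```python
from typing import Callable, List


class DependencyError(Exception):
    """Raised by the hypothetical dependency when it is unavailable."""


def fetch_recommendations(user_id: str) -> List[str]:
    # Hypothetical dependency call; it always fails here to simulate an outage.
    raise DependencyError("recommendation service unavailable")


# Assumed static fallback used when the dependency is down.
DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2"]


def stop_until_restored(user_id: str) -> List[str]:
    # Outcome 1: let the failure propagate, so the caller stops
    # until whatever failed is restored.
    return fetch_recommendations(user_id)


def degrade_gracefully(
    user_id: str,
    fetch: Callable[[str], List[str]] = fetch_recommendations,
) -> List[str]:
    # Outcome 2: work around the failure and continue with
    # reduced functionality (a generic, non-personalized list).
    try:
        return fetch(user_id)
    except DependencyError:
        return list(DEFAULT_RECOMMENDATIONS)


# Outcome 3 (falling over) is not modeled: it is typically what happens
# when the error-handling path itself is untested, for example when it
# raises an unexpected exception or retries without any limit.

if __name__ == "__main__":
    print(degrade_gracefully("user-42"))
    try:
        stop_until_restored("user-42")
    except DependencyError as exc:
        print(f"stopped: {exc}")
```

Passing `fetch` as a parameter is a deliberate choice in this sketch: it lets a test swap in a failing dependency, which is one small way to exercise the fallback path regularly rather than only during a real outage.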