“The way forward is to assume that failure for critical infrastructure will happen in an unforeseen way, and to start building those assumptions in the overall system design,” Red Hat Chief Technologist for FSI, APAC, Vincent Caldeira told PYMNTS. Find out how Caldeira builds for resilience in Black Swan, a special report exclusively from PYMNTS.
Black swans highlight our blindness to randomness, and in particular to outlier events whose extremely large impact is exacerbated by the fact that they are unexpected. This is highly relevant to how we build resilience into the critical technology infrastructure of the financial system that supports our economy.
Traditionally, designing such infrastructure for high resilience has meant assessing systems for potential single points of failure and removing them, providing redundancy at all possible layers of infrastructure and software, and automating recovery processes against all potential known scenarios of failure.
While all of this is necessary, I believe it would be short-sighted to think it is sufficient. One case in point is financial institutions (FIs) moving their workloads to public cloud providers, on the view that cloud-distributed systems with multiple data centers providing seamless redundancy will be less vulnerable to failure than their own infrastructure. However, in April 2011, many Amazon Web Services (AWS) customers, including some high-profile websites, lost access to their systems following an outage at one of the North American data centers.
The reality is that in normal conditions, cloud systems – which are designed to be self-monitoring and self-repairing – can handle expected failures seamlessly, with little to no impact on users. When something unexpected goes wrong, however, the behavior of such complex systems becomes hard to predict.
Consequently, the way forward is to assume that failure for critical infrastructure will happen in an unforeseen way, and to start building those assumptions in the overall system design. This approach would not rely simply on redundancy, but rather on designing a componentized, distributed system of loosely coupled service components and data stores, so that “mission-critical” applications can keep working even when things go wrong. It also ties in with distributed architectures based on microservices and event-driven, inter-service communication, and with the ability to automatically move system components across the hybrid cloud so that computing and storage resources are available at short notice.
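To make the idea of loose coupling concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any Red Hat product or of a real message broker: the `EventBus`, `Ledger` and `Notifier` classes, the `payment.accepted` topic and the `run_demo` helper are all hypothetical names, and the in-memory bus stands in for whatever messaging layer a real system would use. The point it shows is that when one consumer fails unexpectedly, the publisher and the other consumers keep working, and the undelivered events are held for a later retry.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}
        # Events a consumer could not process, kept for a later retry.
        self.dead_letters: Deque[Tuple[Handler, dict]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers.get(topic, []):
            try:
                handler(event)
            except Exception:
                # A failing consumer must not take down the publisher
                # or the other consumers: park the event and move on.
                self.dead_letters.append((handler, event))

    def retry_dead_letters(self) -> int:
        """Retry each parked event once; return how many were delivered."""
        delivered = 0
        for _ in range(len(self.dead_letters)):
            handler, event = self.dead_letters.popleft()
            try:
                handler(event)
                delivered += 1
            except Exception:
                self.dead_letters.append((handler, event))
        return delivered


class Ledger:
    """Mission-critical consumer: records every accepted payment."""

    def __init__(self) -> None:
        self.entries: List[int] = []

    def on_payment(self, event: dict) -> None:
        self.entries.append(event["id"])


class Notifier:
    """Non-critical consumer that is unavailable at first."""

    def __init__(self) -> None:
        self.available = False
        self.sent: List[int] = []

    def on_payment(self, event: dict) -> None:
        if not self.available:
            raise ConnectionError("notifier unreachable")
        self.sent.append(event["id"])


def run_demo():
    bus = EventBus()
    ledger, notifier = Ledger(), Notifier()
    bus.subscribe("payment.accepted", notifier.on_payment)
    bus.subscribe("payment.accepted", ledger.on_payment)

    # The notifier is down, yet payments are still recorded.
    for payment_id in (1, 2, 3):
        bus.publish("payment.accepted", {"id": payment_id})
    during_outage = (list(ledger.entries), list(notifier.sent), len(bus.dead_letters))

    # Once the notifier recovers, the parked events are delivered.
    notifier.available = True
    recovered = bus.retry_dead_letters()
    return during_outage, recovered, list(notifier.sent)
```

Calling `run_demo()` walks through an outage and a recovery. The design choice worth noting is that the bus, not the publisher, absorbs the failure: the service accepting payments never needs to know which downstream component is unhealthy.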