Chaos Engineering – increasing the robustness of systems
Software development seeks to create useful software with little effort. Software is useful if it works correctly, reliably, safely and dependably while using existing resources efficiently.[1] The reliability of software and systems is usually checked with unit tests, integration tests and system tests, collectively known as the test pyramid. However, as distributed systems become increasingly comprehensive and complex, such tests reach their limits. This is where chaos engineering comes in.
Chaos engineering attempts to increase the robustness of systems by testing their resilience deliberately beyond, rather than within, the specified performance characteristics. The aim is not to find existing errors, but to provoke them. For example, systems are put under excessive load and stress, with consequences that can be predicted only to a limited extent. In other words, chaos engineering is an experimental approach to confirm or refute hypotheses about behaviour beyond the specifications. The goal is to identify fundamental weaknesses, eliminate them and thus increase the robustness, resilience and performance of the system.
The beginnings of Chaos Engineering
The pioneer of chaos engineering was the streaming service Netflix. In 2011, the company developed Chaos Monkey, a tool for testing the resilience of IT infrastructures. The name evokes a monkey that rampages through a data centre, ripping out cables at random and destroying equipment. The challenge is to design systems that keep working in spite of such monkeys.[2] Netflix therefore began deliberately disabling computers in its production network to test how the remaining systems reacted to the outage. Today, Chaos Monkey is part of a suite of tools called Simian Army [3], and chaos engineering matters to many (internet) companies with very large, distributed systems, because these systems are highly complex and their robustness and resilience are key competitive and revenue drivers. An overview of companies known to use chaos engineering, including tools and practices, can be found here.
The process of Chaos Engineering
Although the name may suggest otherwise, chaos engineering is not chaotic, but planned. The first thing to do is to determine which areas or components are to be stressed beyond their limits:
- network or
These three areas are sometimes referred to as levels or tiers in chaos engineering.[4] A combination of levels is of course conceivable. Subsequently:
- a hypothesis is defined and
- the extent of possible damage is predicted (also called the explosion radius or blast radius),
- the test is started,
- the effects are observed and measured with appropriate key figures,
- and the key figures are evaluated.
If errors and weak points are identified, the goal of the procedure has been achieved. If the systems continue to function despite the test, the explosion radius is increased, for example by increasing parallel access to an infrastructure, sending larger amounts of data through a network or switching off additional components. Only once the system is “broken” and errors have been generated does the experiment end and the elimination of the vulnerabilities found begin (which in practice can take considerable time and be expensive). Ideally, the entire process is then automated.
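The escalation described above can be illustrated with a minimal sketch. It is not taken from any particular chaos engineering tool: `run_experiment`, `Observation` and `simulated_system` are hypothetical names, and the simulated system (five nodes, of which two may fail without errors) is an assumption chosen purely for illustration. The explosion radius appears in the code as `blast_radius`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    blast_radius: int      # how many components were disabled in this round
    error_rate: float      # key figure measured during the round
    hypothesis_held: bool  # True if the key figure stayed within the limit


def run_experiment(
    inject_fault: Callable[[int], float],
    max_error_rate: float,
    max_blast_radius: int,
) -> List[Observation]:
    """Increase the blast radius step by step until the hypothesis
    'error rate stays at or below max_error_rate' is refuted,
    or until the agreed maximum radius has been reached."""
    observations = []
    for radius in range(1, max_blast_radius + 1):
        error_rate = inject_fault(radius)       # start the test and measure
        held = error_rate <= max_error_rate     # evaluate the key figure
        observations.append(Observation(radius, error_rate, held))
        if not held:
            break  # the system is "broken": stop and fix the weakness
    return observations


def simulated_system(disabled_nodes: int) -> float:
    """Hypothetical system: 5 nodes, tolerates the loss of 2 without errors."""
    total_nodes, tolerated = 5, 2
    if disabled_nodes <= tolerated:
        return 0.0
    return (disabled_nodes - tolerated) / (total_nodes - tolerated)


results = run_experiment(simulated_system, max_error_rate=0.01, max_blast_radius=5)
for obs in results:
    print(obs)
```

In this sketch the hypothesis holds while one or two nodes are disabled and is refuted at the third, at which point the loop stops. In a real experiment, `inject_fault` would be replaced by an actual fault injection and a measurement taken from monitoring.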
Tips for Chaos Engineering
In practice, it is advisable to test aspects step by step and in a certain order [5]:
- Known Knowns – things that companies are aware of and understand.
- Known Unknowns – things that companies are aware of but do not fully understand.
- Unknown Knowns – things that companies understand but are not aware of.
- Unknown Unknowns – things that companies are neither aware of nor fully understand.
In addition, there are a few more tips:
- It is important to know the status quo of a system, including system context and functioning. This knowledge is a prerequisite for all hypotheses.
- The system, the interaction of different components and the flow of data must be observed. Monitoring is the corresponding keyword.
- Metrics are needed to measure the results.
- Step by step, tests should be automated. The goal could be a transfer to a Continuous Integration / Continuous Delivery pipeline.
- In Site Reliability Engineering (SRE), there is a concept of game days, on which old incidents are replayed. Chaos engineering follows this approach as well. Put differently: without the elimination of identified vulnerabilities, the concept makes little sense.
- It is not possible to foresee all conceivable errors in chaos engineering. This is precisely why hypotheses, metrics and the assessment of findings are very important.
- Communication within the company is also important, both about the basic procedure and about each concrete test.
- And of course there should also be a rollback plan. After all, it is an experiment to gain knowledge and not an approach to permanently destroy the system under consideration.
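Several of these tips (knowing the status quo, measuring with a metric, having a rollback plan) can be combined in a small sketch. Everything here is an assumption made for illustration: the system is a plain dictionary, and `fault_injected` and `healthy_share` are hypothetical helpers rather than part of any real tool. The point is only that the rollback runs even if the experiment is aborted.

```python
from contextlib import contextmanager


@contextmanager
def fault_injected(system: dict, component: str):
    """Mark one component as down for the duration of the experiment
    and restore its previous state afterwards, even if an error occurs."""
    previous = system[component]
    system[component] = "down"
    try:
        yield system
    finally:
        system[component] = previous  # rollback plan: always restore


def healthy_share(system: dict) -> float:
    """Metric: share of components that are currently up."""
    up = sum(1 for state in system.values() if state == "up")
    return up / len(system)


# Hypothetical system under test
system = {"web-1": "up", "web-2": "up", "db-1": "up"}

baseline = healthy_share(system)       # status quo before the experiment
with fault_injected(system, "web-2"):
    during = healthy_share(system)     # measured while the fault is active
after = healthy_share(system)          # must match the baseline again

print(baseline, during, after)
```

Using a context manager keeps the rollback next to the fault injection, so a test that fails or is interrupted cannot leave the component disabled.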
Advantages of Chaos Engineering
In practice, chains of events for which companies are unprepared occur again and again, and they can be very expensive. If, for example, there is a system failure at a car-sharing provider and no cars can be rented as a result, the loss of revenue increases by the minute. Companies therefore have a great interest in fault-tolerant, redundant and thus resilient systems. Chaos engineering is one approach to developing and operating such systems.
There are some significant advantages to using the approach:
- Increasing the robustness or resilience of a system is the ultimate benefit.
- Applying the approach sharpens awareness of possible component or system failures and their consequences. For example, it reveals dependencies within and between different technical levels that are not always obvious to everyone involved. This in turn increases the understanding of the system’s behaviour.
- Once the findings have been evaluated and the weak points discovered have been eliminated, the experiment can be automated. In this way, it contributes directly to the reliability of the system.
- And last but not least, the approach generates errors in a controlled environment before they actually occur and cause real damage. This avoids future problems.
For the sake of completeness, it should be mentioned that the approach is also costly: eliminating identified vulnerabilities requires planning, implementation and verification, all of which need financial and organisational resources. Despite all the advantages, companies must be able to afford chaos engineering.
A question for discussion:
The more comprehensive and complex production systems are, the more difficult it is to operate corresponding test systems. Does this mean that chaos engineering only takes place in live operation or can it also be carried out in test systems?
[1] Goals in software development (in German)
[2] Antonio Garcia Martinez explains the term Chaos Monkeys in his book of the same name.
[3] A Brief History of Chaos Engineering
[4] With people, practices and processes, there are also organisational levels that should be considered (e.g. are there processes that have a negative impact on individual applications?). Opinions vary as to whether these levels can or should be addressed by chaos engineering.
[5] Which experiments do you perform first? (including an example)
On Gremlin’s website you can download a 2021 State of Chaos Engineering Report.
If you are interested in an exchange on this topic, you can become a member of a worldwide community here.