The idea of chaos engineering

Guest contribution by Felix Stein | 31.07.2023

Among the various agile frameworks, chaos engineering is one of the lesser known, and is likely to remain so. There are no well-known founding figures as in the case of Scrum, no commercial organisation behind it as in the case of SAFe and (a key point) no certificates through which the agile-industrial complex [1] would contribute to its spread. Instead, chaos engineering has something quite different: practical relevance.

The logic of chaos engineering

The approach was developed around 2010 at Netflix [2], the American video streaming service. Over time, it became painfully clear to the developers there that they could reproduce neither the system behaviour nor the user behaviour on their test environments exactly as it occurred in unpredictable reality. Repeated failures were the result. Their solution: shift a large part of quality assurance to live operation.

The underlying logic: if faults in the live environments are unavoidable, then they should at least be detected quickly and remedied immediately. And if possible, faults should occur during working hours, when someone is there to fix them, rather than sometime during the night. To achieve this, the systems are subjected to constant stress tests during the day so that these faults can be deliberately triggered and fixed.

“Knowing that there would be guaranteed server outages, we wanted these outages to occur during business hours when we are on site to fix any damage. We knew we could rely on our technicians to come up with resilient solutions if we gave them the context in which to expect server outages. If we could train our technicians to develop services that could survive a server failure as a matter of course, then it wouldn’t be a big deal if it happened accidentally. In fact, our members wouldn’t even notice it. This was the case.” [3]

The core and principles of chaos engineering

The technical core of chaos engineering is the so-called Monkey Army [4], a group of programmes that carry out these tests. The best known of these are the Chaos Monkey, which randomly shuts down live servers for a short time, and the Chaos Gorilla, which does the same with entire data centres. In addition, there are the Latency Monkey, the Conformity Monkey and the Security Monkey, among others, most of which are published as open source.
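To make the mechanism more tangible, here is a minimal sketch, in Python, of what a Chaos-Monkey-style injector could look like. It is not the real Netflix tooling: the fleet list and the terminate_instance function are hypothetical stand-ins for whatever cloud or orchestration API would actually shut a server down, and the business-hours check reflects the rule that faults should be triggered while someone is there to fix them.

```python
import random
from datetime import datetime

# Hypothetical fleet; in a real setup the list of instances would come
# from the cloud or orchestration API (e.g. AWS, Kubernetes).
FLEET = ["web-01", "web-02", "web-03", "api-01", "api-02"]


def terminate_instance(instance_id: str) -> None:
    """Placeholder for the actual call that shuts a server down."""
    print(f"Terminating {instance_id} ...")


def chaos_monkey_run(probability: float = 0.2) -> None:
    """Randomly terminates one instance, but only during business hours,
    so that the people who have to fix the fallout are at their desks."""
    now = datetime.now()
    if now.weekday() >= 5 or not 9 <= now.hour < 17:
        return  # outside business hours: no experiments
    if random.random() < probability:
        terminate_instance(random.choice(FLEET))


if __name__ == "__main__":
    chaos_monkey_run()
```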

The methodological framework around the technique is formed by the principles of chaos engineering [5], which provide a rough guide for applying the framework. At the core are four basic principles:

  1. First, define “steady state” as a measurable output of a system that indicates normal behaviour.
  2. Hypothesise that this steady state will persist in both the control group and the experimental group.
  3. Introduce variables that reflect real events, e.g. servers crashing, hard drives not working, network connections being interrupted, etc.
  4. Try to disprove the hypothesis by looking for a difference between the control group and the experimental group.

In other words: define a measurable, stable baseline condition, unleash the monkey army on it, and when something breaks, find out the cause. It is important to note at this point that these principles do not yet mention testing on the live environment, which is central at Netflix. This makes chaos engineering usable even for critical applications, such as the operation of power grids, which should not be shut down for testing.
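To illustrate how the four principles fit together, here is a minimal Python sketch of one such experiment, run against synthetic data. The metric, the fault injection and the tolerance are invented for the example; in a real setup the steady-state metric would come from production monitoring and the fault from an actual crashed server.

```python
import random


def error_rate(samples: list[float]) -> float:
    """Steady-state metric: share of failed requests in a sample window.
    In a real setup this would come from production monitoring."""
    return sum(samples) / len(samples)


def inject_fault(samples: list[float]) -> list[float]:
    """Hypothetical fault injection: simulates e.g. a crashed server by
    letting a small fraction of requests in the experimental group fail."""
    return [1.0 if random.random() < 0.05 else s for s in samples]


# Principle 1: steady state = error rate of the control group (synthetic here).
control = [0.0] * 1000

# Principle 2: hypothesis that the steady state persists in both groups.
# Principle 3: introduce a variable reflecting a real event (a crashed server).
experimental = inject_fault(control)

# Principle 4: try to disprove the hypothesis by comparing the two groups.
difference = abs(error_rate(experimental) - error_rate(control))
tolerance = 0.01  # how much deviation still counts as "steady"

if difference > tolerance:
    print(f"Hypothesis disproved: error rate deviates by {difference:.2%}")
else:
    print("Steady state held: the system absorbed the fault")
```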

For other applications, where a downtime of a few seconds or minutes is not critical, there are the “Advanced Principles”, which are more difficult to implement but also bring much higher added value:

  1. Create a hypothesis about the behaviour in steady state.
  2. Vary real world events.
  3. Experiment in production.
  4. Automate experiments for continuous execution.
  5. Minimise the blast radius.

Again, in simplified terms: define a measurable, stable initial state in the live environment, unleash the monkey army on it in ever-new variations, and work to minimise the impact. The last point is the big goal here: through ever better fallback mechanisms and ever greater independence of components and regions, the impact of errors and failures becomes smaller and smaller. [6]
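What the fourth and fifth advanced principle could mean in practice is sketched below: an experiment that runs continuously but, per run, only ever touches a small slice of the fleet. The fleet, the fault type and the one-percent limit are assumptions made for this example; real tooling would pull the instance list from the platform and inject the fault through its API.

```python
import random
import time

# Hypothetical fleet; a real setup would query the platform for live instances.
FLEET = [f"instance-{i:03d}" for i in range(200)]

BLAST_RADIUS = 0.01  # never disturb more than 1 % of the fleet at once


def inject_latency(instance_id: str, millis: int) -> None:
    """Placeholder for an actual fault injection (latency, shutdown, ...)."""
    print(f"{instance_id}: adding {millis} ms of latency")


def continuous_experiment(interval_seconds: int = 3600) -> None:
    """Automated, continuously running experiment (advanced principle 4)
    with a deliberately small blast radius (advanced principle 5)."""
    while True:
        victims = random.sample(FLEET, max(1, int(len(FLEET) * BLAST_RADIUS)))
        for victim in victims:
            inject_latency(victim, millis=500)
        time.sleep(interval_seconds)
```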

What does chaos engineering have to do with agility?

Isn’t it about time for the section on roles, meetings, responsibilities and delivery cycles? Not really. Such elements do exist in some frameworks (especially Scrum and SAFe), but agility is something much more fundamental: the ability to deliver and respond in short intervals – and Chaos Engineering definitely contributes to this with its focus on rapid failure detection and system resilience.

Beyond that, however, there is a second “agile aspect”: this system resilience is not usually brought about all at once, i.e. in a big bang [7]. Instead, it is supposed to grow and expand gradually, which corresponds quite closely to the iterative-incremental approach of practically all agile frameworks. In a sense, the system resilience itself is the product that is constantly being developed on the basis of real application experience.

In implementation, this “incremental resilience” can mean exposing the system to smaller, still manageable disruptions first. As soon as there is a functioning compensation mechanism for these, somewhat larger ones can follow, and as soon as these can also be compensated for, larger ones again, and so on. Examples of such smaller disturbances would be moderate increases in usage at the beginning, or an initially only slight reduction in the available storage space.

These examples give a good idea of what the increasingly demanding experiments (this is what stress tests are called in chaos engineering) might look like. The extent of the disruptive factors (in these cases, increasing usage intensity and shrinking storage space) can be ramped up with each experiment, to the point where the complete failure of an entire AWS region [8] is simulated.

Furthermore, in later “resilience increments” it also makes sense to run different experiments simultaneously on the same system in order to recognise possible interdependencies as well. To stay with the examples: increasing usage intensity and shrinking storage space at the same time, then, in a next experiment, additionally simulating transmission disturbances or failing lines.
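Put into a (hypothetical) experiment plan, such a ramp-up could look like the following sketch. The disturbance types and percentages are invented for illustration, and run_experiment is a placeholder for applying the disturbances and checking whether the steady-state metric still holds.

```python
# Hypothetical experiment plan: each entry ramps up the disturbances and,
# in later "resilience increments", combines several of them at once.
EXPERIMENT_PLAN = [
    {"extra_load_percent": 10, "disk_reduction_percent": 0},
    {"extra_load_percent": 25, "disk_reduction_percent": 10},
    {"extra_load_percent": 50, "disk_reduction_percent": 25},
    # later increments add further fault types to the mix:
    {"extra_load_percent": 50, "disk_reduction_percent": 25,
     "packet_loss_percent": 5},
    {"extra_load_percent": 75, "disk_reduction_percent": 40,
     "packet_loss_percent": 10},
]


def run_experiment(step: dict) -> bool:
    """Placeholder: would apply the disturbances and check whether the
    steady-state metric still holds. Returns True if the system coped."""
    print(f"Running experiment with {step}")
    return True


for step in EXPERIMENT_PLAN:
    # Only move on to the next, harder experiment once the current
    # disturbance level can be compensated for.
    if not run_experiment(step):
        print("Compensation mechanism missing, stopping the ramp-up here")
        break
```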

Scrum and “social chaos engineering”

In theory, chaos engineering could even be implemented according to Scrum, with one experiment as the sprint goal [9] and the associated implementation, monitoring and stabilisation measures as backlog items. The subject of the Sprint Reviews would be the system failures that were successfully prevented (or did occur); the invited stakeholders would be those responsible for the applications operated on the system, who could then say how much additional fail-safety they need.

In such contexts, one could even go a step further and apply chaos engineering to the implementing team itself. Here, too, individuals can be temporarily removed from the workflows to see whether the others can compensate for this failure. If they cannot, a problem has been identified that would cause disruption or downtime the next time someone is on holiday or off sick.

Fixing the problem is then relatively simple, because in almost all cases one of two root causes is present. Either the other employees lack the knowledge to take over the activities of the absent colleague, which can be compensated for by moving them towards T-shaped or full-stack profiles. Or they lack access rights to the systems the absent colleague is responsible for, which can be solved by simply granting them.

A third root cause, which is fortunately becoming rarer, is when the absent colleague holds Code Ownership in parts of the application, i.e. it has been agreed that only he or she is allowed to make changes there. Another variant of this problem is when only one person is allowed to accept pull requests for certain areas, i.e. to approve changes. In both cases, the solution is simple: you just have to abolish these limiting rules.

The results of “social chaos engineering” are often surprising because knowledge monopolies, authorisation bottlenecks and Code Ownership are not always made explicit. Often, they have emerged rather unnoticed and their scope is underestimated even by those involved. This makes it all the more important to find out whether they are present. And there is also a nice side effect: the temporarily absent colleagues have time for further training or overtime reduction.

Conclusion

Chaos engineering is an agile framework that is interesting precisely because it is recognisably different from the others. You just have to get away from seeing “Agile” as merely a collection of roles, meetings and processes in order to recognise the point of it. As said at the beginning, it will not become the next big hype: it is too technical and not profitable enough for the HR departments, management consultancies and training providers (as there are no certifications). In most development units, however, the idea will meet with interest or even enthusiasm, as it is immediately clear there what contribution this framework can make to agility and resilience. And with social chaos engineering, it can even be used for team development.

 

Notes (partly in German):


[1] Was ist der agil-industrielle Komplex?
[2] [4] The Netflix Simian Army
[3] Netflix Technology Blog: Chaos Engineering Upgraded
[5] Principles of Chaos Engineering
[6] An example: From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
[7] Big Bang-Releases
[8] AWS suffers another outage as East Coast datacenter loses power
[9] Sprintziele

Felix Stein regularly publishes articles in the justifiably very well-known and popular German-language blog On Lean and Agility. Definitely worth more than one visit!


Felix Stein has published another article in the t2informatik Blog:

t2informatik Blog: The idea of #NoEstimates


Felix Stein


Felix Stein was once one of those project managers who constantly asked why everything was taking so long. His decision to find out what the problem was and to look for ways to improve it had greater consequences than he thought at the time – he has now been working as an Agile Coach, Lean Coach, Scrum Master and in various other roles in the agile environment for more than a decade.

Felix Stein is co-founder and co-owner of Agile Process GmbH, a company that supports agile transitions and is itself organised internally according to principles such as openness, transparency and collaboration on an equal footing.