Voluntary disruption of the system

Active testing

Increase system resilience - e.g. Chaos Monkey at Netflix

The concept of voluntary disruption of the system

Traditionally, testing tries to find problems in the product, which is purely Shift Left Testing. With Shift Right Testing, quality is not just about the product but also about the system that delivers it. This topic comes up especially when the organization is concerned about operations and wants to see how the product and the organization handle events that can disrupt the system and challenge its robustness; in a word, its "resilience". A resilient system can be handled carelessly [Taleb 2012] and can still face a “Black Swan” [Taleb 2010]. As with any other non-functional requirement (NFR), reaching such a level of robustness requires multiple kinds of tests.

To test this resilience, attempts to disrupt the system should take place in a production-like environment and also in production, for example by disabling certain parts and observing how the whole system responds. Even if an unexpected situation occurs, the expected quality characteristics of the product should not be significantly reduced.
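
The idea of disabling one part and checking that quality is not significantly reduced can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Replica` and `FailoverService` classes, the failover strategy, and the success rate used as the quality characteristic do not come from any particular product.

```python
from dataclasses import dataclass


class ReplicaDown(Exception):
    """Raised when a disabled replica is called (hypothetical)."""


@dataclass
class Replica:
    name: str
    enabled: bool = True

    def handle(self, request: str) -> str:
        if not self.enabled:
            raise ReplicaDown(self.name)
        return f"{self.name} served {request}"


class FailoverService:
    """Tries each replica in turn until one of them answers."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ReplicaDown:
                continue  # compensate locally: try the next replica
        raise RuntimeError("no replica available")


def success_rate(service, n_requests: int) -> float:
    """Share of requests answered; stands in for a quality characteristic."""
    ok = 0
    for i in range(n_requests):
        try:
            service.handle(f"req-{i}")
            ok += 1
        except RuntimeError:
            pass
    return ok / n_requests


replicas = [Replica("a"), Replica("b"), Replica("c")]
service = FailoverService(replicas)

baseline = success_rate(service, 100)
replicas[0].enabled = False  # voluntary disruption: disable one part
disrupted = success_rate(service, 100)

print(f"baseline={baseline:.2f} disrupted={disrupted:.2f}")
```

In a real campaign the disabled part would be a process, a node or a dependency rather than an in-memory object, and the comparison would rely on production metrics.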

In terms of Panarchy [Moustier 2020], testing the resilience of the system means deliberately driving it to its Ω phase and ensuring that it returns to a K phase by itself, in the shortest time and with the smallest loss of resources.
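
Recovery time and lost resources can both be measured during a disruption exercise. The sketch below is one possible way to do it, assuming monitoring samples that record how many requests were served and how many failed; the `Sample` structure and the `recovery_report` function are hypothetical names, and counting failed requests as "lost resources" is a simplification.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    t: float      # timestamp in seconds
    served: int   # requests answered correctly during the interval
    failed: int   # requests lost during the interval


def recovery_report(samples):
    """Return the recovery time and the number of lost requests.

    The disruption starts at the first sample with failures and ends at
    the first failure-free sample that follows the last failure.
    """
    disruption_start: Optional[float] = None
    recovered_at: Optional[float] = None
    lost = 0
    for sample in samples:
        lost += sample.failed
        if sample.failed > 0:
            if disruption_start is None:
                disruption_start = sample.t
            recovered_at = None  # still failing, or a relapse
        elif disruption_start is not None and recovered_at is None:
            recovered_at = sample.t
    if disruption_start is None:
        return {"recovery_time": 0.0, "lost": 0}
    if recovered_at is None:
        return {"recovery_time": None, "lost": lost}  # never recovered
    return {"recovery_time": recovered_at - disruption_start, "lost": lost}


timeline = [
    Sample(0.0, 100, 0),
    Sample(1.0, 60, 40),   # disruption begins
    Sample(2.0, 80, 20),
    Sample(3.0, 100, 0),   # back to a steady state
    Sample(4.0, 100, 0),
]
report = recovery_report(timeline)
print(report)
```

Tracking these two figures from one exercise to the next shows whether the system recovers faster and loses less over time.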

Impact on the testing maturity

DevOps pushes the paradigm “Fail big, fail fast, fail often” [Pontefract 2018][Shore 2004][Khanna 2015] for both Shift Left and Shift Right strategies because it reveals issues early, before customers complain about them; the more often such failures are provoked, the less customers have to complain.

To achieve this, testers may inject faults. In software, Fault Injection Testing (FIT) can be performed at compile time, notably by changing the source code to simulate faults in the software system [Gillis 2019], or by overriding the behaviour of libraries [Marinescu 2009]. This highlights, for instance, how the invoked parts behave in an unexpected context [Shore 2004]. The expected underlying development strategy is named “defensive programming” [McConnell 2004], which can be adapted to trigger the appropriate alert procedure.

Architecture of the library fault injector [Marinescu 2009]
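
As a rough analogy to overriding the behaviour of a library, the sketch below uses Python's standard `unittest.mock.patch` to make the built-in `open` call fail during a test, then checks that defensively written code falls back to a safe default and records an alert. It does not reproduce the tool described in [Marinescu 2009]; `load_config`, `DEFAULT_CONFIG`, `alerts` and the `settings.json` path are hypothetical names chosen for illustration.

```python
import json
from unittest import mock

DEFAULT_CONFIG = {"mode": "safe"}
alerts = []  # stands in for an alert procedure


def load_config(path):
    """Defensive loader: never lets an I/O or parsing fault propagate."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, json.JSONDecodeError) as error:
        alerts.append(f"config unavailable: {error!r}")
        return dict(DEFAULT_CONFIG)


def load_config_under_io_fault():
    """Override the library call so that it raises an injected fault."""
    with mock.patch("builtins.open", side_effect=OSError("injected disk fault")):
        return load_config("settings.json")


config = load_config_under_io_fault()
print(config, alerts)
```

The fault is injected when the test runs, without touching the source code of `load_config`, so the same code can be exercised with and without the fault.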


FIT can also occur at runtime: a fault is injected into the software while it is running, under specific conditions such as a set duration or a schedule [Gillis 2019], as Netflix's "Chaos Monkey" does. This approach is also called "Chaos engineering"; it creates turbulence and unexpected conditions to see how the system adapts to compensate for local failures.
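
A runtime injector bounded by a schedule and a duration can be sketched as follows. This is not how Chaos Monkey is implemented; the `ChaosWindow` class, the logical clock expressed in ticks, the failure rate and the retry policy are all assumptions made to keep the example small and repeatable.

```python
import random


class InjectedFault(Exception):
    """Marks a failure caused on purpose by the injector."""


class ChaosWindow:
    """Injects faults only during [start, start + duration)."""

    def __init__(self, start, duration, failure_rate, seed=0):
        self.start = start
        self.end = start + duration
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so that runs are repeatable

    def active(self, now):
        return self.start <= now < self.end

    def maybe_fail(self, now):
        if self.active(now) and self._rng.random() < self.failure_rate:
            raise InjectedFault(f"fault injected at t={now}")


def call_with_retry(operation, now, attempts=3):
    """Local compensation: retry a few times before giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation(now)
        except InjectedFault as error:
            last_error = error
    raise last_error


def service(now, chaos):
    chaos.maybe_fail(now)  # injection point inside the running service
    return "ok"


chaos = ChaosWindow(start=10, duration=10, failure_rate=0.5, seed=42)
failed_ticks = []
for tick in range(30):
    try:
        call_with_retry(lambda now: service(now, chaos), tick)
    except InjectedFault:
        failed_ticks.append(tick)

print(f"ticks that still failed after retries: {failed_ticks}")
```

Outside the window the service is never disturbed; inside it, the retries absorb part of the injected faults, and the remaining failures show where the compensation is not sufficient.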

Security is among the causes of disruption and leads to very specific kinds of testing. Penetration Testing, most commonly known as “Pentesting”, is also a form of voluntary disruption of the system. These tests consist of trying to spot the weakest points of the system and eventually reaching an asset to demonstrate how feasible it would be to do further harm.

The ultimate disruption that could happen to a system is a disaster, whether natural or not. To face such events, mature organizations provide Business Continuity Plans and Disaster Recovery Plans. Unfortunately, plans are just wishful thinking as long as they have not faced reality; moreover, disasters do not happen on a regular basis. To enable the solution to face such Black Swans, simulations and exercises are usually run to test the Business Continuity Management System (BCMS) [BS-EN-ISO-22301 2019] [US-NFPA 1600 2010].

Agilitest’s standpoint on this practice

As a test script automation platform, Agilitest may support your attempts to disrupt the system. This could be done by running fault injection actions in parallel with the test scenario.

To fully automate such robustness testing scenarios, the design of the product should embed testability features that let tests pilot the underlying parts, as in the fault injector architecture. This would lead to built-in quality [SAFe 2021-27].
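
One possible shape for such a testability feature is a fault switch that the product exposes and that the test scenario arms or disarms while it runs. The sketch below only illustrates this design idea; `FaultSwitch`, `PaymentGateway`, `Checkout` and the `"gateway.charge"` injection point are hypothetical and are not part of Agilitest or of any real product.

```python
class InjectedFault(Exception):
    """Marks a failure triggered through the testability hook."""


class FaultSwitch:
    """Built-in hook that lets a test arm named injection points."""

    def __init__(self):
        self._armed = set()

    def arm(self, point):
        self._armed.add(point)

    def disarm(self, point):
        self._armed.discard(point)

    def check(self, point):
        if point in self._armed:
            raise InjectedFault(point)


class PaymentGateway:
    def __init__(self, faults):
        self.faults = faults

    def charge(self, amount):
        self.faults.check("gateway.charge")  # injection point
        return f"charged {amount}"


class Checkout:
    """Degrades gracefully when the gateway fails."""

    def __init__(self, gateway):
        self.gateway = gateway

    def pay(self, amount):
        try:
            return {"status": "paid", "detail": self.gateway.charge(amount)}
        except InjectedFault:
            return {"status": "queued", "detail": "payment deferred"}


# Test scenario piloting the fault while exercising the product.
faults = FaultSwitch()
checkout = Checkout(PaymentGateway(faults))

nominal = checkout.pay(10)
faults.arm("gateway.charge")
degraded = checkout.pay(10)
faults.disarm("gateway.charge")
restored = checkout.pay(10)

print(nominal["status"], degraded["status"], restored["status"])
```

Because the switch belongs to the design, the same scenario can be replayed automatically at every run instead of relying on a manual disruption.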

© Christophe Moustier - 2021