Voluntary disruption of the system
Active testing
Increase system resilience - e.g. Chaos Monkey at Netflix
The concept of voluntary disruption of the system
Traditionally, testing tries to find problems in the product itself, which is the concern of Shift Left Testing. With Shift Right Testing, quality is not just about the product but also about the system that delivers it. This topic comes up especially when the organization is concerned about operations and wants to see how the product and the organization handle events that can disrupt the system and challenge its robustness, in a word, its "resilience". A resilient system can be handled carelessly without breaking [Taleb 2012] and is able to face a “Black Swan” [Taleb 2010]. Like any other non-functional requirement (NFR), this level of robustness requires several kinds of tests.
To test this resilience, attempts to disrupt the system should take place in a production-like environment and also in production, for example by disabling certain parts and observing how the whole system responds. Even when an unexpected situation occurs, the expected quality characteristics of the product should not be significantly degraded.
In terms of Panarchy [Moustier 2020], testing the resilience of the system means pushing it into its Ω (release) phase and ensuring that it returns by itself to a K (conservation) phase in the shortest time and with the smallest loss of resources.
Impact on the testing maturity
DevOps promotes the paradigm “Fail big, fail fast, fail often” [Pontefract 2018][Shore 2004][Khanna 2015] for both Shift Left and Shift Right strategies because it reveals issues early, before customers complain about them; the more often issues are exposed this way, the less customers have to complain.
To achieve this, testers may inject faults. In software, Fault Injection Testing (FIT) can be performed at compile time, notably by changing the source code to simulate faults in the software system [Gillis 2019], or by overriding the behaviour of libraries [Marinescu 2009]. This highlights, for instance, how the invoked parts behave in an unexpected context [Shore 2004]. The expected underlying development strategy is “defensive programming” [McConnell 2004], which can be adapted to trigger the appropriate alert procedure.
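As an illustration of overriding a library's behaviour, here is a minimal Python sketch. It is not the LFI tool from [Marinescu 2009]; it only mimics the idea by replacing the built-in `open` with a version that always fails. The names `load_settings`, `load_with_injected_fault`, `alerts` and the file name `settings.ini` are hypothetical, and the list `alerts` stands in for whatever alert procedure the organization actually uses.

```python
import builtins
from unittest import mock

alerts = []  # hypothetical alert channel; a real system would log or page someone


def load_settings(path):
    """Defensive loader: falls back to defaults and raises an alert on I/O failure."""
    defaults = {"retries": 3}
    try:
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
    except OSError as error:
        # Defensive programming: the failure is reported, not propagated.
        alerts.append(f"settings unavailable: {error}")
        return defaults
    settings = dict(defaults)
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and value.strip().isdigit():
            settings[key.strip()] = int(value)
    return settings


def load_with_injected_fault(path):
    """Override the library call so that every open() fails, then call the loader."""
    fault = OSError("injected disk fault")
    with mock.patch.object(builtins, "open", side_effect=fault):
        return load_settings(path)


result = load_with_injected_fault("settings.ini")
```

With the fault injected, the loader is expected to return its defaults and to record exactly one alert, which is what a test would assert on.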
FIT can also occur at runtime: a tool injects a fault into the software while it is running, under specific conditions such as a set duration or a schedule [Gillis 2019]. Netflix's "Chaos Monkey" is a well-known example. This approach is also called "chaos engineering"; it creates turbulence and unexpected conditions to see how the system adapts to compensate for local failures.
Security threats are among the causes of disruption and lead to very specific kinds of testing. Penetration testing, commonly known as “pentesting”, is also a form of voluntary disruption of the system. These tests consist in trying to spot the weakest points of the system and eventually reaching an asset, to demonstrate how feasible it would be to do further harm.
The ultimate disruption that can happen to a system is a disaster, whether natural or not. To face such events, mature organizations provide Business Continuity Plans and Disaster Recovery Plans. Unfortunately, plans are just wishful thinking as long as they have not faced reality; moreover, disasters do not happen on a regular basis. To prepare the solution for such Black Swans, simulations and exercises are usually run to test the Business Continuity Management System (BCMS) [BS-EN-ISO-22301 2019] [US-NFPA 1600 2010].
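The runtime variant can be sketched in a few lines of Python. This is a toy simulation, not Chaos Monkey's actual implementation or API: the classes `Service`, `Router` and `ChaosInjector`, the tick-based schedule and the seed are all assumptions made for the example. The injector disables one random replica per tick inside a scheduled window, and the check is that the router keeps serving every request.

```python
import random


class Service:
    """Hypothetical replica of a service; 'alive' is toggled by the injector."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Router:
    """Tries each replica in turn so that one local failure stays invisible."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no replica available")


class ChaosInjector:
    """Disables one random replica per tick while inside the scheduled window."""

    def __init__(self, replicas, start_tick, end_tick, seed=0):
        self.replicas = replicas
        self.window = range(start_tick, end_tick)
        self.rng = random.Random(seed)

    def tick(self, now):
        for replica in self.replicas:  # heal the fault from the previous tick
            replica.alive = True
        if now in self.window:
            victim = self.rng.choice(self.replicas)
            victim.alive = False
            return victim.name
        return None


replicas = [Service("a"), Service("b"), Service("c")]
router = Router(replicas)
chaos = ChaosInjector(replicas, start_tick=2, end_tick=5)

served, victims = 0, []
for now in range(7):
    victims.append(chaos.tick(now))
    router.handle(f"req-{now}")  # would raise if the system could not compensate
    served += 1
```

Here the faults are local (one replica at a time), so all seven requests should be served. Disabling every replica at once would instead exercise the failure path of the router.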
Agilitest’s standpoint on this practice
As a test script automation platform, Agilitest may support your attempts to disrupt the system. This could be done by running fault injection actions in parallel with the test scenario.
To fully automate such robustness testing scenarios, the design of the product should embed testability items that let the tests control the underlying parts, in line with the architecture of the fault injector. This leads to built-in quality [SAFe 2021-27].
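One possible shape for such a testability item is a named injection point that stays inert in normal operation and that a test scenario can arm on demand. The Python sketch below is an assumption made for illustration, not an Agilitest feature or API: `FaultPoints`, `fetch_price`, `quote` and the point name `pricing.fetch` are hypothetical.

```python
from contextlib import contextmanager


class FaultPoints:
    """Hypothetical built-in testability item: named points where faults can be armed."""

    def __init__(self):
        self._armed = {}

    @contextmanager
    def armed(self, point, error):
        """Arm a fault for the duration of a test step, then disarm it."""
        self._armed[point] = error
        try:
            yield
        finally:
            self._armed.pop(point, None)

    def check(self, point):
        """Called by product code; raises only if the point is currently armed."""
        error = self._armed.get(point)
        if error is not None:
            raise error


faults = FaultPoints()


def fetch_price(item):
    faults.check("pricing.fetch")  # injection point, inert unless armed
    return {"apple": 1.20}.get(item, 0.0)


def quote(item):
    """Product behaviour under test: degrade to a fallback price instead of failing."""
    try:
        return {"item": item, "price": fetch_price(item), "degraded": False}
    except TimeoutError:
        return {"item": item, "price": 1.00, "degraded": True}


normal = quote("apple")
with faults.armed("pricing.fetch", TimeoutError("injected timeout")):
    disrupted = quote("apple")
recovered = quote("apple")
```

The automated scenario can then compare the three results: normal service before the fault, a degraded but valid answer while it is armed, and normal service again once it is disarmed.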
To go further
- [BS-EN-ISO-22301 2019] : British Standards Institute - 2019 - “Security and resilience. Business continuity management systems. Requirements.”
- [Gillis 2019] : Alexander S. Gillis - FEB 2019 - “Fault injection testing” - https://searchsoftwarequality.techtarget.com/definition/fault-injection-testing
- [Khanna 2015] : Rajat Khanna & Isin Guler & Atul Nerkar - MAR 2015 - “Fail Often, Fail Big, and Fail Fast? Learning from Small Failures and R&D Performance in the Pharmaceutical Industry” - https://www.researchgate.net/publication/276389377 or https://doi.org/10.5465/amj.2013.1109
- [Marinescu 2009] : Paul Marinescu & George Candea - SEP 2009 - “LFI: A Practical and General Library-Level Fault Injector” - https://www.researchgate.net/publication/224596722_LFI_A_Practical_and_General_Library-Level_Fault_Injector
- [McConnell 2004] : Steve McConnell - 2004 - “Code Complete - a practical handbook of software construction” - Microsoft Press - ISBN: 0-7356-1967-0
- [Moustier 2020] : Christophe Moustier - OCT 2020 - “Conduite de tests agiles pour SAFe et LeSS” - ISBN: 978-2-409-02727-7
- [Pontefract 2018] : Dan Pontefract - 15 SEP 2018 - “The Foolishness Of Fail Fast, Fail Often” - https://www.forbes.com/sites/danpontefract/2018/09/15/the-foolishness-of-fail-fast-fail-often/?sh=3900bb159d9b
- [SAFe 2021-27] : SAFe - FEB 2021 - “Built-in Quality” - https://www.scaledagileframework.com/built-in-quality/
- [Shore 2004] : Jim Shore - OCT 2004 - “Fail Fast” - https://www.martinfowler.com/ieeeSoftware/failFast.pdf
- [Taleb 2010] : Nassim Nicholas Taleb - 2010 - “The Black Swan: The Impact of the Highly Improbable” - ISBN: 9780141906201
- [Taleb 2012] : Nassim Nicholas Taleb - NOV 2012 - “Antifragile: Things That Gain From Disorder” - ISBN: 9781400067824
- [US-NFPA 1600 2010] : National Fire Protection Association - 2010 - “Standard on Disaster/Emergency Management and Business Continuity Programs”