Reliability testing

What is reliability testing?

Reliability Testing (RT) addresses one of the non-functional requirements described by ISO 25010. According to this standard, reliability is the degree to which a system, product or component performs specified functions under specified conditions for a specified period of time. It addresses questions such as how reliable the system should be when:

  • a user is using it to accomplish their task
  • it is updated with new content
  • it is maintained or ported
  • it is published

Therefore, reliability is mainly a matter of:

  • Maturity: The degree to which a system, product, component or quality feature meets reliability requirements under normal operating conditions.
  • Availability: The degree to which a system, product or component is operational and accessible when it is to be used - availability can be measured by the length of time the system, product or component is operational.
  • Fault tolerance: The degree to which a system, product or component performs as expected despite the presence of hardware or software failures.
  • Recoverability: The degree to which, in the event of an interruption or failure, a system can recover directly affected data and restore the desired state of the system - recovery time is also part of this definition.

to which other factors such as security (including confidentiality and integrity), maintainability, durability, and maintenance support can be added.

In the hardware industry, many reliability models are based on the differences between units and on performance drift caused by material fatigue over time [Elsayed 2012].

Reliability tests are not meant to validate a product, which would require a failure-free simulation process. They are rather a screening process that uses stimulation to expose latent defects in products that would otherwise fail in the field [Dodson 2006]. Ideally, on an assembly line, this screening process would be applied to every unit, but a trend in unit reliability can emerge from samples. This sampling approach provides a reliability ratio together with a confidence rate: the bigger the sample, the more confident the reliability ratio.

Example of reliability demonstration sample sizes [Dodson 2006]
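
The sample sizes in such tables follow from the classical success-run theorem: demonstrating a reliability R with a confidence C and zero failures requires n ≥ ln(1-C)/ln(R) units. A minimal sketch in Python (the function name is illustrative):

```python
import math

def demo_sample_size(reliability: float, confidence: float) -> int:
    """Smallest zero-failure sample size n such that
    1 - reliability**n >= confidence (success-run theorem)."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# Demonstrating 90% reliability with 90% confidence takes 22 units;
# raising the target to 95% reliability pushes the sample to 45 units.
print(demo_sample_size(0.90, 0.90))  # 22
print(demo_sample_size(0.95, 0.90))  # 45
```

This matches the intuition above: for a given reliability target, more confidence requires a bigger sample.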

Reliability engineering helps to predict software quality by using probability theory and statistical analysis with a set of dedicated techniques and models [Moharil 2019].

This approach works slightly differently in the software industry. According to some, it is not applicable to software products [IEEE 24765:2010][ISO 25010 2011], probably because digital data and products are easily cloned, and because frequent upgrade versions prevent stable statistical models from emerging. Therefore, software reliability is described as the probability of failure-free operation of the software for a given period of time in a given environment; likewise, the reliability of the system is its ability to perform the required operations or functions under given conditions for a specified period of time [Moharil 2019]. This time-domain approach to reliability is known as “Software Reliability Growth Models” (SRGMs). It is used to assess current and future reliability; it can also serve as an exit criterion to stop testing, or to estimate the time or resources needed to reach a reliability target [Tian 2005][Moharil 2019]. Reliability can also be perceived from an input-domain perspective: “Input Domain Reliability Models” (IDRMs) analyze input states and failure data, which provides valuable information relating input states to reliability [Tian 2005][Moharil 2019].
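
As an illustration of a time-domain SRGM, here is a sketch of the classical Goel-Okumoto model, in which the expected cumulative number of failures by test time t is a(1 - e^(-bt)); the parameter values below are purely illustrative:

```python
import math

def go_mean_failures(a: float, b: float, t: float) -> float:
    """Goel-Okumoto SRGM: expected cumulative failures by test time t,
    where a is the total expected defect count and b the detection rate."""
    return a * (1 - math.exp(-b * t))

def go_reliability(a: float, b: float, t: float, x: float) -> float:
    """Probability of failure-free operation during (t, t + x]."""
    return math.exp(-(go_mean_failures(a, b, t + x) - go_mean_failures(a, b, t)))

a, b = 100.0, 0.05  # illustrative: 100 latent defects, 5%/hour detection rate
print(go_mean_failures(a, b, 40))   # defects expected after 40 h of testing
print(go_reliability(a, b, 40, 8))  # chance of a failure-free 8 h mission
```

Such a model can back an exit criterion: stop testing once the predicted reliability over the next mission time reaches the target.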

Reliability is demonstrated over time, and testing helps to accelerate this demonstration. However, experts suggest it is impossible to accelerate a test by more than a factor of 10 without losing some correlation to real-world conditions; testing is therefore a balance between science and judgment [Dodson 2006].

Impact on the testing maturity

When it comes to reliability, not only the delivered software should be evaluated but also the services delivered around it. The easiest way to grasp this holistic point of view is the customer’s. Say a bug appears in production and stays unsolved for a while: end users will inevitably judge the reliability of the system over the period from the moment the issue appears until it is solved, the Mean Time To Recovery (MTTR). Moreover, if issues appear too often, they cause annoyance even when quickly solved; the average bugless period is the “Mean Time Between Failures” (MTBF). MTTR and MTBF are to be closely monitored; Google’s SRE pays specific attention to these availability metrics [Beyer 2016]. To enable real-time monitoring of these metrics, sensors should be coded within the product and linked to monitoring tools.
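
As a sketch of how these metrics relate, assuming a hypothetical incident log of (failure, recovery) timestamps collected by such sensors:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure detected, service recovered)
incidents = [
    (datetime(2021, 3, 1, 9, 0),  datetime(2021, 3, 1, 9, 30)),
    (datetime(2021, 3, 5, 14, 0), datetime(2021, 3, 5, 16, 0)),
    (datetime(2021, 3, 9, 8, 0),  datetime(2021, 3, 9, 8, 45)),
]

# MTTR: mean duration from failure to recovery
mttr = sum((up - down for down, up in incidents), timedelta()) / len(incidents)

# MTBF: mean failure-free time, from one recovery to the next failure
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

# Steady-state availability estimate
availability = mtbf / (mtbf + mttr)
print(mttr, mtbf, round(availability, 4))
```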

Unfortunately, even though this sensor-based approach is essential to handle negative impacts as soon as issues appear, it is a Shift-Right testing technique; it should therefore be complemented with Shift-Left techniques in order to be proactive. For instance, as a rule of thumb, the number of issues found divided by the test campaign duration can be used to assess MTBF improvements. The genuine MTBF cannot be derived from test campaigns because the distribution of use cases run in vitro differs from the in vivo one [Dodson 2006]: a typical test campaign contains few standard situations and many corner cases, while real-life usage consists mostly of well-known situations; moreover, those corner cases are more likely to appear after a while rather than after a couple of uses. To accurately model field reliability, test cases should match feature usage frequency [Dodson 2006], which is economically hard to achieve.

It is generally admitted that “limitations in reliability are due to faults in requirements, design, and implementation” [IEEE 24765:2010][ISO 25010 2011], but the whole system must be considered. For instance, in an SLA/SLO approach, an alternative is to account for worst-case situations [Dodson 2006] to prevent penalties; this can then be combined with an Error Budget stratagem. From a customer’s perspective, the issue recovery delay is paramount since it drives the availability of the system, which is mainly measured through the MTTR. Improving MTTR means aiming both for high resilience of the product to issues and for an efficient organization handling those issues, since fixing bugs introduces delays. Whenever one part of the whole system fails, the MTTR is likely to be impacted, and the Theory of Constraints should be applied to manage the flow and the balance between those parts. This holistic view suggests software-based systems are actually subject to fatigue, notably through the human organization that takes care of the product; moreover, our daily experience with personal computers shows that software systems wear out because they are not finite automata and entropy slowly alters them.
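
The Error Budget stratagem mentioned above can be sketched as follows (the SLO value and function name are illustrative):

```python
def error_budget(slo: float, period_minutes: float) -> float:
    """Allowed downtime (in minutes) over the period under a given SLO."""
    return period_minutes * (1 - slo)

# Illustrative: a 99.9% availability SLO over a 30-day month
budget = error_budget(0.999, 30 * 24 * 60)
consumed = 12.0  # minutes of downtime observed so far (hypothetical)
print(f"budget={budget:.1f} min, remaining={budget - consumed:.1f} min")
```

While the budget lasts, the team may spend it on risky releases; once it is exhausted, reliability work takes priority over new features.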

In an agile delivery process, reliability testing should be applied at least to Product Increments, as per the “Agile testing quadrants” [Marick 2003][Bach 2014][Crispin 2021]. This approach should help to define the NFRs applicable to a given user story or to the whole product, which would lead to introducing corresponding criteria in the Definition of Done (DoD).

Agilitest’s standpoint on this practice

As seen previously, “wear or aging does not occur in software” [IEEE 24765:2010][ISO 25010 2011], but when it comes to testing, test cases become less and less effective as per the pesticide paradox principle [ISTQB 2018], and thus less reliable at revealing issues. For test scripts, this aging effect is even more obvious because automation leads to multiple and frequent runs. Under those circumstances, building abacuses on test flakiness, similar to Dodson’s sample sizes shown above, should become quite relevant for a given organization.
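
Such a flakiness abacus can be sketched with basic probability, under the simplifying assumption that each run of a flaky (but functionally green) test passes independently with a fixed rate:

```python
import math

def false_alarm_rate(pass_rate: float, runs: int) -> float:
    """Probability that a flaky test raises at least one
    spurious failure over a campaign of `runs` executions."""
    return 1 - pass_rate ** runs

def retries_needed(pass_rate: float, target: float) -> int:
    """Retries per run so that the chance of *all* attempts
    failing spuriously drops below `target`."""
    return math.ceil(math.log(target) / math.log(1 - pass_rate))

print(false_alarm_rate(0.99, 100))  # a 1%-flaky test alarms in most 100-run campaigns
print(retries_needed(0.95, 0.001))  # retries pushing false alarms below 0.1%
```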

To improve test script reliability, removing “code smells” (coding antipatterns) through refactoring techniques [Fowler 1999] appears to reduce test flakiness [Palomba 2017]; regarding test cases, test smells should also be removed. Here is a first list of test smells that call for refactoring [Deursen 2001]:

  • Smell 1: “Mystery Guest” - the script relies on an external resource
  • Smell 2: “Resource Optimism” - the script makes optimistic assumptions about external resources
  • Smell 3: “Test Run War” - different people get different results when they run the tests
  • Smell 4: “General Fixture” - test setups are too generic, too hard to understand and take too long to run
  • Smell 5: “Eager Test” - the script tests too many things; it becomes hard to understand and makes tests more dependent on each other
  • Smell 6: “Lazy Test” - test scripts are too close to one another and assert nearly the same thing
  • Smell 7: “Assertion Roulette” - there is no rationale behind the script’s assertions; if the test fails, it is hard to know which assertion failed
  • Smell 8: “Indirect Testing” - testing becomes flaky when intermediate components are involved in asserting results
  • Smell 9: “For Testers Only” - production code is added to a component only for testing purposes while it should rather live in a dedicated part
  • Smell 10: “Sensitive Equality” - the expected result is asserted from part of the internal values
  • Smell 11: “Test Code Duplication” - the code in the scripts is duplicated

All these test smells, cumulated with code smells, degrade test reliability.

Another factor in test reliability is fault tolerance: the ability of a test to survive unexpected situations that do not defeat its objective. Giving this kind of autonomy to automation is known as Jidoka [Monden 2011], a Lean Management practice that handles incidents automatically to lower human interventions, notably on test scripts [Moustier 2020].
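
A Jidoka-style countermeasure can be sketched as a wrapper that applies an automated recovery to known transient incidents before escalating to a human; all names below are illustrative:

```python
import time

class TransientUIError(Exception):
    """An incident that does not defeat the test objective (e.g. a popup)."""

def with_jidoka(action, recover, attempts=3, delay=0.1):
    """Run `action`; on a known transient incident, apply the automated
    countermeasure `recover` and retry. Humans are alerted only when
    the countermeasure itself is exhausted (Jidoka-style autonomy)."""
    for attempt in range(attempts):
        try:
            return action()
        except TransientUIError:
            if attempt == attempts - 1:
                raise  # autonomy exhausted: escalate to a human
            recover()
            time.sleep(delay)

# Illustrative usage: the first click hits a popup, recovery dismisses it
state = {"popup": True}

def click_checkout():
    if state["popup"]:
        raise TransientUIError("unexpected popup")
    return "order placed"

def dismiss_popup():
    state["popup"] = False

print(with_jidoka(click_checkout, dismiss_popup))  # order placed
```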


To go further

  • [Fowler 1999]: Martin Fowler, Kent Beck - 1999 - “Refactoring: Improving the Design of Existing Code” - Addison-Wesley Professional - ISBN 0-201-48567-2
  • [IEEE 24765:2010]: Software & Systems Engineering Standards Committee of the IEEE Computer Society - 2010 - “ISO/IEC/IEEE 24765-2010(E), Systems and software engineering — Vocabulary”
  • [ISO 25010 2011]: British Standards Institution - 2011 - “Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — System and software quality models” - BS ISO/IEC 25010:2011
  • [ISTQB 2018]: ISTQB - 2018 - “Certified Tester Foundation Level Syllabus”
  • [Marick 2003]: Brian Marick - AUG 2003 - “Agile testing directions: tests and examples”
  • [Moharil 2019]: P. N. Moharil, S. Jena, V. M. Thakare - MAY 2019 - “Enhancement in Software Reliability Testing and Analysis”
  • [Monden 2011]: Yasuhiro Monden - OCT 2011 - “Toyota Production System: An Integrated Approach to Just-In-Time” - ISBN 9781439820971
  • [Moustier 2020]: Christophe Moustier - OCT 2020 - “Conduite de tests agiles pour SAFe et LeSS” - ISBN 978-2-409-02727-7
  • [Tian 2005]: Jeff Tian - FEB 2005 - “Software Quality Engineering: Testing, Quality Assurance, and Quantifiable Improvement” - ISBN 0-471-71345-7

© Christophe Moustier - 2021