Reliability Testing (RT) addresses reliability, one of the Non-Functional Requirements (NFRs) described by ISO 25010. According to this standard, reliability is the degree to which a system, product or component performs specified functions under specified conditions for a specified period of time. It addresses matters such as how reliable the system should be when
Therefore, reliability is mainly a matter of
to which other factors such as security (including confidentiality and integrity), maintainability, durability, and maintenance support can be added.
In hardware-based industries, many reliability models are based on unit-to-unit differences and on performance drift caused by material fatigue over time [Elsayed 2012].
RTs are not meant to validate a product, which would require a failure-free simulation process. Rather, they are a screening process that relies on stimulation to expose latent defects that would otherwise cause failures in the field [Dodson 2006]. Ideally, on an assembly line, this screening is applied to every unit, but a trend in unit reliability can also emerge from samples. Sampling yields both a reliability ratio and a confidence level: the larger the sample, the higher the confidence in the reliability ratio.
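The link between sample size, reliability ratio and confidence can be illustrated with a short sketch. It assumes the classical zero-failure ("success-run") relation 1 - C = R^n, which is one common way to size a demonstration test and not necessarily the method behind the figures cited from [Dodson 2006]; the function name is hypothetical.

```python
import math


def demonstration_sample_size(reliability: float, confidence: float) -> int:
    """Number of units that must all pass to demonstrate `reliability`
    with the given `confidence`, assuming a zero-failure test plan."""
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be strictly between 0 and 1")
    # Zero-failure relation: 1 - C = R**n, hence n = ln(1 - C) / ln(R)
    n = math.log(1.0 - confidence) / math.log(reliability)
    return math.ceil(n)


if __name__ == "__main__":
    for r, c in [(0.90, 0.90), (0.95, 0.90), (0.99, 0.95)]:
        print(f"R={r:.2f}, C={c:.2f}: {demonstration_sample_size(r, c)} units")
```

Under this assumption, raising either the target reliability or the required confidence increases the number of units to test, which is the trade-off described above.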
Example of reliability demonstration sample sizes [Dodson 2006]:
Reliability engineering helps to predict software quality by applying probability theory and statistical analysis through a set of techniques and models [Moharil 2019].
This approach differs somewhat in the software industry. According to some, it is not applicable to software products [IEEE 24765:2010][ISO25010], probably because digital data and products can be cloned easily, so there is no unit-to-unit variation. Moreover, a software system goes through numerous upgrade versions, which prevents stable statistical models from emerging. Therefore, software reliability is described as the probability of failure-free operation of the software for a given period of time in a given environment. Likewise, the reliability of the system is its ability to perform required operations or functions under given conditions for a specified period of time [Moharil 2019]. This time-domain approach to reliability is carried by “Software Reliability Growth Models” (SRGMs). They are used to assess current and future reliability. They can also serve as an exit criterion to stop testing, or to estimate the time or resources needed to reach a reliability target [Tian 2005][Moharil 2019]. Reliability can also be approached from an input-domain perspective: “Input domain reliability models” (IDRMs) analyze input states and failure data, which provides valuable information relating input states to reliability [Tian 2005][Moharil 2019].
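As an illustration of how an SRGM can serve as an exit criterion, here is a minimal sketch based on the Goel-Okumoto model, chosen here as an assumption because the text does not name a specific model. The parameters `a` (expected total number of failures) and `b` (detection rate) are hypothetical values; in practice they would be fitted to observed failure data.

```python
import math


def expected_failures(t: float, a: float, b: float) -> float:
    """Goel-Okumoto mean value function: m(t) = a * (1 - exp(-b * t))."""
    return a * (1.0 - math.exp(-b * t))


def failure_intensity(t: float, a: float, b: float) -> float:
    """Instantaneous failure rate: lambda(t) = a * b * exp(-b * t)."""
    return a * b * math.exp(-b * t)


def reliability(x: float, t: float, a: float, b: float) -> float:
    """Probability of no failure during (t, t + x] after testing for t."""
    return math.exp(-(expected_failures(t + x, a, b) - expected_failures(t, a, b)))


def time_to_reach_intensity(target: float, a: float, b: float) -> float:
    """Testing time needed for the failure intensity to drop to `target`."""
    if target >= a * b:
        return 0.0  # the target is already met before any testing
    return math.log(a * b / target) / b


if __name__ == "__main__":
    a, b = 100.0, 0.05  # hypothetical fitted parameters, time in hours
    t_exit = time_to_reach_intensity(0.5, a, b)
    print(f"stop testing after about {t_exit:.1f} h")
    print(f"reliability over the next 1 h: {reliability(1.0, t_exit, a, b):.3f}")
```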
Reliability is something that is demonstrated over time and testing helps to accelerate this demonstration. However, experts suggest it is impossible to accelerate a test by more than a factor of 10 without losing some correlation to real-world conditions; therefore, testing is a balance between science and judgment [Dodson 2006].
When it comes to reliability, not only the delivered software should be evaluated but also the services delivered around it. The easiest way to grasp this holistic point of view is to take the Customers’ perspective. Say a bug appears in production and stays unsolved for a while: End Users will inevitably judge the reliability of the system by the time elapsed from the moment the issue appears until it is solved, which on average is the Mean Time To Recovery (MTTR). Moreover, if issues appear too often, they may cause annoyance even when quickly solved; the average bug-free period between two issues is named “Mean Time Between Failures” (MTBF). The MTTR and MTBF values are to be closely monitored. Google’s SRE practice pays particular attention to these availability metrics [Beyer 2016]. To enable real-time monitoring of those metrics, sensors should be coded within the product and linked with monitoring tools.
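A minimal sketch of how MTTR, MTBF and availability could be derived from an incident log follows. The `Incident` record and function names are hypothetical, and MTBF is taken here as total uptime divided by the number of failures, which is one common convention among several.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Sequence


class Incident(NamedTuple):
    started: datetime   # moment the issue appears
    resolved: datetime  # moment the issue is solved


def total_downtime(incidents: Sequence[Incident]) -> timedelta:
    return sum((i.resolved - i.started for i in incidents), timedelta())


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean Time To Recovery: average duration of an incident."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: Sequence[Incident], window_start: datetime,
         window_end: datetime) -> timedelta:
    """Mean Time Between Failures: uptime in the window per failure."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value: timedelta, mttr_value: timedelta) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)
```

Both functions expect at least one incident in the observed window; a real monitoring pipeline would also need to handle incidents that are still open.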
Unfortunately, even if this sensor-based approach is essential to handle negative impacts as soon as issues appear, it is a Shift Right Testing technique; it should therefore be complemented with some Shift Left techniques in order to be proactive. For instance, as a rule of thumb, the number of issues found divided by the test campaign duration can be used to assess MTBF improvements. This is only a trend indicator, however: the genuine MTBF cannot be derived from test campaigns because the distributions of use cases run in vitro and in vivo are not the same [Dodson 2006]. A typical test campaign embeds a few standard situations against a bunch of unusual cases, whereas real-life usage consists mostly of well-known situations; moreover, corner cases are more likely to appear after a while than after a couple of uses. To model field reliability accurately, test cases should match feature usage frequency [Dodson 2006], which is economically hard to achieve.
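Matching test cases to feature usage frequency can be approximated by splitting a test budget proportionally to observed usage. The sketch below is only an illustration: the feature names and usage counts are hypothetical, and the largest-remainder rounding is one possible choice.

```python
def allocate_test_runs(usage_counts: dict[str, int], budget: int) -> dict[str, int]:
    """Split `budget` test runs across features in proportion to usage.

    Uses integer arithmetic and the largest-remainder method so that the
    allocated runs always add up to exactly `budget`.
    """
    total = sum(usage_counts.values())
    if total <= 0 or budget < 0:
        raise ValueError("usage counts must sum to a positive number and budget must be >= 0")
    # Whole runs per feature, then the remainder left over by the division.
    runs = {f: budget * c // total for f, c in usage_counts.items()}
    remainders = {f: budget * c % total for f, c in usage_counts.items()}
    leftover = budget - sum(runs.values())
    # Give the remaining runs to the features with the largest remainders.
    for feature in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        runs[feature] += 1
    return runs


if __name__ == "__main__":
    # Hypothetical production usage counts per feature.
    print(allocate_test_runs({"login": 600, "search": 300, "export": 100}, budget=7))
```

Such an allocation follows field usage more closely than a campaign dominated by corner cases, at the cost of spending most runs on well-known situations.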
It is generally accepted that “limitations in reliability are due to faults in requirements, design, and implementation” [IEEE 24765:2010][ISO25010], but the whole system must be considered. For instance, in an SLA/SLO approach, an alternative is to plan for worst-case situations [Dodson 2006] to prevent penalties. This can then be combined with an Error Budget strategy. From a Customer’s perspective, the issue recovery delay is very important since it drives the availability of the system, which can mainly be measured through the MTTR. Improving MTTR means aiming for high resilience to issues in both the product and the organization that handles those issues, since fixing a bug introduces delays. Whenever one part of the whole system fails, the MTTR is likely to be impacted, and the theory of constraints should be applied to manage the flow and the balance between those parts. This holistic view suggests that software-based systems are actually subject to fatigue, notably because of the human organization that takes care of the product. Our daily experience with personal computers also shows that software systems wear out, because they are not finite automata and entropy slowly alters them.
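The Error Budget idea mentioned above can be sketched as the downtime an availability SLO tolerates over a period. The 30-day period and the function names are assumptions made for the example.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Downtime (in minutes) tolerated by an availability SLO over the period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * period_days * 24 * 60


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             period_days: int = 30) -> float:
    """Budget left after the downtime already consumed; negative means overspent."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days tolerates roughly 43 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} min allowed")
    print(f"{remaining_budget_minutes(0.999, downtime_minutes=30):.1f} min left")
```

A single incident whose recovery takes longer than the remaining budget breaches the objective, which is why MTTR weighs so heavily in this approach.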
In an agile delivery process, reliability testing on the product should be applied at least to Product Increments, as per the “Agile testing quadrants” [Marick 2003][Bach 2014][Crispin 2021]. This approach should help to define the NFRs applicable to a given User Story (US) or to the whole product, which would lead to adding criteria to the Definition of Done (DoD).
As seen previously, “Wear or aging does not occur in software” [IEEE 24765:2010][ISO25010]; however, when it comes to testing, test cases become less and less effective as per the pesticide paradox principle [ISTQB 2018], and thus less reliable at revealing issues. For test scripts, the effect of aging on reliability is even more obvious because automation leads to multiple and frequent runs. Under those circumstances, building abacuses (reference charts) on test flakiness, similar to Dodson’s sample sizes shown above, should become quite relevant for a given organization.
To improve test script reliability, removing “code smells” (coding antipatterns) through refactoring techniques [Fowler 1999] appears to reduce test flakiness [Palomba 2017]; test smells in test cases should also be removed. Here is a first list of test smells that would need refactoring [Deursen 2001]:
All these test smells, combined with code smells, affect test reliability.
Another factor that contributes to test reliability is tolerance to unexpected situations that do not compromise the test objective. The strategy of giving some autonomy to automation is named Jidoka [Monden 2011], a practice from Lean Management that handles incidents automatically in order to reduce human intervention, notably on test scripts [Moustier 2020].
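One possible reading of this idea for test scripts is sketched below: a decorator that recovers from incidents unrelated to the test objective and retries, while letting genuine assertion failures through. The `IncidentError` class, the `self_healing` decorator and the `recover` callback are hypothetical names, not part of any cited framework.

```python
import functools
from typing import Callable


class IncidentError(Exception):
    """Hypothetical marker for an unexpected situation unrelated to the test objective
    (for example an unexpected popup or a transient network glitch)."""


def self_healing(recover: Callable[[IncidentError], None], max_attempts: int = 3):
    """Retry a test step after running `recover` whenever an incident occurs.

    AssertionError is deliberately not caught: a failed test objective
    must still fail the test.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return test_fn(*args, **kwargs)
                except IncidentError as incident:
                    if attempt == max_attempts:
                        raise  # give up and escalate to a human
                    recover(incident)  # e.g. close the popup, reconnect
        return wrapper
    return decorator
```

Keeping the recovery logic in one place also leaves a trace of every incident handled, which can feed flakiness statistics.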