Anomaly handling policy

Active testing

Identify the actual degrees of processing on the propagation of anomalies and their treatment

Policy to handle anomalies

At first glance, handling bugs seems obvious: they must be removed! However, complying with this simple anomaly handling policy is not that easy. From an economic point of view, the yearly cost of bugs in the US is estimated at $113,000,000,000 [Initial State 2014], an amount that would cover a baffling list of combined expenses such as ending hunger worldwide and buying famous IT companies. How can there be such a gap between the staggering cost of bugs and how little is visibly being done about them?

Firstly, testing appears to be a good answer in a Shift Left (SL) approach, but since exhaustive testing is not possible [ISTQB 2018], some bugs inevitably reach the production environment. Thus, SL must be combined with Shift Right (SR) techniques, which help development Teams react quickly to issues and handle the bugs reported by genuine End Users who give feedback on their experience.

However, mending the issues found is not enough. Removing only the visible effects amounts to sweeping dust under the carpet, and a preventive approach to bugs is clearly much more efficient. One way to reach a proactive strategy is to perform Root Cause Analysis (RCA), which aims to find the source of an issue and remove the root that causes it. This strategy can be even more efficient when combined with a Pareto analysis in order to focus on the 20% of bugs that cause 80% of the issues. This becomes possible when you start to differentiate between incidents (an event defined as an unplanned interruption to a service or reduction in the quality of a service [Axelos 2019]) and problems (a cause, or potential cause, of one or more incidents; a known error that has been analysed but has not been resolved [Axelos 2019]): since a problem is the root cause of one or more incidents, solving it prevents several incidents from occurring again. Nonetheless, RCA is not a silver bullet because:

  • it leads to long and complex analyses that require experts, action plans and documents; the amount of work probably grows exponentially with the complexity of the system
  • it is not easy to conduct - using a “5 Whys” or an Ishikawa diagram can lead to false hypotheses or out-of-reach solutions (after all, everything is supposedly caused by some Great Architect of the Universe!)
  • causes may loop - then there is no root but a system of multiple intertwined causes [Appelo 2010], especially in complex systems [Snowden 2007]
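The Pareto analysis mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the incident list, the cause labels and the `pareto_causes` helper are hypothetical, and it presumes each incident has already been linked to a single problem, which the caveats above show is not always possible.

```python
from collections import Counter


def pareto_causes(incident_causes, coverage=0.8):
    """Return the most frequent causes that together explain at least
    `coverage` of the incidents, as (cause, incident_count) pairs."""
    counts = Counter(incident_causes)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, n in counts.most_common():
        if covered / total >= coverage:
            break  # the causes kept so far already reach the target share
        selected.append((cause, n))
        covered += n
    return selected


# Hypothetical incident log: one entry per incident, labelled with its problem.
incidents = (
    ["config drift"] * 10
    + ["null input"] * 7
    + ["timeout"] * 2
    + ["race condition"]
)

for cause, n in pareto_causes(incidents):
    print(f"{cause}: {n} incidents")
```

Here two problems out of four account for 17 of the 20 incidents, so they would be the first candidates for an RCA.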

This last item can be explained by Lorenz’ Butterfly Effect [Lorenz 1963], which led Lorenz to state that a butterfly's wingbeat in Brazil can cause a tornado in Texas, in the sense that “over the years minuscule disturbances neither increase nor decrease the frequency of occurrence of various weather events such as tornados; the most that they may do is to modify the sequence in which these events occur” [Lorenz 1995]. That is to say, the wingbeat alone does not generate the tornado but it contributes to the catastrophic phenomenon. In complex systems, incidents and disasters are actually the consequence of a series of lapses that line up and let the event and its consequences go through to a bad ending [Reason 2006]; at each stage, the incident may grow large enough to become a visible anomaly or even a disaster [Moustier 2019-1]. This understanding shows that even if the situation seems hopeless, there are still options, such as solving minor problems [Appelo 2010], just as with the Broken Windows Theory, which aims to solve problems little by little. Hence the importance of running 5S on code, documentation and test cases, and of solving small issues as soon as they appear, as per a Kaizen approach.

To facilitate the handling of bugs and limit the complexity of the big mess-in-progress, Microsoft’s “Bug Cap” can be used [Bjork 2017]. Whenever the number of bugs exceeds an upper threshold, the team focuses on eliminating them until a lower threshold is reached, similar to a hysteresis phenomenon [Moustier 2020].

Hysteresis phenomenon related to MS’ bug cap [Moustier 2020]
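The two-threshold behavior described above can be sketched as a small state machine. This is a minimal illustration, not Microsoft's actual tooling: the `BugCap` class and the threshold values are assumptions chosen for the example.

```python
class BugCap:
    """Two-threshold (hysteresis) switch between feature work and bug fixing."""

    def __init__(self, upper, lower):
        if lower >= upper:
            raise ValueError("lower threshold must be below upper threshold")
        self.upper = upper
        self.lower = lower
        self.fixing_mode = False  # the team starts on feature work

    def update(self, open_bugs):
        """Record the current open-bug count; return True while the team
        should focus on eliminating bugs."""
        if open_bugs > self.upper:
            self.fixing_mode = True   # cap exceeded: stop and fix
        elif open_bugs <= self.lower:
            self.fixing_mode = False  # lower threshold reached: resume features
        # between the two thresholds the previous mode is kept (hysteresis)
        return self.fixing_mode


# Hypothetical thresholds and open-bug counts measured at each iteration.
cap = BugCap(upper=20, lower=5)
modes = [cap.update(n) for n in [10, 21, 15, 8, 5, 12, 20]]
print(modes)
```

The same count can lead to different decisions depending on history: 15 open bugs means "keep fixing" after the cap was exceeded, while 12 open bugs means "keep building" once the lower threshold has been reached.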

In the same spirit but at a higher level, managing Google’s Error Budget, generalized to technical debt, addresses the number of “Broken Windows” in any area and keeps the mess under relative control.

Apart from any technique to fix issues in a “more with less” mindset, culture also influences the way bugs are handled [Westrum 2004]. When the culture is favorable, inquiries and RCA are commonly used tactics. If the culture is not proactive enough, issues are at least solved globally, wherever they exist (e.g. inspecting every aircraft for an issue found on one airplane). In less efficient organizations, fixes may be made locally only. Further toward immaturity, defects may merely be mitigated through public relations that put the anomaly in context to reduce its impact. In his survey, Westrum also noted that some cultures go even further down by isolating the message that reports the issue; this step is named “Encapsulation”. Finally, the ultimate response is to deal with the problem by suppressing, harming or stopping whoever brings the anomaly to light, i.e. “shooting the messenger”. These last two extreme situations may seem insane, but something similar happened in a well-known company in the airline industry. In this company, "Bug Bashing Days" were organized: small gifts and candy were offered so that developers would eliminate problems in one day in a jolly good mood. After a while, top managers stopped the initiative because some managers “encapsulated” fixes until the event in order to gain a good ranking. When taking responsibility for a bug can burn someone's career, or in any other high-stakes situation, a "shoot the messenger" treatment becomes easier to imagine.

Anomaly Response Maturity Levels [Westrum 2004]

Impact on the testing maturity

Kaizen teaches us to solve small issues whenever they appear, provided you look for issues in the first place, as per the first testing principle “Testing shows presence of defects” [ISTQB 2018]! This leads to multiplying the types of tests run on the solution, notably with NFR to test or Monitoring of the solution, in order to be aware of issues as soon as they appear, or even to spot situations that lead to failures and thus limit unavailability.

Teams may also try to test the system's resilience with techniques and tools that voluntarily disrupt the system. When considering the system as a whole, you may also look at how Teams would react when an issue appears unexpectedly and select a “Reversal” testing strategy, as in security testing. In such situations, triggering an Andon and observing how well and how fast Teams gather around the issue like a US Task Force, notably thanks to an 8D process, can reveal how effectively and efficiently issues are handled. From this standpoint, it becomes culturally possible to properly “count” bugs in a product. In companies that disturb the system to improve its resilience, it may be acceptable not to fix known bugs. This enables the “mark and recapture” technique, which ecology uses to estimate the size of an animal population. This science provides several program mark models for accurate estimates [Hoffman 2015] [Fishbio 2020] [Hightower 1984].
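As an illustration of “mark and recapture” applied to bugs, the sketch below uses the two simplest estimators from ecology, Lincoln-Petersen and its Chapman variant. They are not necessarily the models discussed in the references above, and the scenario is an assumption: a first campaign finds and "marks" bugs that are deliberately left unfixed, then an independent second campaign is run and the overlap is counted. All figures are made up.

```python
def lincoln_petersen(marked, caught, recaptured):
    """Estimate the total population: N ~ marked * caught / recaptured."""
    if recaptured == 0:
        raise ValueError("no recapture: the estimate is undefined")
    return marked * caught / recaptured


def chapman(marked, caught, recaptured):
    """Chapman's variant, less biased on small samples and defined
    even when nothing is recaptured."""
    return (marked + 1) * (caught + 1) / (recaptured + 1) - 1


# Hypothetical campaigns on the same build.
marked = 20      # bugs found (and left unfixed) by the first campaign
caught = 15      # bugs found by the second, independent campaign
recaptured = 6   # bugs found by both campaigns

known = marked + caught - recaptured
estimate = lincoln_petersen(marked, caught, recaptured)
print(f"known bugs: {known}")
print(f"estimated total (Lincoln-Petersen): {estimate:.0f}")
print(f"estimated total (Chapman): {chapman(marked, caught, recaptured):.0f}")
print(f"estimated bugs still undiscovered: {estimate - known:.0f}")
```

The estimate is only as good as its assumptions: both campaigns must be independent and every bug must be roughly as easy to find as any other, which is rarely true of real defects.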

When bugs become hard to find internally, the organization can also use outsiders for bug discovery. This approach is called "Bug Bounty", often encountered in cybersecurity [Malladi 2019] or "Crowdtesting" for functional or compatibility testing. This bug handling policy is usually based on financial rewards related to the severity of the bug and would be particularly effective for hunting hard-to-find bugs [Bacchus 2017].

Agilitest’s standpoint on this practice

When test automation is systematically used to generate acceptance criteria in accordance with ATDD, bugs could be created automatically from the automation framework, but the false positive phenomenon usually puts Testers off this idea.

However, test script automation naturally prevents regression. This intrinsic property should lead Teams to automate scripts that check that no fixed bug reappears. This stratagem is named “Defect-Driven Testing” (DDT) [Richardson 2005] and should be part of a bug handling policy [Crispin 2009].
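A Defect-Driven Testing script can be as small as one automated check per fixed bug. The sketch below is a hypothetical example: the `parse_quantity` function, the bug number and the fix are invented for illustration, and the tests are written as plain functions that a runner such as pytest would collect.

```python
def parse_quantity(text: str) -> int:
    """Hypothetical production code: read a quantity typed by the user."""
    cleaned = text.strip()
    if not cleaned:
        # Fix for hypothetical bug #1234: an empty field used to raise ValueError.
        return 0
    return int(cleaned)


def test_bug_1234_empty_quantity_field():
    """Regression test written when bug #1234 was fixed; it fails if the bug returns."""
    assert parse_quantity("") == 0
    assert parse_quantity("   ") == 0


def test_regular_quantity_still_parses():
    """Guard against the fix breaking the nominal case."""
    assert parse_quantity(" 42 ") == 42
```

Naming the test after the bug keeps the link between the tracker and the test suite, so a later failure points directly at the defect that has reappeared.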

© Christophe Moustier - 2021