Error Budget


Balance low quality releases with operational circumstances



The concept of Error Budgeting (EB) has been popularized in the software industry by Google's DevOps vision called “Site Reliability Engineering” (SRE) [Beyer 2016].

EB was introduced to let the Product Owner or Product Manager (PO or PM) balance quality issues against situations where rapid delivery is mandatory, such as a quick fix in production to deal with a complete system outage. EB is therefore strongly linked to the availability of a service. This availability is often expressed as "the nines of availability": the more 9s there are, the higher the availability. For instance, an availability of 99% ("two nines") corresponds to a minimum availability of 23 hours 45 minutes and 36 seconds per day. To make this easier to read, the duration of unavailability is usually given instead (i.e. 14.40 minutes per day). Service unavailability is usually covered by service-level agreements (SLAs), with service-level objectives (SLOs) set to ensure the SLAs are met [Ross 2017]; the error budget is the amount of time you are willing to allow your systems to be out of service.
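The arithmetic behind the "nines" can be checked with a few lines of Python. This is only a minimal sketch: the function name `allowed_downtime_seconds` and the choice of a one-day period are assumptions made for illustration, not part of any standard.

```python
SECONDS_PER_DAY = 24 * 3600


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_DAY) -> float:
    """Return the unavailability tolerated over a period, in seconds.

    Hypothetical helper: the budget is simply the share of the period
    that the availability target does not cover.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_seconds * (100.0 - availability_percent) / 100.0


# Print the daily budget for two, three and four nines.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.2f} minutes of unavailability per day")
```

For 99%, the helper returns 864 seconds, i.e. the 14.40 minutes per day quoted above.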

Availability may be improved in different ways, such as replication, redundancy, auto-scaling, backups, and anything else that makes systems more robust. However, robustness implies being conservative, and therefore less innovation; on the other hand, no one will buy a system that does not provide value [Meléndez 2021] or innovative features while other products provide more. This is where EB comes into play, because it leaves room for both quality and innovation:

  • when innovation is created, EB is lowered
  • when consolidation is provided, EB is raised

As a prudent steward (a “good family man”) would, the Team should keep the EB positive, with enough margin to handle severe issues whose fixes may themselves introduce flaws. EB is therefore related to risk management because it helps teams manage risk in order to [Watts 2020][Jankovski 2021]:

  • Get the most out of their resources
  • Improve service quality without sacrificing too much reliability
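The idea of keeping the EB positive with a safety margin can be sketched as a small ledger. Everything here is hypothetical: the class `ErrorBudget`, its field names and the rule "only take risks while the remaining budget exceeds the reserve" are assumptions for illustration, and only budget consumption is modelled.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical ledger for one SLA period; all names are illustrative."""

    allowed_minutes: float         # total unavailability tolerated over the period
    reserve_minutes: float         # margin kept aside for severe incidents (assumed policy)
    consumed_minutes: float = 0.0  # unavailability already observed

    def record_unavailability(self, minutes: float) -> None:
        """Consume part of the budget after an outage or a bad release."""
        if minutes < 0:
            raise ValueError("minutes must be non-negative")
        self.consumed_minutes += minutes

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.consumed_minutes

    def allows_risky_release(self) -> bool:
        """True while the remaining budget still exceeds the reserve."""
        return self.remaining_minutes > self.reserve_minutes


# 99% availability over 30 days tolerates 432 minutes of unavailability.
budget = ErrorBudget(allowed_minutes=432.0, reserve_minutes=60.0)
budget.record_unavailability(90.0)
print(budget.remaining_minutes, budget.allows_risky_release())
```

Keeping the reserve explicit makes the trade-off visible: once the remaining budget falls to the reserve, the sketch suggests favoring consolidation over new risk.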

The EB helps organizations focus on the customer, organize around value and adopt an economic view.

Moreover, to ensure SLAs are met, Team members should collaborate by paying specific attention to the relevant indicators and holding them to a stricter threshold than the SLA: the SLO. Whenever the SLO threshold is crossed, an Andon can be raised, which may eventually lead to a task force, for instance in the 8D style.
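The relationship between the SLA and the stricter SLO can be illustrated as follows. This is a sketch under stated assumptions: the function `check_availability`, its return labels and the example thresholds are invented for illustration and do not come from any monitoring tool.

```python
def check_availability(measured: float, slo: float, sla: float) -> str:
    """Classify a measured availability (in percent) against the SLO and SLA.

    Hypothetical helper: the SLO is assumed to be at least as strict as
    the SLA, so that crossing it gives an early warning.
    """
    if slo < sla:
        raise ValueError("the SLO is expected to be at least as strict as the SLA")
    if measured < sla:
        return "sla_breached"  # the contractual commitment is already broken
    if measured < slo:
        return "raise_andon"   # early warning: act before the SLA is at risk
    return "ok"


# Example with an assumed SLA of 99.5% and an internal SLO of 99.9%.
print(check_availability(measured=99.7, slo=99.9, sla=99.5))
```

The gap between the two thresholds is the margin in which the Team can react before the commitment made to the Customer is affected.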

If the Teams are already used to working in task force mode, they will naturally collaborate efficiently on the issue that threatens the service objectives. By doing so, Teams demonstrate that they are responsible for the quality of the product.

Impact on the testing maturity

EB was not invented by Google. The term has long been used in the design of complex aerospace systems, a domain that requires extreme precision [Briggs 2008]. In that context, what is managed differs somewhat from SLAs and Customer satisfaction, since it relates to the accuracy of a spacecraft; however, the concept remains the same: controlling the tolerance granted to some parts of the system to ease delivery without compromising the result. In the spacecraft domain, EB is introduced as early as the design phase and covers multiple factors such as temperature, vibrations, the materials involved, joint modeling approximations, etc. This EB is a preventive approach to the errors the spacecraft will face, which seems smarter than merely observing a loss of quality and fixing it afterward.

Thus, in a “Prevent bugs” approach applied to the software industry, many factors may affect not only the availability of the service but also its accuracy or any other quality aspect a Customer could expect from a digital service. Each of these criteria deserves appropriate engineering techniques:

  • Load capacity
  • Capacity to deliver new features
  • Service accuracy
  • Service availability

These factors may vary depending on the situation, the time and, above all, the competitive environment [Seth 2004]. They will usually be expressed as Non-Functional Requirements (NFR) or as any Shift Left testing matter that deserves testing and anticipation [Meléndez 2021]. Naturally, these Shift Left items can also be complemented by Shift Right items, and possibly by rehearsals of extreme situations such as a server or network collapse, or even the execution of Disaster Recovery Plans.

EB must also take into account a factor that is less visible than SLAs but that slyly interferes with a team's reactivity and its ability to process requests: technical debt, which slows down and eventually blocks the release of new features and bug fixes. We may therefore infer that technical debt is linked to the error budget [Hartmann 2020][Howard 2021][Gurumoorthi 2020], just as the EB can be used to repay the technical debt. Moreover, remediation actions should be planned on a regular basis to avoid huge efforts made in a hurry, which introduce extra issues: “Today's problems come from yesterday's solutions” [Senge 2010].

Technical Debt is repaid by the Error Budget

Thus, to avoid EB drops due to poor releases, the Team should define a deployment strategy such as Dark Launch, Canary Releasing or Blue/Green deployment, all of which help deploy a new release carefully. Traffic analysis can also be used to limit the impact of a bad release on the EB, by scheduling deployments outside busy hours [Climent 2020].

Strong automation and small delivery batches (an incremental approach) also reduce the impact on EB because they enable fast fix deployments and reduce the risk of huge negative impacts.

Whatever your list of criteria, the eternal difficulty lies in the dilemma between urgency and importance. As a rule of thumb, you can simply keep in mind that important issues should be addressed continuously and incrementally with baby steps...

Moreover, in a Scrum Team, while Developers are responsible for maintaining the EB at a fair level, the Product Owner (PO) will be accountable for it; therefore, the PO should always ensure enough space is given to Developers to let them maintain the EB at a reasonable level.

Agilitest’s standpoint on this practice

Automation reduces the transaction cost [SAFe 2021-06] and avoids introducing regression-based errors in production, thus reducing the consumption of the EB. Whenever some automation is skipped, it generates “automation debt” that could also be counted against the EB.

However, automation should also keep an eye on:

  • its own technical debt [Wiklund 2012] 
  • impact on the delivery flow, notably if scripts generate false positives

Fortunately, thanks to Agilitest's #nocode approach [Forsyth 2021], the number of so-called “flaky tests” and the amount of technical debt are kept lower than with standard coding.

© Christophe Moustier - 2021