The concept of error budget
The concept of Error Budgeting (EB) was popularized in the software industry by Google's approach to DevOps, known as “Site Reliability Engineering” (SRE) [Beyer 2016].
EB was introduced to let the Product Owner or Product Manager (PO or PM) balance quality concerns against situations where rapid delivery is mandatory, such as a quick fix in production to deal with a complete system outage. EB is therefore strongly linked to the availability of a service, which is often expressed as "the nines of availability": the more 9s, the higher the availability. For instance, an availability of 99% ("two nines") corresponds to a minimum availability of 23 hours 45 minutes and 36 seconds per day. To make this easier to read, we state the duration of unavailability instead (i.e. 14.4 minutes, or 14 minutes 24 seconds, per day). Service unavailability is usually covered by service-level agreements (SLAs), with service-level objectives (SLOs) set to make sure the SLAs are met [Ross 2017]; the error budget is the amount of time you are willing to allow your systems to be out of service.
Availability can be improved in different ways, such as replication, redundancy, auto-scaling, backups, and anything else that makes systems more robust. However, robustness implies being conservative, which reduces innovation; on the other hand, no one will buy a system that provides no value [Meléndez 2021] or no innovative features while competing products provide more. This is where EB comes into play, because it leaves room for both quality and innovation:
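The conversion between an availability target and the tolerated unavailability is simple arithmetic, and can be sketched as follows. This is a minimal illustration, not a standard tool: the function name `allowed_downtime` and the list of targets are assumptions chosen for the example.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime tolerated over `period` for an availability target.

    `allowed_downtime` is a hypothetical helper written for this article.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


DAY = timedelta(days=1)

# Print the daily downtime tolerated by a few common "nines" targets.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime(target, DAY)} of downtime per day")
```

For the 99% target, the result is the 14 minutes 24 seconds per day mentioned above; the same function can be called with a 30-day or 365-day period to express the budget per month or per year.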
- when innovation is created, EB is lowered
- when consolidation is provided, EB is raised
and, like a prudent head of household managing the family budget, the Team should keep the EB positive, with enough margin to handle severe issues whose fixes may introduce flaws of their own. EB is therefore related to risk management because it helps teams manage risk in order to [Watts 2020][Jankovski 2021]:
- Get the most out of their resources
- Improve service quality without sacrificing too much reliability
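Keeping the EB positive with some margin can be pictured as a small ledger. The sketch below is only an illustration under stated assumptions: the `ErrorBudget` class, the 30-day window and the 25% safety margin are hypothetical and are not taken from the cited sources.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical ledger of tolerated downtime over a fixed window."""

    slo_percent: float            # assumed availability objective, e.g. 99.0
    window_minutes: float         # assumed observation window, e.g. 30 days
    consumed_minutes: float = 0.0

    @property
    def total_minutes(self) -> float:
        # Downtime tolerated by the objective over the whole window.
        return self.window_minutes * (100 - self.slo_percent) / 100

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.consumed_minutes

    def record_downtime(self, minutes: float) -> None:
        if minutes < 0:
            raise ValueError("downtime cannot be negative")
        self.consumed_minutes += minutes

    def can_take_risk(self, safety_margin: float = 0.25) -> bool:
        """True while more than `safety_margin` of the budget is left."""
        return self.remaining_minutes > self.total_minutes * safety_margin


budget = ErrorBudget(slo_percent=99.0, window_minutes=30 * 24 * 60)
budget.record_downtime(300)  # e.g. incidents caused by risky releases
print(budget.remaining_minutes, budget.can_take_risk())
```

The safety margin is the design choice that matters here: a team that only reacts when the budget reaches zero has nothing left to absorb a severe incident.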
Moreover, to meet the SLAs, Team members should collaborate and pay specific attention to the relevant indicators, tracked against a stricter threshold that leaves some margin: the SLO. Whenever the SLO threshold is reached, an Andon can be raised, which may eventually lead to a task force, for instance in the 8D style.
If the Teams are already used to working in task force mode, they will naturally collaborate efficiently on the issue that threatens the service objectives. By doing so, Teams demonstrate that they are responsible for the quality of the product.
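The relationship between the SLA, the stricter SLO and the Andon signal can be sketched as a simple check. This is a hedged illustration: the `Status` values, the function `check_availability` and the thresholds used in the example are assumptions, not part of any cited framework.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    ANDON = "andon"                # SLO missed, SLA still met: raise the alarm
    SLA_BREACHED = "sla breached"  # the contractual level itself is missed


def check_availability(measured: float, slo: float, sla: float) -> Status:
    """Compare a measured availability (in %) with the SLO and the SLA.

    Hypothetical helper: the SLO is assumed to be at least as strict as the SLA.
    """
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if measured < sla:
        return Status.SLA_BREACHED
    if measured < slo:
        return Status.ANDON
    return Status.OK


# Example with an assumed SLA of 99.5% and an assumed SLO of 99.9%.
for measured in (99.95, 99.7, 99.0):
    print(measured, check_availability(measured, slo=99.9, sla=99.5).value)
```

The gap between the two thresholds is the time window in which a task force can act before the Customer-facing commitment is actually broken.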
Impact on the testing maturity
EB is not something created by Google. The term has long been used in the design of complex aerospace systems, a domain that requires extreme precision [Briggs 2008]. In this context, what is managed differs somewhat from SLAs and Customer satisfaction, since it relates to the accuracy of a spacecraft; however, the concept remains the same: controlling the tolerance granted to some parts of the system in order to facilitate delivery without compromising the result. In the spacecraft domain, EB is introduced at least as early as the design phase and covers multiple factors such as temperature, vibrations, the materials involved, joint modeling approximations, etc. This EB is a preventive approach to the errors the spacecraft will face, which seems smarter than simply observing a loss of quality and fixing it afterward.
Thus, in a “Prevent bugs” approach applied to the software industry, many factors may affect not only the availability of the service but also its accuracy or any other quality aspect a Customer could expect from a digital service. Each of these criteria deserves appropriate engineering techniques:
- Load capacity
- Capacity to deliver new features
- Service accuracy
- Service availability
Those factors may vary depending on the situation, the time and, most of all, the competitive environment [Seth 2004]. They will usually be expressed as Non-Functional Requirements (NFR) or as any Shift Left testing matter that deserves testing and anticipation [Meléndez 2021]. Naturally, those Shift Left items can also be complemented by Shift Right items, possibly including rehearsals of extreme situations such as a server or network collapse, or even the execution of Disaster Recovery Plans.
EB must also take into account a factor that is less visible than the SLAs but that slyly interferes with a team's reactivity and its ability to process requests: technical debt, which slows down and eventually blocks the release of new features and bug fixes. We may therefore infer that technical debt is linked with the error budget [Hartmann 2020][Howard 2021][Gurumoorthi 2020], just as pressure on the EB can in turn accelerate technical debt. Moreover, remediation actions should be planned on a regular basis, to avoid huge efforts made in a hurry that introduce extra issues: “Today's problems come from yesterday's solutions” [Senge 2010].
Thus, to avoid EB drops caused by poor releases, the Team should define a deployment strategy such as Dark Launch, Canary Releasing or Blue/Green deployment, all of which are useful for deploying a new release carefully. Traffic analysis can also be used to limit the impact of a bad release on the EB, by scheduling deployments outside busy hours [Climent 2020].
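As an illustration of how a Canary Release can protect the EB, the sketch below decides whether to promote or roll back a canary by comparing its error rate with the current baseline. It is a simplified example under assumptions: the function `canary_decision`, the tolerance and the minimum sample size are hypothetical and would have to be tuned for a real service.

```python
def canary_decision(
    baseline_error_rate: float,
    canary_errors: int,
    canary_requests: int,
    tolerance: float = 0.005,   # assumed acceptable extra error rate
    min_requests: int = 500,    # assumed minimum sample before deciding
) -> str:
    """Return "wait", "promote" or "rollback" for a canary release."""
    if canary_requests < min_requests:
        # Not enough traffic yet to judge the new release.
        return "wait"
    canary_error_rate = canary_errors / canary_requests
    if canary_error_rate > baseline_error_rate + tolerance:
        # The canary burns the error budget faster than the current release.
        return "rollback"
    return "promote"


print(canary_decision(0.01, canary_errors=3, canary_requests=100))
print(canary_decision(0.01, canary_errors=5, canary_requests=1000))
print(canary_decision(0.01, canary_errors=40, canary_requests=1000))
```

Because only a fraction of the traffic reaches the canary, a bad release consumes a correspondingly small share of the EB before it is rolled back.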
Whatever your list of criteria, the eternal difficulty lies in the dilemma between urgency and importance. As a rule of thumb, you can simply keep in mind that important issues should be addressed continuously and incrementally with baby steps...
Moreover, in a Scrum Team, while Developers are responsible for maintaining the EB at a fair level, the Product Owner (PO) will be accountable for it; therefore, the PO should always ensure enough space is given to Developers to let them maintain the EB at a reasonable level.
Agilitest’s standpoint on this practice
Automation reduces the transaction cost [SAFe 2021-06] and avoids introducing regression-based errors in production, thus reducing the consumption of the EB. Whenever some automation is skipped, this should be recorded as “automation debt”, which could also be counted against the EB.
However, automation also comes with its own concerns to manage:
- its own technical debt [Wiklund 2012]
- impact on the delivery flow, notably if scripts generate false positives
Fortunately, thanks to Agilitest's #nocode approach [Forsyth 2021], the number of so-called “flaky tests” and the amount of technical debt are kept lower than with standard coding.
To go further
- [Beyer 2016] : Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy - “Site Reliability Engineering: How Google Runs Production Systems” - O’Reilly Media - 2016 - ISBN-13: 978-1491929124 - https://landing.google.com/sre/sre-book/toc/index.html
- [Briggs 2008] : Hugh C. Briggs - APR 2008 - “Model Error Budgets” - https://trs.jpl.nasa.gov/bitstream/handle/2014/41445/08-0818.pdf
- [Climent 2020] : Jesus Climent - JUN 2020 - “SRE error budgets and maintenance windows” - Google Cloud Blog - https://cloud.google.com/blog/products/management-tools/sre-error-budgets-and-maintenance-windows
- [Forsyth 2021] : Alexander Forsyth - JAN 2021 - “Low-Code and No-Code: What’s the Difference and When to Use What?” - https://www.outsystems.com/blog/posts/low-code-vs-no-code/
- [Gurumoorthi 2020] : Hari Krishnan Gurumoorthi - MAY 2020 - “Site Reliability Engineering: 6 tips to regaining your error budgets” - https://www.devonblog.com/software-development/site-reliability-engineering-6-tips-to-regaining-your-error-budgets/
- [Hartmann 2020] : Andreas Hartmann - NOV 2020 - “Aging IT products, tech debt and error budgets” - https://www.linkedin.com/pulse/aging-products-tech-debt-error-budgets-andreas-hartmann/
- [Howard 2021] : Joshua Howard - JAN 2021 - “The Error Budget” - The Hardest Work - https://thehardestwork.com/2021/01/21/the-error-budget/
- [Jankovski 2021] : Marin Jankovski & Rachel Nienaber - accessed on 19/AUG/2021 - “Engineering Error Budgets” - https://about.gitlab.com/handbook/engineering/error-budgets/
- [Meléndez 2021] : Christian Meléndez - viewed on 19/AUG/2021 - “Why you need an error budget—and how to make it work“ - https://techbeacon.com/enterprise-it/why-you-need-error-budget-how-make-it-work
- [Reinertsen 2009] : Donald G. Reinertsen - FEB 2009 - “The Principles of Product Development Flow: Second Generation Lean Product Development” - isbn:9781935401001
- [Ross 2017] : A. J. Ross & Adrian Hilton & Dave Rensin - JAN 2017 - “SRE at Google SLOs, SLIs, SLAs, oh my Google Cloud Blog” - https://cloud.google.com/blog/products/devops-sre/availability-part-deux-cre-life-lessons
- [SAFe 2021-06] : SAFe - FEB 2021 - “Principle #6 - Visualize and limit WIP, reduce batch sizes, and manage queue lengths” - https://www.scaledagileframework.com/visualize-and-limit-wip-reduce-batch-sizes-and-manage-queue-lengths/
- [Senge 2010] : Peter M. Senge - APR 2010 - “The Fifth Discipline: The Art and Practice of the Learning Organization” - ISBN:9781407060002
- [Seth 2004] : Nitin Seth & S.G. Deshmukh & Prem Vrat - JUL 2004 - “Service quality models: A review” - https://www.researchgate.net/publication/235286421_Service_quality_models_A_review
- [Watts 2020] : Stephen Watts - NOV 2020 - “Error Budgets Explained: Risk & Reliability in One Metric” - https://www.bmc.com/blogs/error-budgets/#
- [Wiklund 2012] : Kristian Wiklund & Sigrid Eldh & Daniel Sundmark & Kristina Lundqvist - APR 2012 - “Technical Debt in Test Automation” - https://www.researchgate.net/publication/254034665_Technical_Debt_in_Test_Automation