Monday 26 August 2019

Using Bowtie Diagrams to describe test strategy

Introduction

A long time ago in this blog post I was introduced to the Bowtie Diagram. I love how this visualises how we manage risks and I feel this compliments test strategy. Why? Well surely our test strategies should be accounting for risk and how we manage it. Whether you define testing in a specific focused sense (like functionally testing code) or in a holistic or broader sense (like viewing code reviews as testing, or monitoring or simply asking the question “What do end users want?”) - these activities are ways of either preventing or mitigating risks. I feel if you want to be effective in helping improve the quality of a project or product, you need to assess the potential risks and how you are going to prevent or mitigate them and therefore also assess where your time is best placed.
What are Bowtie Diagrams?
The short version - it’s a diagram that looks like a bowtie, where you describe a hazard, the top likely harmful event then list threats that will trigger the event and consequences from the event happening. For each threat and consequence, you describe a prevention or mitigation.



The long version, (and better more comprehensively described version) can be read here - https://www.cgerisk.com/knowledgebase/The_bowtie_method

What I love about these diagrams is that they more visually describe and explain risks and how we intend to manage them. Creating them is a useful exercise in exploring risks we may not have normally thought of, but in particular, I find we don’t explore the right hand side (consequences) of these diagrams very often. I find most of the time in software development that we are very reactionary to consequences, and even then, we don’t not typically spend much time on improving mitigation. 


Managing the threats
I’ve started using these diagrams to explain my recent approaches to test strategy because they neatly highlight why I’m not focusing all of my attention on automated test scripts or large manual regression test plans. I view automated test scripts as a barrier to the threat of code changes. Perhaps the majority of people out there view this as the biggest threat to quality, perhaps many people understand the word “bug” to mean something the computer has done wrong. These automated test scripts or regression test plans may well catch most of these. But are these the only threats?

I see other threats to quality and I feel we have a role to play in “testing” for these and helping prevent or mitigate them.

Managing the consequences
Any good tester knows you cannot think of or test for everything. There are holes in our test coverage, we constantly make decisions to focus our testing to make the most of precious time. We knowingly or unknowingly choose to not test some scenarios and therefore make micro judgements on risk all of the time. There are also limits to our knowledge and limits to the environments in which we test, sometimes we don’t have control over all of the possible variables. So what happens when an incident happens and we start facing the consequences? Have we thought about how we prevent, mitigate or recover from those consequences? Do we have visibility of these events occurring or signs they may be about to occur?
I find this particular area is rarely talked about in such terms. Perhaps there is some notion of monitoring and alerting. Usually there are some disaster recovery plans. But are testers actively involved? Are we actively improving and re-assessing these? I typically find most projects do not consider this as part of their strategy, in most cases it seems to be an after-thought. I think most of this stems from these areas typically being an area of responsibility for Ops Engineers, SysAdmins, DBAs and the like, whereas as testers we have typically focused on software and application teams. As the concept of DevOps becomes ever popular, we can now start to get more involved in the operational aspects of our products, which I think can relieve a lot of pressure for us to prevent problems even occurring.


Mapping the diagram to testing
An example of using a diagram like this within software development and testing:







I feel our preventative measures to an incident occurring are typically pretty good from a strategic view, especially lately where it has become more and more accepted that embedding testers within development teams improves their effectiveness. Yes, maybe we aren’t writing unit tests or involving testers in the right ways sometimes still. But overall, even with such issues, efforts are being made to improve how we deliver software to production.

But on the right-hand side, we generally suffer in organisations where DevOps has not been adopted. And when I say DevOps, I don’t mean devs the write infrastructure-as-code, I mean teams who are responsible and capable of both delivering software solutions and operating and maintaining them. Usually, we see the Ops side of things separated into its own silo still and very little awareness or involvement from a software development team of their activities. But Ops plays a very key role in the above diagram because they tend to be responsible for implementing and improving the barriers or mitigations that help reduce the impact of an incident.

I feel the diagram neatly brings this element into focus and helps contribute to the wider DevOps movement towards a holistic view of software development, towards including aspects such as maintenance, operability, network security, architecture performance and resilience as qualities of the product too.

As testers I feel we can help advocate for this by:

  • Asking questions such as “if this goes wrong in production, how will we know?”
  • Requesting access to production monitoring and regularly checking for and reporting bugs in production.
  • Encouraging teams to use a TV monitor to show the current usage of production and graphs of performance and errors.
  • If you have programming/technical skills, helping the team add new monitoring checks, not dissimilar to automation checks. (e.g. Sensu checks)
  • Becoming involved with and performing OAT (Operational Acceptance Testing) where you test what happens to the product in both expected downtime (such as deploying new versions) and disaster scenarios, including testing the guides and checklists for recovery.
  • Advocating for Chaos Engineering.

No comments:

Post a Comment