Resiliency and Fault Tolerance in Testing
How do we look at failure when the system (or subsystem) we are observing is the testing process itself, the answer to “how do we build a testing process”?
A lot of software testers may come across different recipes for a test process, for example:
- Creating a bulky test plan, with an emphasis on coverage as a way of “measuring quality”. You attack the problem of testing a system with a ton of scripted test cases;
- Bulky test plans fail to scale. The system becomes clogged with embarrassing bugs and noise. Someone decides to invest in automation;
- Automating the tons of scripted test cases ends up expensive, read: “low reward for a dumb high price”;
- In super dysfunctional cases: fake initiatives, dashboards and pipelines are set up. Signals of “X out of Y is passing” are conveyed in a beautiful fashion. Embarrassing bugs still prevail, but at least “ah, hurray! quality metrics are visible!”
The schools that generate these and other counter-approaches all miss the mark.
Some schools will focus on positive and reassuring ideas on quality and teamwork. Most of the ideas end up being in total dissonance with the world that we live in:
a religion of thousands of denominations, built on the premise of move slow, break stuff, pay later.
We all know its name.
How many of us abide by the idea that the teams we work with are agile? When do we stop and realize said teams are sick with the tyranny of structurelessness? Or that they succumb to bureaucracy mixed with recurring finger-pointing and fear-mongering?
Some schools will advocate for soulless forms of testing. They cater to a world that doesn’t care for craft. Organizations want reassurance that they are following the international standards.
They can’t fail if they follow the standards to the letter, right?
Many know the reality that these same standards are shallow and spineless. Like a self-fulfilling prophecy, the recipes advocated by these entities end up as embarrassing failures.
Some schools struggle to have a marketable voice. The world is in a hurry and can’t suffer to slow down and smell the flowers of “refined” testing. The world is “cruel and cold” and instead demands painless solutions for two sets of problems:
- The problem of annoying embarrassing bugs
- The problem of lack of meaningful automation and modern accessible test tooling
Sure enough, said schools will excel at showing (selling) you an approach for the first problem. As tragic fate would have it, they fail to face the second one. At least through the lens of prospective organizations.
Some schools are full of good intentions. Their focus is on improving business and delivery “using data”! Optimizing for “meta” team quality problems! Completely missing what testing is and what core problems testing intends to solve. Those problems don’t matter. A modern, positive attitude matters. “Let’s get together and (you folks) solve things… While I guide you with The Force”.
There’s something in the market for everyone, and a voice for every ear. And yet, in spite of all the solutions these schools of thought might offer, there is failure. The common tester will notice over time that most approaches fail one way or another:
- Downplaying meaningful problems and solving dummy/placeholder problems;
- Failing to observe the encompassing systems that are failing;
- Not adapting to a system that goes from one dysfunction to the next;
- Failing to survive on their own in frail contexts where projects are “run by idiots for idiots”;
- …
Then come the symptoms and idiosyncrasies that test engineers face over time:
- Missing an embarrassing bug;
- Being blamed for missing an embarrassing bug;
- Having an overload of embarrassing bugs;
- Dealing with an overload of items to test;
- Dealing with idiotic release tempos and childish expectations;
- Having trouble with flakiness when testing a system;
- Suffering an untestable system;
- Hiring more testers when more testers do not equal more impact on test problems;
- Not being able to hire testers when needed, be it through market issues or internal organizational issues/politics;
- Resorting to sweatshop testing companies, later realizing these also have no impact;
- …
The underlying premise is the same as with all “approaches that miss the mark”: No one is looking at failure in the system. Or observing the system itself. No one is sitting behind the wheel. The tragedy of our century:
Testers failing to test themselves.
What can we do about it?
There’s a handful of things that I’ve tried. Some of them turned out successful, some failed in specific contexts, at times. So full disclaimer:
What follows is based on a collection of personal lessons, likely has a lot of holes, and can be hypocritical…
Hopefully these might help others, since some of them helped me make systems more resilient/fault tolerant. At least in the domain of software testing. In no particular order:
- Avoid logical fallacies
- Watch out for sunk cost
- Systematically deconstruct biases
- Add guardrails to the system
- Let Chaos follow its path
Let’s look at each of these a bit in detail:
Avoid logical fallacies
Plenty of folks will appeal to authority, “this is how we do testing, because the Pope of quality, Pontifex Maximus Testingus says so”.
Some folks will repeat that their approaches towards testing or their preferred framework are ideal because their approaches towards testing and preferred frameworks are ideal.
Some folks will plead their case, saying that maybe the current testing approach doesn’t work because folks didn’t believe in it right.
People will attack successful testers’ suggestions because no good can come from testers that smell bad and come from the north.
Besides, we should follow what some dude who worked at a subset of teams, say, at Big Tech Corp said. He has likely never meaningfully tested anything in his life. But he wrote a book about the subject. It must have worked, because Big Tech Corp is successful. Plus, those big tech companies are all known for having zero embarrassing bugs, having users’ best interest in mind and not being evil…
Are you getting the idea here?
Watch out for sunk cost
If you’re doing something, and it’s not working for a while, but it’s still too early to tell, push a bit further.
If you are doing the same thing that has been done for months or years, or something different that is just another way of avoiding problems: stop sinking your time and resources into it. The horse is dead.
Systematically deconstruct biases
Picture someone that one day turns up to you and says:
“it was revealed to me in a dream that from now on we need to automate all the Testing”.
Most craftsman testers’ attitude towards this (cocaine-induced) troll bait is to start washing dirty laundry in public.
Don’t waste your breath. Don’t waste energy by taking that as an opportunity to tell them that “testing can’t be automated, testing is testing, you’re stupid and ignorant and stupid”.
Stop, and instead of taking the moral superiority road, which gets us nowhere, take a step back and deconstruct:
- Why does this person have this belief/bias towards something like testing?
- Why do folks have wrongful ideas towards what quality assurance and testing really are?
- Why do many folks believe in and trust cash grabs like “AI-driven testing”?
- …
We never take a step back to understand why and how these beliefs took shape in the person’s mind in the first place. You waste less energy deconstructing and studying biases than you do replying to them directly in a counterattack flame war.
Deconstruct a wrongful concept’s origin first, before trying to correct someone. This will force you to take a step back and look at the entire system. A system that induced the person to think that way for a reason.
Add guardrails to the system
Stick with pure-ish guiding principles that will get you some hygiene-guardrails in the long run:
- Avoid cognitive biases (check out Charity Majors’ post about software deploys and cognitive biases for more examples);
- Avoid blind acceptance of ideas, tools, frameworks (linked with logical fallacies), and base your tooling decisions on your own guiding principles (check this post where I talk about a few principles for choosing a test framework);
- Prefer adapting to context over standardization;
- Study failure actively and systematically (see the sketch after this list);
- …
Keep the principles simple and easy to understand. Act on the principles in abundance.
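To make “study failure actively and systematically” a bit more concrete, here is a minimal sketch of one possible hygiene-guardrail: tracking per-test outcomes across CI runs and flagging flaky tests as quarantine candidates, so they stop polluting the signal. This is a toy Python example with made-up test names and data, not a prescription; it assumes you can export per-test pass/fail results from whatever CI you use.

```python
# A toy flakiness guardrail: given a history of per-test outcomes,
# flag tests whose results are inconsistent across recent runs.
from collections import defaultdict

# Hypothetical input: one (test_name, passed) pair per test per CI run.
# In practice you would parse this out of your CI's test reports.
history = [
    ("test_login", True), ("test_login", False), ("test_login", True),
    ("test_checkout", True), ("test_checkout", True), ("test_checkout", True),
    ("test_search", False), ("test_search", False), ("test_search", False),
]

def find_flaky(results, min_runs=3):
    """Return test names that both passed and failed across enough runs."""
    outcomes = defaultdict(list)
    for name, passed in results:
        outcomes[name].append(passed)
    return sorted(
        name
        for name, runs in outcomes.items()
        if len(runs) >= min_runs and len(set(runs)) > 1
    )

if __name__ == "__main__":
    for name in find_flaky(history):
        # A guardrail would quarantine these, not silently retry them.
        print(f"Flaky, quarantine candidate: {name}")
```

The point is less the script and more the habit: failure data gets collected and looked at systematically, instead of nobody sitting behind the wheel.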
Let Chaos follow its path
If there’s anything that Jurassic Park has taught us, it’s that life will always find a way. Sometimes we must let life do its part and follow its course.
Think of all the systems (as in tech organizations, development, integration and deployment processes, etc.) that are broken by design.
They’re broken and yet they’re supported by many willing/unwilling participants. If they are maintained in a broken fashion for a long while, despite our attempts to make them better, it might mean someone in power is benefiting from the broken system.
And someone is being taken advantage of.
So, sometimes it’s better to know when to leave. Know when to let extremely broken systems self-destruct.
If you read this far, thank you. Feel free to reach out to me with comments, ideas, grammar errors, and suggestions via any of my social media. Until next time, stay safe, take care! If you are up for it, you can also buy me a coffee ☕
Special shout-out to my friend Jorge for proof-reading and giving me new ideas while I drafted this post.