The lost art of checking failed CI jobs

October 02, 2020
thought piece

Image: a movie still from Karate Kid (1984)

Have you ever found yourself at either end of the following interaction between two or more people?

“Can you fix my Jenkins job?”

Person A: Hello

(* waits for Person B to reply)

Person B: Hey what’s up?

Person A: My Jenkins job is failing. Can you check?

Person B: Ok… can you be more specific?

Person A: the job to build XYZ. please check quickly, I am blocked by this!!!!!!!

Person B: Ok… can you give me a link?

Person A: it’s on Jenkins… but ok, give me 1 minute

(* 25 minutes pass by)

Person A: here is the link: …

(* provides the link to the Jenkins job but not to their affected build log)

Person A: please do something about it, our team is blocked by this!!!!!!!

“Can you fix my problem(s)?”

There’s a good chance you’ve seen, and taken part in, different flavours of the above dialogue. And the kicker, most times, is that the dialogue isn’t even about something that truly deserves to be called “mission-critical”.

These dialogues happen between someone who poses a problem, but barely, with little detail, insight, or previous homework, and another person who has to act as the problem-solver (and, indirectly, as the effective problem-poser). The latter is more often than not already busy with a task of their own. Their time ends up being diverted to a great extent, and they’re left feeling that the former could at least have done some of the groundwork to get closer to a solution.

Some of the most common day-to-day flavours are:

The tester who says, enigmatically, “something is failing”, and right away leaves the dialogue on a cliffhanger and goes off to do something else;
The developer who “doesn’t know” how to use a simple tool to perform a task and then asks another developer to do the task instead;
The manager/product owner/agile-something who asks for the status of a ticket without any intention of actually knowing the status (and usually the ask itself serves no purpose other than a psychosomatic outburst);
The staff engineer or tech lead who has no foundation in basic development tooling, and will repeatedly ask the same questions of interns and juniors;
…

Besides the clearly faulty (or weird) communication flow, and perhaps lapses in knowledge, willpower, or trust on the part of one of the parties involved, each of these problems is one face of a much bigger meta-problem.

I’d like to go over what I think the different faces of the bigger meta-problem might be, in no particular order:

Lack of documentation

The most naive of us (read: those of us who haven’t yet been bruised by a few “career-limiting moves” here and there, mixed with the typical “tribal” intra-team and inter-team quarrels that happen a lot in the corporate world) will go forth and say:

Surely the problem (and solution) lies in the lack of documentation!

Fair enough. We all know our “ABCs”:

Documentation in tech projects is not easy;
Good documentation in tech projects is hard;
Good and meaningful documentation in tech projects is of “Call Of Duty Realism” difficulty;

Now, let’s challenge this lack of documentation bit for a moment.

It’s hard to document if the architecture and process are themselves confusing.
If it’s not confusing, it could be that, say, for a newcomer (or even a senior tech person) the documentation isn’t easily accessible. So I can’t necessarily blame someone for not being able to check build logs or application logs by themselves, or for not knowing how to build something on their machine, if for any of these typical tech project actions the documents only show that the actions are confusing, or the documents are inaccessible.
Let’s say the documents are supposedly clear and accessible, but not fully “translatable” without some other bits of implicit knowledge. Sometimes it’s that API token that only some old employees, external employees, and maybe some 15-year-old hackers have, but that you can’t find anywhere or aren’t given permission for straight away. Sometimes it’s a password for a URL that hasn’t been changed since 2005, and only some employees know about it. Or maybe a URL that changed recently, or a build command, or a version…

Who knows these things, right? So “lack of documentation” might sometimes be a false cause, or a misrepresentation, of the problem. Let’s proceed.

Clogged communication pipelines

There’s a saying somewhere on the internet that the communications and structure of projects in companies are a raw mirror of the underlying culture and organisational design of the company. A mirror that shows things for what they are, which more often than not is not what’s on the default “culture PowerPoint decks”.

I’ve seen this up close, and I think many of us would agree that this is somewhat the case.

We’ve all, one way or another, been part of a dysfunctional organization at some point in our lives, and we notice that the communications and codebases often mimic that dysfunction, with typical examples being:

long email threads over nimble decisions;
“red tape” attitude over open source/free software mindset;
over-compartmentalization over teamplay structure;
gaslighting and dissonance over reality about topics that can range from the most mundane, like small technical decisions, to serious topics like employee retention;
rituals and zealous formality over healthy discord and understanding;
…

The list goes on. A few of us are fortunate to get a glimpse over the fence, and fewer of us are fortunate to experience and live through better realities, but the point remains: all of these traits of dysfunction in organisations not only cause bickering in different areas, but also clog the simplest and most mundane of communications, generating (most times) needless back-and-forth dialogues like the one described at the start of the post.

But let’s say this is not your case, and at best you find this face of the problem to be a total anecdotal fallacy, an isolated example that only grumpy people online have to withstand.

Lack of trust, willpower, and habit

Let’s try and put ourselves in the shoes of Person A from the dialogue above. If we’re warmhearted by default, and we believe Person A is well-intentioned, then a lot of logical points could justify the lack of “homework” in looking into and presenting their own problem. Acting professionally on the job doesn’t permit us to act like pricks. We’re still expected to be humane.

Unless…

Would we still be open to helping folks with this kind of problem… after a continuous and prolonged lack of “homework”?

We don’t mind helping once or twice. But what about repeated interruptions with repeated problems? We all have our limits, in the end, even the most gracious of us.

We have to turn to the reason for the repetition to gain some insight:

Maybe stuff is undocumented, or always confusing;
Maybe it’s due to the way communications happen in that organisation;
Maybe there are no incentives for the problem-poser to look deeper;
Maybe the person with the problem (or the one solving the problem) is lazy by nature;
Maybe one or both persons are mean and set on sabotage;
Maybe the persons involved don’t trust each other;
Maybe one or both of the persons don’t value the other’s time;
…

On a personal level, I believe that out of all these reasons, the main one is lack of incentive. Incentive is a powerful thing if done right, and can surpass barriers related to lack of trust, willpower, and even lack of good problem-solving habits.

I know a lot of good hard-working folks, the kind who sweat blood and tears for projects and don’t waste much time advocating weird management technologies or bland ideas (“agile == speed”, KPI/OKR orgies, shallow publicity on collaboration and transformation…), who will still fall into lazy patterns when posing problems, and it’s (almost) always a matter of incentive.

Think of all the technical problems you had in the past, in retrospect. Suppose that for every problem you had been actively incentivised and encouraged to own it, with positive feedback when you found the answer yourself and documented it for those who came after, and that you had exhausted a healthy and sane number of options before interrupting someone else’s flow… would you still have interrupted people as many times as you did?

“Where have all those “tool-lobbyists” aka “computers-are-bicycles-for-the-mind” folks gone?”

J’accuse! – Ted Mosby, Architect in “How I met your mother”

A peculiar pattern with this whole meta-problem is that, once “the fire nation attacked” (aka when we need them the most), the lobbyists of the “united digital renovation tool-enforcers guild”, a species whose habitat lies next to big multinational corporations’ ~~tit~~ budgets, and who should be trying to solve said meta-problems with tools, are ~~reachable through a PO Box in Panama~~ nowhere to be found.

Jokes aside, take questions like:

Why did this CI job fail?
Why did this endpoint call fail?
What happened that left this event in a weird state?
How do I do a version/release/merge?
Where can I look for information?
…

Most of these could be answered with tools.

Inexpensive tools.

Tools that aren’t made by tool-making companies preying on overlooked budgets.

Tools built by users and not salespeople and dealers.

Tools that make the open-source and free-software (free as in freedom) communities weep with joy.
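To make “inexpensive” a bit more concrete, here is a minimal sketch of the kind of tool I have in mind: a few lines of Python that pull the first suspicious lines out of a build log and wrap them in a message you can paste into a chat. Everything in it is an assumption for illustration: the function names, the error keywords, and the example URL are made up, and real CI logs vary a lot, so treat it as a starting point rather than a finished tool.

```python
import re

# Hypothetical keywords; real build logs differ, so adjust these per project.
ERROR_PATTERN = re.compile(r"\b(error|failed|exception|fatal)\b", re.IGNORECASE)


def first_failure(log_text, context=2):
    """Return the first line that looks like a failure, plus nearby lines.

    Returns an empty list when no line matches ERROR_PATTERN.
    """
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            return lines[start:end]
    return []


def build_report(build_url, log_text):
    """Format a short, pasteable problem report for a chat message."""
    excerpt = first_failure(log_text)
    body = "\n".join(excerpt) if excerpt else "(no obvious error line found)"
    return f"Failing build: {build_url}\nFirst suspicious lines:\n{body}"
```

Person A could then open with something like `build_report("https://ci.example/job/xyz/42", log_text)` (a placeholder URL and a log they have already downloaded) instead of “Hello”, and Person B would start with the link and the error in hand.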

(Software) Toolmakers of the world unite!

I agree people might have mixed feelings about this topic, and I dare not put all toolmakers in the same bag!

I invite you to consider the environment we live in right now with regard to tech organisations:

There are those adapting/adapted to a remote-work status that are very mindful of the tooling they use to guarantee healthy-ish work conditions;
There are those that for a while took to social media publicizing total success in adapting to remote-work conditions, and are now secretly suffering the self-induced tragedy of burnout due to even thinner work-life boundaries;
And there are even a few of these that are ~~forcing~~ encouraging “heroes” back to the office… (the “roof cleaning” scene from the Chernobyl series comes to mind);
…

Regardless of their relationship with remote-work policies, take note of the relationship between these organisations and their use of software tools:

Some of them will shift their focus to helpful tools that make for saner remote development processes;
Some of them will double down on their focus on useless/lobby-provided tools;
Some of them will double down on their reliance on control/surveillance tools;
…

Now more than ever is the time to fight for saner tooling and humanist toolmakers. We need tools that help our development process and reduce dialogues like the one above, giving space and time for creative and deeper dialogue.

The inverted scenario

There’s an extra “flavour” of the dialogue that I didn’t mention up until now, one that mixes all of the above meta-problems: the “inverted scenario” of someone posing an actual serious problem, going to a certain length to explain it in detail, laying the common ground… and the other person acknowledging the problem only to immediately act out the “I must go now, my people need me!” meme, or add you to their “mental .gitignore”.

This one is a catch-all scenario.

A somewhat sad reality will kick in over time: even if you and your friends in a team make it part of your character values to invest in meaningful documentation, context-adapted effective foundations, unclogged communication pipelines, ultimate trust, willpower, and habit… you’ll soon find out that there’s somewhere inside (or outside) the organization a living antithesis to these values. Sort of an organisational ode to the “darkest timeline” in the Community series.

Examples:

The colleague who is (or plays) lost on how to debug an issue, and “is blocked” no matter how much useful debug info you provide;
The manager who is (or plays) unaware, or continuously surprised, when it comes to fixing deep-rooted cultural problems;
The “other team” that is clueless on how to solve a problem of theirs, even if you give them multitudes of evidence of what the problem is;
Explaining in transparent detail foundation-shattering issues with suboptimal “corporate employee retention” practices (if they do exist), only to be met with an old-man-shrug.jpg meme by senior leadership or folks in charge of human nuts and bolts departments;
…

I don’t have a good solution or follow-up to this last scenario. It’s said that “it takes two to tango”, so one day we’re the bad guys, some other day we’re the good guys. The world is a messy and beautiful place. If anything, I think being aware of this meta-problem is a first step towards solving it, in whatever context we are in. Care to take a second step? Let’s do it together, on three:

One.

Two.

Image: a moving GIF of a pirate running away, taken from one of the Pirates of the Caribbean movies

Thank you for reading. Feel free to reach out to me with comments, ideas, grammar errors, and suggestions via any of my social media. Until next time, take care! If you are up for it, you can also buy me a coffee ☕