Technically Speaking: We Can Do Better

Tony Allen
Jul 21, 2019

As engineers, we place a lot of emphasis on the efficiency of the things we build. That’s great, really: there’s nothing like reading through highly efficient code and knowing that our precious CPU cycles are not going to waste. However, I’ve observed that on the whole, we don’t approach our technical communications with the same regard. Support-related communication seems to suffer from this the most. I’m not claiming to be some expert on communication within large engineering organizations, but I do think I’ve given this topic more thought than the median engineer who’s shoveling coal in the fire.

The cross-team support model of the organizations I’ve been a part of seems to bias toward engaging people synchronously via Slack. While this model might work fine for smaller engineering orgs, as they grow larger (around the time I show up) it stops scaling, and the mental burden of these engagements becomes too much if lots of teams need to interact with your service. One can spend a great deal of time massaging every question into something actionable via follow-up questions.

I don’t think anyone would argue against removing friction from support-related engagements. After all, efficient engagements between teams allow everyone to spend less time on back-and-forth clarifications and more time on what we do best: building things.

Engaging other teams

When engaging another engineering team (we’ll call them our partners) for support, we typically need either information, a design or code review of some kind, or help with some engineering problem we are facing. What we want in these engagements is to get whatever we need from our partners as quickly as possible and to minimize the mental burden of the engagement. Sometimes the desire for immediate help eclipses the desire to reduce the mental burden on our partners, so our initial engagement looks something like:

[Image: An engineer asks for assistance.]

Technically correct, but not exactly actionable. We can do better. I think it’s possible to get assistance quickly while also reducing the burden of the inevitable follow-up questions on our partners.

A Contrived Example

When reaching out for help debugging an issue, we should seek to give as much information as possible in our initial engagement. This reduces the need for multiple round-trips in the form of follow-up questions. Here is an extreme example of a sub-optimal initial engagement:

DankMemeService has high CPU:<link to some dashboard>

Cool story. Your partner will immediately have questions that in one way or another attempt to answer what, where, when, the extent, the expected behavior, the observed behavior, and what steps you have taken to debug. Reframe the question to include all of this information. Remember, we’re trying to reduce the mental burden on your partner, and one way to do that is to make your question immediately actionable.

Let’s see what other info we can provide:

What:
DankMemeService

Where:
Instances in the staging environment.

When:
From 12:05pm 03/14/2019 to present time.

The extent:
Every host in the staging environment is affected.

Expected behavior:
The instances should hover at ~25% CPU utilization.

Observed behavior:
The instances abruptly spiked to 100% CPU utilization and stayed there.

Steps taken:
Ran ‘top’ and found that DankMemeService is redlining CPU. When looking at CPU traces, something seems to be happening in InteractWithCatGifService(). Peeked in CatGifService and DankMemeService runbooks and there is no entry for this scenario.

We can now morph our sub-optimal and not-really-actionable statement into something… wonderful:

All staging instances of DankMemeService abruptly spiked to 100% CPU utilization at 12:05pm 03/14/2019 and have stayed there ever since. This service’s steady-state CPU utilization typically hovers at ~25%. I looked at CPU traces on the DankMemeService process and found that something fishy is happening when interacting with CatGifService. I peeked in both the CatGifService and DankMemeService runbooks and there is no entry for this behavior. Can somebody assist in debugging this from the CatGifService side? See graphs X, Y, and Z: <relevant links>

Look at that! It’s clear why there’s a problem; there’s a deviation from expected behavior. We know why the partner is being engaged and what is needed from the partner. We even know exactly what graphs they should look at to see evidence of the problem.
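If you want to make the checklist hard to skip, it can be turned into a tiny template. What follows is only a rough sketch in Python, not something from any real support tool: the SupportRequest class, its field names, and the extra "ask" field (what you need from the partner) are all made up for illustration. It simply refuses to render a message while any of the answers is still blank.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SupportRequest:
    """Hypothetical container for the checklist answers (illustrative only)."""
    what: str
    where: str
    when: str
    extent: str
    expected: str
    observed: str
    steps_taken: str
    ask: str  # assumption: what you need from the partner

    def missing(self) -> List[str]:
        # Names of the fields that were left empty or whitespace-only.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        # Refuse to build a message that would invite follow-up questions.
        gaps = self.missing()
        if gaps:
            raise ValueError("missing: " + ", ".join(gaps))
        # One "Label: value" line per field, in checklist order.
        return "\n".join(
            f"{f.name.replace('_', ' ').capitalize()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


request = SupportRequest(
    what="DankMemeService",
    where="Instances in the staging environment",
    when="From 12:05pm 03/14/2019 to present",
    extent="Every host in the staging environment",
    expected="~25% CPU utilization",
    observed="Abrupt spike to 100% CPU utilization that has not recovered",
    steps_taken="Ran top; CPU traces point at InteractWithCatGifService(); no runbook entry",
    ask="Can somebody assist in debugging this from the CatGifService side?",
)
print(request.render())
```

A rendered request is still no substitute for a well-written paragraph like the one above, but the missing-field check catches the "DankMemeService has high CPU" style of message before it ever reaches your partner.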

TL;DR

Your initial engagement should contain detailed answers to what, where, when, extent, expected behavior, observed behavior, and steps taken so far. Providing this information will reduce the need for follow-up questions, allow your partners to begin helping right away, and make engineering orgs more efficient.
