I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was especially excited about was the opportunity to see Honeycomb’s incident retrospective process from the inside. I wasn’t disappointed!
The first retrospective I took part in was for our ingestion delays incident on September 8th. Our preliminary report promised that we’d post more about what happened after our investigation concluded, and the retrospective meeting that I attended was part of that work. Later on, we posted our full analysis.
Right at the start of the retrospective meeting, Fred Hebert blew my mind by reading out the Ground Rules, which I’ll paraphrase here:
Our main objective is to learn and get a better understanding of what happened and what it could mean. We strongly believe that coming up with useful practical improvements is difficult without a good understanding of the aspects of our system that challenged us, and this is what we want to focus on here.
If you have improvement ideas or action items, I would encourage you to note them down for later rather than raising them during this meeting. Once we have a better understanding of this incident, action items usually surface organically.
We’re going for blame-aware incident reviews; we are here to assume people wanted to do a good job, and that they did the best they could to meet objectives. When questions arise about why someone did something, we prefer to focus on why it made sense at the time to take that action.
We’d like to avoid thinking about “what we could have done differently” and instead re-frame that into “what can we do next time to get a better outcome?” It’s a minor shift in perspective, but it helps us be more constructive in our viewpoints.
Ask questions! We’ll maintain a steady progression through the meeting, but there should be room for questions.
If you think something is obvious to others but not to you, ask about it. People tend to have similar questions, and these can highlight unspoken assumptions about how we do work. You can message me privately in Zoom’s chat function if you want your name to remain confidential.
If you have feedback about how we ran the session, we’re happy to receive it.
There’s so much to love in this intro! I’ve been learning about these concepts for years and trying to slowly incorporate them into the incident retrospective culture around me. I was pleasantly surprised to hear that these ideas were already firmly instilled in Honeycomb’s culture.
Let’s look at the ground rules in a little more detail to find out why.
Learning vs. action items
I first came across this concept in the Etsy Debriefing Facilitation Guide, and since its publication, I’ve watched long-standing best practices shift toward an emphasis on learning versus action items. The Howie guide for post-incident analysis by Jeli is another example of an incident analysis framework that embodies this idea.
I have to admit, my thinking on this topic has changed over the past few years. Heck, I co-led an entire conference session on running incident retrospectives that held remediation items as the main goal. However, I now see that we learn so much more when learning is the focus. Searching for remediation items actively gets in the way.
Blame vs. context for decisions
“Why did it make sense to make that decision?” Ask this question in an incident review and you’ll learn more about your sociotechnical system. This one question sets the tone, making those involved in the incident feel safer because they know that everyone is assuming they made the best choice they could at the time based on the information they had.
It’s worth noting that we don’t say “blameless” directly. Instead, we use “blame-aware.” It’s okay to talk about who did something, provided that the discussion is sanctionless; no one is going to be punished for decisions they made in good faith.
In an incident review, a counterfactual question asks, “What should we have done?” This kind of question is dangerous because it conjures up a reality that did not exist. In the process, it brings undertones of blame that will engender defensiveness and stifle the investigation. By phrasing our questions in the form of how we can act in the future, we acknowledge the reality that everyone did the best they could during the incident.
Finally, the ground rules encourage asking questions, even if the answer seems obvious. An incident review is about finding out where our mental models of the system broke down—and bringing those models closer into alignment with the way the system actually works. Everyone’s model is an approximation, and a different one at that. Your question helps you improve your mental model, and almost certainly will help someone else too. Ask it!
Using the ground rules
Creating and publishing the ground rules for incident investigations is the first part, but that’s not enough. I experienced firsthand how important it is to read them aloud before every single retrospective meeting.
In any meeting, chances are there’s someone new who hasn’t heard the rules before. For those who have, reading them aloud is an important reminder that tone and mindset are critical to learning as much as we can from each incident. The end result is an inviting learning environment where everyone feels safe contributing and we all get to learn together as a group.
I’ll end this article by telling you that we’re hiring! If the culture at Honeycomb sounds like it’d be a good fit for you, check out our careers page and see if there’s a match for you.
They’re everywhere. In Slack: “hey, can I get a review on this?” In email: “Your review is requested!” In JIRA: “8 user stories In-Progress” (but code-complete). In your repository: 5 open pull requests. They’re slowing your delivery. They’re interrupting your developers.
How can we get people to review pull requests faster??
Let’s face it: nobody wants to review pull requests. And for good reasons! It takes a lot of time and work. Chelsea Troy describes how to do pull request review right:
In addition to pulling down, running, and modifying the code myself… A maximally effective pull request suggests solutions…in code. It points out what’s working and what’s not, and links to documentation where useful. It highlights laudable work by the original developer, and asks questions before making assumptions or judgments. It explains the reasoning behind suggestions, whether that reasoning affects functionality or adheres to convention. In short, it demands the reviewer’s full participation in finishing the solution that the original developer started. And it prepares the reviewer to take responsibility for this code in the event that the original developer were unable to complete it.
Reviews, done right, have all the painful parts of a software change: understanding what the change is for, loading up the relevant code and tests into working memory, getting my local environment up to see the change, making the tests run. They have none of the fun parts: refactoring to clarity, changing code and seeing a difference. They take hours of time and all my concentration away from whatever it is that I’m personally trying to do.
On top of that, they’re a social interaction minefield! This variable name confused me at first but now I see why they called it that. Should I suggest a change, and require the other developer to do a whole context switch again to improve it? Probably an asshole move. This test doesn’t cover all the cases; I can see one that’s missing. Request another, like the pedant I am? Or figure out how to write it myself, adding another hour?
There’s a cost to every comment, a cost to the submitter’s sense of belonging. A responsible reviewer looks at consequences far beyond the code.
Of course I never want to review pull requests. It’s mentally taxing, takes a lot of time, might damage relationships, and gets me nowhere on the task that has my name on it.
So the twitterverse is asking, how do we get people to do it anyway?
If this is what we’re asking, maybe something is wrong with our priorities.
Maybe we’re asking the wrong question.
What does it say about us that no one wants to review pull requests?
Maybe it says that we trust each other.
Maybe it says that our team has too many concurrent tasks. And by “too many” I mean “more than one”!
We use pull requests to ensure code is understandable by the whole team.
What is our goal with this pull request process? There are several, but I think the primary one is: safe, understandable code. It looks safe to deploy, and it is clear enough to be understood by the rest of the team. Tests can give us confidence in safety, but only a person can evaluate “understandable.”
To change code, a developer first has to understand the code, and understand the change. If the developer was the last person to change this code, then they just have to load it into memory. They’ve understood it before. This should also be true if they reviewed that last change — pull request review spreads that understanding a bit.
A developer gathers this knowledge, then uses it to make decisions about the code. They probably iterate on it a few times, and then they submit something they consider safe and understandable.
But is that code really safe and understandable?!? We must ensure it! Let’s add this whole process again, except the decisions are about approval instead of what to change. We’ll make this asynchronous, yeah, so the submitter can start a whole different task. And if the decision is “no” then we’ll make another asynchronous task and everybody can context switch again!
This defies everything we know about product development flow. We just increased WIP and slowed our response time by adding a wait into the process (at least one wait, really an indeterminate number).
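One rough way to put numbers on this is Little's Law, which says average lead time equals work in progress divided by throughput. The article does not name this law, and the team figures below are invented for illustration, so treat this as a sketch of the reasoning rather than a measurement.

```python
def average_lead_time(wip: float, throughput_per_day: float) -> float:
    """Little's Law: average time in the system = items in progress / completion rate."""
    if throughput_per_day <= 0:
        raise ValueError("throughput_per_day must be positive")
    return wip / throughput_per_day


# Hypothetical team that finishes 2 changes per day (assumed number).
# Working on one or two things at a time:
focused = average_lead_time(wip=2, throughput_per_day=2)

# Same throughput, but 10 changes are open because most are waiting on review:
queued = average_lead_time(wip=10, throughput_per_day=2)

print(f"focused: {focused} days, queued: {queued} days")
```

With throughput held constant, every change parked in a review queue adds to the average time any change takes to ship.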
To improve flow, eliminate queues.
Like Patricia said, maybe this process developed for open-source projects isn’t the best for our in-house teams. Maybe there are better ways to work together.
The pull request process results in code that two people understand. What if we aimed higher?
Maybe instead of trying to work a bit more together, we could work together.
How about: the team makes all code changes as a unit. Ensemble working (the practice formerly known as mob programming), with one shared work product and all the shared knowledge. It will be as safe as everyone can make it, and more than understandable: it’ll be understood by the whole team.
Not every team member will be present every day. Let’s take a page from distributed systems and require a quorum of team members present when we make code changes. At least 2 developers on a team of 3, at least 3 on a team of 5, etc. That way, whenever it’s time to change that code again, someone present was involved in the most recent change.
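The quorum rule above is a strict majority. Here is a minimal sketch of it; the function names `quorum_size` and `has_quorum` are made up for this example and do not come from any tool.

```python
def quorum_size(team_size: int) -> int:
    """Smallest strict majority of the team (hypothetical helper)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return team_size // 2 + 1


def has_quorum(present: int, team_size: int) -> bool:
    """True when enough team members are present to change code."""
    return present >= quorum_size(team_size)


# Matches the numbers in the text: 2 of 3, 3 of 5.
for team in (3, 5):
    print(team, "->", quorum_size(team))

# Any two majorities of the same team must overlap in at least one person,
# because their sizes add up to more than the whole team.
team = 5
assert 2 * quorum_size(team) > team
```

That overlap is why a majority works here: whoever shows up for the next change, at least one of them was in the room for the last one.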
Then there are no queues or waiting, only collaborating on getting the best name, the complete-enough test suite. Every refactor increases the whole team’s understanding of the code. The team develops a common understanding of the code and where it is going, so they can do gradual improvements in a consistent direction.
Does that sound inefficient? Consider the inefficiencies in the queuing for pull requests, the task switches. Not to mention the merge conflicts we get after the pull request sits open for days.
Does it sound wasteful? All that programmer-time dedicated to just one task, when we could be doing three! Well, ask: which of those three is the most important? Why not get that out as quickly as possible and then work on the others? And it is faster, when you never have to ask permission or wait for answers because all the relevant knowledge is right there. (It helps to bring in other people too, when you need knowledge from an adjacent team or specialist.)
Does it sound miserable? Many people hate pair programming; this sounds even worse. Strangely though, it’s better. When there are three or more in a session, there’s less pressure to stare at the screen every second. One person’s attention can wander while the group attention stays. A person can go to the bathroom or answer an urgent question on Slack, while the ensemble remains an ensemble. Pair programming is more exhausting.
Does this seem like an all-day meeting? No, only when we’re changing code. There’s a lot more we do in a day. There’s still email! Each of us has knowledge to acquire and knowledge to share with other teams. I only have six hours of focused brainpower in me in a day. I’d aim for five hours of direct collaboration, and not change production code outside of it.
Does this seem impossible when remote? It is harder. Set up a shared development environment that everyone can connect to for switching. Or start a branch and use git to move code around. Turn your video on, but set up a screen and camera over to the side, so that looking at each other is different from looking at the code. Staring at each other is draining. Working alongside each other is invigorating.
Is your team too large for this? It does get ridiculous with 8-12 people in one meeting. That’s a smell. Either your application is too big (it takes that much knowledge), in which case, can you split it? Or someone thought adding people would speed the work, which is a classic Mythical Man-Month problem.
When working together eliminates all the coordination work and merge pain, the team can be smaller and more responsive.
Piles of waiting pull requests are a symptom of disparate goals within the team.
When we divide tasks among people, we can say “we’re working on it” about several things at once. Is that something your organization wants? If so, then it is holding your team back from focus. If this is the organizational API you need to meet, try marking five tasks “in progress” in JIRA, then working one at a time together.
The team works most smoothly as a unit. Production software needs a team behind it because so much knowledge is required: the purpose of the software, its customers, its interfaces, all the tech it runs on and the data it stores and all the changes in the world (such as vulnerabilities) that it needs to respond to. It takes several people to hold all this, with redundancy. To change software safely, combine all that knowledge. We can do this efficiently together, or painfully alone: asynchronous, with a lot of coordination and unpredictably stepping on each other.
Pull requests are an improvement on working alone. But not on working together.
We know that code review improves outcomes, compared to coding alone without any review. Don’t do that. Do code together, with constant, live review and growing understanding between the team members and the code, and between the team members and each other.
Leave the pull requests for collections of individuals sharing a codebase. Give me direct collaboration on my team.
This short documentary film was produced in the early 1990s by the National Institute for Spectral Haptics as a part of an outreach and education campaign. It serves as a basic overview of the history of Federal Occult Management in the United States throughout the 20th Century.