That stick figure diagram is the most joy I've had making an illustration in years.
Co-authored by Kelly Shortridge and Ryan Petrich
We hear about all the ways to make your deploys so glorious that your pipelines poop rainbows and services saunter off into the sunset together. But what we don’t see as much is folklore of how to make your deploys suffer.1
Where are the nightmarish tales of our brave little deploy quivering in their worn, unpatched boots – trembling in a realm gory and grim where pipelines rumble towards the thorny, howling woods of production? Such tales are swept aside so we can pretend the world is nice (it is not).
To address this poignant market painpoint, this post is a cursed compendium of 69 ways to fuck up your deploy. Some are ways to fuck up your deploy now and some are ways to fuck it up for Future You. Some of the fuckups may already have happened and are waiting to pounce on you, the unsuspecting prey. Some of them desecrate your performance. Some of them leak data. Some of them thunderstrike and flabbergast, shattering our mental models.
All of them make for a very bad time.
We’ve structured this post into 10 themes of fuckups plus the singularly horrible fuckup of manual deploys. For your convenience, these themes are linked in the Table of Turmoil below so you can browse between soul-butchering meetings or existential crises. We are not liable for any excess anxiety provoked by reading these dastardly deeds… but we like to think this post will help many mortals avoid pain and pandemonium in the future.
The Table of Turmoil:
Permissions are perhaps the final boss of Deployment Dark Souls; they are fiddly, easily forgotten, and never forgiven by the universe.
“Allow all access” is simple and makes deployment easy. You’ll never get a permission failure! It makes for infinite possibilities! Even Sonic would wonder at our speed!
And indeed, dear reader, what wonder “allow *” inspires… like a wonder for what services the app actually talks to and what we might need to monitor; a wonder for what data the app actually reads and modifies; a wonder for how many other services could go down if the app misbehaved; and a wonder for exactly how many other teams we might inconvenience during an incident.
Whether for quality’s sake or security’s, we should not trade simplicity today for torment tomorrow.
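The difference is easy to see in miniature. Below is a minimal sketch (the service and action names are made up): an explicit allowlist doubles as documentation of what the app actually talks to, while "allow *" documents nothing and monitors nothing.

```python
# Hypothetical policy: only the (service, action) pairs the app truly
# needs. The names here are illustrative assumptions.
EXPLICIT_POLICY = {
    "billing-db": {"read"},
    "email-service": {"send"},
}

def is_allowed(service: str, action: str, policy: dict) -> bool:
    """Deny by default: grant only pairs that are explicitly listed."""
    return action in policy.get(service, set())
```

Reading the policy now tells you exactly which services to monitor and which teams you could inconvenience, which "allow *" never will.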
Key management systems (KMS) are complex and can be ornery. Instead of taming these complex beasts – requiring persistence and perhaps the assistance of Athena herself to ride the beast onward to glory – it can be tempting to store keys in plaintext where they are easily understandable by engineers and operators.
If anything goes wrong, they can simply examine the text with their eyeballs. Unfortunately, attackers also have eyeballs and will be grateful that you have saved them a lot of steps in pwning your prod. And if engineers write the keys down somewhere for manual use in an “emergency” or after they’ve left the company… thoughts and prayers.
You’ve already realized storing keys in plaintext is unwise (F.U. #2) and upgraded to a key management system to coordinate the dance of the keys. Now you can rotate keys with ease and have knowledge of when they were used! Alas, no one set up any roles or permissions and so every engineer and operator has access to all of the keys.
At least you now have logs of who accessed which keys so you can see who possibly leaked or misused a key when it happens, right? But how useful are those logs when they are simply a list of employees that are trusted to make deploys or respond to incidents?
The logical conclusion of fully automated deployments is being able to push to production via SCM operations (aka “GitOps”). Someone pushes a branch, automation decides it was a release, and now you have a “fun” incident response surprise party to resolve the accidental deploy.
One option is to enforce sufficient restrictions on who can push to which branches and in what circumstances. Or, you can go on a yearlong silent meditation retreat to cultivate the inner peace necessary to be comfortable with surprise deployments.
The common “mitigation” “plan” is to only hire devs who have a full understanding of how git works, train them properly2 on your precise GitOops workflow, and trust that they’ll never make a mistake… but we all know that’s just your lizard brain’s reckless optimism telling logic to stop harshing its vibes. Make it go sun on a rock somewhere instead.
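If you go the restriction route, the guard can be tiny. Here is a hedged sketch (the tag pattern is an assumption, not anyone's real workflow): a pushed branch is never a release; only version tags may trigger a production deploy.

```python
import re

# Illustrative GitOps guard: deploy only on refs that match an explicit
# release-tag pattern. The pattern itself is an assumption.
RELEASE_REF = re.compile(r"^refs/tags/v\d+\.\d+\.\d+$")

def should_deploy(ref: str) -> bool:
    """A pushed branch never deploys; only exact version tags do."""
    return bool(RELEASE_REF.match(ref))
```

Pair this with server-side branch protection so the check can't be skipped by a well-meaning human in a hurry.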
Sometimes build tooling is janky and deployment tooling even jankier. After all, you don’t ship this code to users, so it’s okay if it’s less tidy (or so we tell ourselves). Because working with key management systems can be frustrating, it’s tempting to include the keys in the script itself.
Now anyone who has your source code can deploy it as your normal workflow would. Good luck maintaining an accurate history of who deployed what and when, especially when the “who” is the intern who git clone’d the codebase and started “experimenting” with it.
You’ve decided that developers should be free to choose the best libraries and tools necessary to get the job done, and why shouldn’t they? For many, this will be a homegrown monstrosity that has no tests or documentation and is written in their Own Personal Style™. The dev who chose it is the only one who knows how to use it, but it’s convenient for them.
But is it the most convenient choice for everyone else? What about when the employee leaves and shutters their github account? The supply chain attack3 is coming from inside the house!
As Milton Friedman quipped years ago, “Nothing is so permanent as a temporary access token.”
The deployment is complicated and automating all of the steps is a lot of work, so the logical path is to deploy the service manually just this once. The next quarter, there’s an incident and to get the system operational again, it’s quickest to let the team lead log in and manually repair it.
But after the access is added, it’s all too easy to overlook removing the access. Employees would never take shortcuts or abuse their access, right? And their accounts or devices could never be compromised by attackers, right?
Leadership claims your onboarding and offboarding checklists are exhaustive and followed perfectly every time. And, indeed, your resilience and security goals rely on them being followed perfectly. A safety job well done! No one will be able to deploy your application after they’ve put in notice!
What’s that? That wasn’t part of your checklist, too? Or did you skip over that item because it’s too hard to rotate the keys if some employees quit because they’re too essential and baked into too many systems?
You’ve replaced those keys but they aren’t destroyed and aren’t revoked and don’t expire, so your only hope now is the org didn’t piss off the employees enough for them to YOLO rage around in prod. Sure, former employees have always expressed goodwill towards your company and no one has ever left disgruntled… but would you bet on that staying true?
Sharing credentials isn’t just something engineers and operators share between themselves. If you’re extra lucky, they’ll bake them into the software or services and then when they leave or transfer to a new department, the system will fail when their permissions are revoked. Maybe sharing isn’t caring.
Some businesses run on engagement. The more users interact with the platform, the more they induce others to interact, which means more advertising messages you can show them with a more precise understanding of what they might buy. Teams track engagement metrics closely and every little design change is justified or rescinded by how it performs on these metrics. It’s a merry-go-round of incentives and dark patterns.
But one day you migrate to a new login token format or seed, forcing everyone to log in again and the metrics are fucked because many users don’t want to go to the trouble. Those fantastic growth numbers you hoped would bolster your company’s next VC round no longer exist because you broke the cycle of engagement addiction.
Logging and monitoring are essential, which is why getting them wrong wounds us like a Minotaur’s horn through the heart.
Systems are hard to analyze without breadcrumbs describing what happened, so logging is an essential quality of an observable system.
Ever-lurking in engineering teams is the natural temptation to log more things. You might need some information in a scenario you haven’t thought of yet, so why not log it? It will be behind the debug level anyway, so it does no harm in production…
…until someone needs to debug a live instance and turns the logging up to 11. Now the system is bogged down by a deluge of logging messages full of references to internal operations, data structures, and other minutiae. The poor soul tasked with understanding the system is looking for hay in a needlestack.
Worse, someone could enable debugging in pre-production where traffic isn’t as high4 and not notice before deploying to the live environment. Now all your production machines are printing logs with CVS receipt-levels of waste, potentially flooding your logging system. If you’re extra unlucky, some of your shared logging infrastructure is taken offline and multiple teams must declare an incident.
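One defensive option is to cap how fast debug output can flood shared infrastructure. The sketch below is illustrative only (the per-second limit and windowing are assumptions, not a production design):

```python
import logging
import time

class DebugThrottle(logging.Filter):
    """Pass at most `limit` DEBUG records per second; INFO and above
    always pass. A crude safety valve for "logging turned up to 11"."""

    def __init__(self, limit: int = 100):
        super().__init__()
        self.limit = limit
        self.window = int(time.time())
        self.count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # never drop INFO, WARNING, ERROR
        now = int(time.time())
        if now != self.window:
            self.window, self.count = now, 0
        self.count += 1
        return self.count <= self.limit
```

Attach it to the handler feeding your shared log pipeline so one enthusiastic debug session can't take the whole thing offline.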
Who doesn’t want peace and quiet? But when logs are quiet, the peace is potentially artificial.
Logs could be configured to the wrong endpoint or fail to write for whatever reason; you wouldn’t even be aware of it because the error message is in the logs that you aren’t receiving. Logs could also be turned off; maybe that’s an option for performance testing5.
Either way, you better hope that the system is performing properly and that you planned adequate capacity. Because if the system ever runs hot or hits a bottleneck, it has no way of telling you.
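A cheap countermeasure is a "dead man's switch" on the log stream itself: if no log line has arrived within some window, the silence becomes the alert. A minimal sketch, with the threshold as an assumption you'd tune to your traffic:

```python
import time

# Hypothetical log-pipeline canary: healthy systems emit *something*
# regularly, so prolonged silence is itself an alarm condition.
def logs_are_suspiciously_quiet(last_log_ts, max_silence=300.0, now=None):
    """True if no log has been received for more than max_silence seconds."""
    now = time.time() if now is None else now
    return (now - last_log_ts) > max_silence
```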
Your log pipelines were set up years ago by employees long gone. Also long gone is the SIEM to which logs were being sent. Years go by, an incident happens, and during investigation you realize this fatal mistake. Your only recourse is locally-saved logs, which, for capacity reasons, are woefully itsy bitsy and you are the spider stuck in a spout, awash in your own tears.
You’ve been doing this DevOps thing awhile and have a mature process that involves canary deployments to ensure even failed updates won’t incur downtime for users. Deployments are routine and refined to a science. Uptime clearly matters to you. Only this time, the canary fails in a way that your process fails to notice.
An alternative scenario is that some part of the process wasn’t followed and a dead canary is overlooked. You miss the corpse that is your new version and kill the entire flock.
Having a process and system in place to prevent failure and then completely ignoring it and failing anyway likely deserves its own achievement award. Do you need a better process, or do you need to fix the tools? How can you avoid this in the future? This will be furiously debated in the post-mortem, which, if blameful rather than blameless, will likely result in this failure repeating within the next year.
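However you land in the debate, the promotion decision itself should be a machine-enforced gate, not a human glancing at a dashboard. A minimal sketch (the metric and tolerance are illustrative assumptions):

```python
# Hypothetical canary gate: promote the new version only if the canary's
# error rate stays within a tolerance of the baseline fleet's.
def canary_is_healthy(canary_error_rate: float,
                      baseline_error_rate: float,
                      tolerance: float = 0.01) -> bool:
    return canary_error_rate <= baseline_error_rate + tolerance
```

If this returns False, the pipeline halts and pages a human; the one thing it must never do is quietly proceed past a dead canary.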
A system is crying out for help. Its calls are sent into the cold, uncaring void. Its lamentable fate is only discovered months later when a downstream system goes haywire or a customer complains about suspiciously missing data.
“How could it be failing for so long?” you wonder as you stand it back up before adding a “please monitor this” ticket to the team’s backlog that they’ll definitely, totes for sure get to in the next sprint.
Yay, the new version of the service writes more log data to make it easier to operate, monitor, and debug the service should something ever go wrong! But, there’s a catch: some of the new log messages include sensitive data such as passwords or credit card details. This may not even be purposeful. Perhaps it logs the contents of the incoming request when a particular logging mode is enabled.
Unfortunately, there are very specific rules that businesses of your type must follow when handling certain types of data and your logging pipeline doesn’t follow any of them. Now your near-term plans are decimated by the effort to clean up or redact logs that you otherwise wouldn’t have to if the engineer that added that logging knew about the data handling requirements. By the way, the IPO is in a few months. XOXO.
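A last-ditch mitigation is a redaction pass on log lines before they leave the process. The sketch below is illustrative only: the field names are assumptions, and this is not a compliance solution; the real fix is to never log sensitive fields at all.

```python
import re

# Hypothetical scrubber: mask values for a few known-sensitive field
# names in "key=value" or JSON-ish "key: value" log lines.
SENSITIVE = re.compile(r'("?(?:password|card_number|ssn)"?\s*[:=]\s*)\S+',
                       re.IGNORECASE)

def redact(line: str) -> str:
    return SENSITIVE.sub(r"\1[REDACTED]", line)
```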
There were assumptions about what you deployed and those assumptions were wrong.
Builds are automated and we tested the output of the previous build, so what’s the harm of rebuilding as part of the deployment process? Not so fast.
Unless your build is reproducible, the results you receive may be somewhat different. Dependencies may have been updated. Docker caching may give you a newer (or older, surprisingly!) base image6. Even something as simple as changing the order of linked libraries7 could result in software that differs from what was tested.
Configurations fall prey to this, too. “Well, it works with allow-all!” Right, but it doesn’t work in production because the security policy is different in pre-prod. Or, the new version requires additional permissions or resources which were configured manually in the test environment… but, cranial working memory is terribly finite, and thus they were forgotten in prod.
There are numerous solutions to this problem (like reproducible builds or asset archiving), but you may not bother to employ them until a broken production deploy prompts you to. And some of the solutions descend into a stupid sort of fatalism: “If we don’t have fully reproducible builds, we don’t have anything, there’s no point to any of this.” And then Nietzsche rolls in his grave.
We have to move fast. New features. Tickets. Story points. Ship, ship, ship. Developers with the velocity of a projectile. Errors? Bah, log them and move on.
If something is incorrect, surely it will be noticed in test or be reported by users – spoken by someone who has never faced an angry customer because their data was leaked or discovered their lowest rung employee fuming with resentment when they see the company’s fat margins.
Alas, too often we see a new version which forgets to check auth cookies, roles, groups, and so forth because devs test it as admin with the premium enterprise plan, but forget lowly regular users on the free tier can’t do and see everything.
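One antidote is to make role requirements explicit data so your tests can assert what a free-tier user *cannot* do, not just what an admin can. A hedged sketch (the endpoints and roles are made up):

```python
# Hypothetical role map: every endpoint declares who may reach it,
# and unknown endpoints grant access to no one (deny by default).
ENDPOINT_ROLES = {
    "/admin/users": {"admin"},
    "/reports/export": {"admin", "premium"},
    "/dashboard": {"admin", "premium", "free"},
}

def can_access(role: str, endpoint: str) -> bool:
    return role in ENDPOINT_ROLES.get(endpoint, set())
```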
Your infrastructure is declarative, but the world is not. The app works in isolation, but doesn’t accept the data from the previous version or behaves weirdly when faced with it.
Possibly the schema has changed, but the migration path for existing data (like user records) was never tested. You didn’t test it because you recreated your environment each time. The new version no longer preserves the same invariants as the old version and you watch in horror as other components in the system topple one by one.
Possibly you’re using a NoSQL database or some other data store for which there isn’t a schema and now the work of data migration falls on the application… but no one designed or tested for that.
Or, maybe you’re pushing a large number of updates to a rarely used part of your networking stack. For those that are all-in on infrastructure as code (IaC), supporting old schema, data, and user sessions can be a thorny problem.
A shocking number of outages spawn from what is, in theory, a simple little configuration change. “How much could one configuration change break, Michael?”
Many teams overlook just how much damage configuration changes can engender. Configuration is the essential connective tissue between services and just about anything that can be configured can cause breakage when misconfigured.
The clock is ticking and sweat is sopping your brow. Something must be done to avoid an outage or data loss or some other negative consequence. This fix is at least something and this something seems like it should work8. You deploy it now because time is of the essence. It fails and you now have less time or have caused more mess to clean up.
Only in hindsight do you realize a better option was available. Or, maybe the option you chose was the best one, but you made a small mistake. Was the haste worth it?
Urgency changes your decision-making. It’s a well-intentioned evolutionary design of your brain that causes unfortunate side effects when dealing with computer systems. In fact, “urgency” could probably be its own macro class of deploy fails given its prevalence as a factor in them.
“Everything works in staging! How could it have failed when we pushed it live? I thought we did everything right by testing the schema migration with our test data and load testing the new version.”
Narrator: The software engineer is in their natural habitat. Observe how they pull at their own hair, a hallmark of their species to signal that something has distressed them. It is very difficult to replicate everything that’s happening in production in an artificial test environment without some sort of replication or replay system. This vexes our otherwise clever engineer.
“It causes a crash!? What kind of deranged mortal would have an apostrophe in their name? Oh, it’s common in some cultures? Hmmm…”
If you keep your service online as you deploy, you should really test your upgrade path under simulated load. If you don’t, you can’t be sure if your planned upgrade process will work or how long it will take.
When you make deployments easy, it is possible to make deploying to prod too easy. And easy to use doesn’t necessarily mean easy to understand. When a slip of the finger results in code going live, you may want to consider just how far you’ve taken automation and if other parts of your process need to catch up.
Because one day, a sleep-deprived Future You is going to run a deploy script where you have to pass in an environment name and you will type dve instead of dev. Once it dawns on you that the deploy system falls back to “prod” as the default, adrenaline shocks you awake with the force of 9000 espressos and you will never sleep again.
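The fix is a deploy entrypoint that refuses to guess. A minimal sketch (the environment names are illustrative): an unknown name aborts the deploy instead of falling back to prod.

```python
# Hypothetical environment resolver: typos die loudly at the gate
# rather than quietly defaulting to production.
ENVIRONMENTS = {"dev", "staging", "prod"}

def resolve_environment(name: str) -> str:
    if name not in ENVIRONMENTS:
        raise SystemExit(
            f"unknown environment {name!r}; expected one of {sorted(ENVIRONMENTS)}"
        )
    return name
```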
The regrettable reality is that internal tools often offer terrible UX because engineers refuse to give themselves nice things (including therapy). These tools, akin to a rusty sword with no hilt, make these sorts of failures tragically common. The rise of platform engineering is hopefully an antidote to this phenomenon, treating software engineers as user personas worthy of UX investments, too.
You have a staging environment (congrats!), but it’s an ancient clone from production which has seen so many failed builds, bizarre testing runs, and manual configs that it bears only a pale resemblance to the system it’s supposed to epitomize. It gives you confidence that your software could deploy successfully, but not much else.
You wish you could tear it down and rebuild it anew, but everyone’s busy and it’s never quite important enough for someone to start working on it rather than some other task. Thus you’re doomed to clean up small messes that could be caught by a true staging environment.
At the next DevOps conference you attend, every keynote speaker refers to the “fact” that “everyone” has a “high-fidelity” staging environment (“obviously”) as you weep in silence.
Production systems are incredibly important and we must patch frequently to keep them in compliance. But the same diligence isn’t applied to pre-prod, development, build and other environments.
The systems in these environments may therefore be wildly out of date and the software they produce may be incompatible with the up-to-date, patched production system. Systems will drift so far from the standard that QA systems look like an alternate reality from production and make you a believer in the multiverse hypothesis.
A production deploy requires a backup because hot damn have we fucked it up so many times and a backup makes everyone feel more confident. The administrator responsible for performing the backup writes the backup over the live system, causing an outage. Furthermore, because the data was overwritten by the botched backup, any existing backups are not recent.
Backup fuckups happen more than anyone admits and when they go down, they go down hard. Recovering from them is rough because no one thinks it will happen to them.
Lesser failures in this category include saturating the disk or network IO of the host taking the backup or filling the disk – each perfectly capable of causing an outage, too.
Audit logs are accidentally turned off during a configuration change or as part of a software upgrade and now the system is out of compliance. No one notices until the auditors ask for the audit logs months later and a wave of panic ripples through the teams involved.
Will we fail the audit? Will customers drop us? How much revenue is impacted? Will we still get raises at our quarterly review? Will I have to switch to getting artisanal roasted bean elixirs every other day?
Only the simplest services run entirely isolated without any other dependencies. When done right, dependencies are properly documented and the infrastructure dependencies of each component are clear. Even better, the dependencies are specified declaratively, rendering it impossible for the human-generated documentation to drift from the machine specification.
But in less auspicious cases, the dependencies are hazy and can even form chains which loop back on themselves like a branching ouroboros eating its own rotting tails. Debugging a production incident for a system with unknown dependencies is software archeology where the only treasure is tears.
The infrastructure upgrade toppled some of the apps and services running on top of it, but the people deploying the upgrade lack context on those casualties and you all wonder when the Jigsaw puppet will come into view and reveal this has all been a grand experiment to pit you against each other.
“We upgraded the OS, clearly everything will be fine!” My brother in christ your system fetches Kerberos creds automatically on boot, but your first boot on a fresh host fails because the Kerberos fetch infra depends on a QA host that was decommed 6 months ago!
And then there’s the ultimate iceberg dependency: DNS. If DNS is borked or misconfigured, all sorts of thorny problems can emerge.
Vendors make all sorts of claims about the behavior of their wares. It’s fast and stable. It migrates its data format. It slices and dices. It follows semver. Should you believe them? In a word, no.
Playing god with your environments does not always result in intelligent design.
Per-environment configuration is a fact of life. Hostnames, instance counts, and other configuration settings will be necessarily different between environments. Keeping these up to date can be a challenge and it’s all too easy to overlook updating the production template when new configs must be added.
New configuration values are often copied from a staging template into the production one without appropriate adjustments like switching the hostname. You will wonder which evil eldritch god you pissed off when deploying to prod takes down both the production and staging environments. This is so frequent and yet! and yet.
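A pre-apply lint can catch most of these copy-paste leaks. A hedged sketch (the config keys and the "staging" marker are assumptions about your naming conventions):

```python
# Illustrative guard run before applying a prod config: flag any string
# values that still point at staging after a template copy-paste.
def staging_leaks(config: dict, marker: str = "staging") -> list:
    return [key for key, value in config.items()
            if isinstance(value, str) and marker in value]
```

Run it in CI against the rendered production config; a non-empty result fails the pipeline before the eldritch gods get involved.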
Deploying a configuration change is easy: apply the configuration, restart the service. You might think it should be easy to remember the steps when there’s only two of them, but it’s easy to overlook for quick deployments. Only later do you realize you set a new config variable in prod without applying it to the prod instances.
Design patterns like the D.I.E. triad can help — there’s no way for infrastructure to drift if it’s redeployed from scratch on each deployment. And, of course, automated deployments can help, too.
Feature flags are a simple and amazing way to explode the number of system states you must test. N flags make for 2^N combinations. Are all of them tested? Are you sure they’re all set correctly? Do the people who test your application have the same flags as the unwashed masses? Are there old feature flags in your app that should be retired? What could happen if they were activated mistakenly? (just ask Knight Capital).
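The combinatorics the paragraph describes are easy to demonstrate: N boolean flags yield 2**N distinct system states, each of which is a configuration you may or may not have ever run.

```python
from itertools import product

# Enumerate every on/off combination of a set of boolean feature flags.
def flag_states(flags):
    return [dict(zip(flags, values))
            for values in product([False, True], repeat=len(flags))]
```

Even enumerating them (let alone testing them) gets out of hand fast: 10 flags is already 1024 states.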
Maybe you push a release before the company holiday party and deploy the entire release successfully… until a few intoxicants in you realize you forgot to flip the feature flag and now you’re crying in the bar’s bathroom shakily singing along to Mariah Carey (though you suspect the “baby” she desperately wants for xmas isn’t a feature flag).
It’s also possible that you do the exact opposite and flip the flag too soon. Maybe a new product is accidentally announced early, deflating all the carefully constructed marketing plans leading up to the company conference and leading customers to ask why the new feature is “broken.” You had just regained the respect of the customer support team too…
Perhaps the new feature simply uses too many resources and you haven’t scaled your infrastructure appropriately. Or maybe the freemium gate is broken and everyone gets access to premium features. Good luck explaining to customers why you now have to take away their new shiny feature unless they pay up for it…
Faulty configuration may not necessarily cause failures immediately. It’s only after you do some other, seemingly unrelated operation that the fault causes any symptoms. As in medicine, it can be difficult to untangle exactly which faults cause which symptoms.
For systems like load balancers or orchestrators, a bad configuration can remain in place and as long as the system is stable, the misconfiguration will cause no ill effects. But one day when you decommission a cluster as planned, another cluster immediately shits itself – suffering a total outage baffling everyone – and only after many painful hours of debugging do you realize its healthchecking was configured against the one you decommed.
If the team owning that other cluster has poor monitoring hygiene, they may only discover their service is dead much later. But the outage gods care not for your mortal troubles and will do nothing to ease the pain of what is now a multi-day incident all due to faulty health checks.
The layers far underneath your application can still cause your deployment to fail.
Orchestrator fails? Your service is dead. Operating system fails? Dead. Disk controller fails? Dead. BGP? Dead. DNS? Dead. Backhoe cuts the backbone to your sole datacentre? Dead. NVMe subtly violates DMA protocol? Dead. NIC driver fails or goes rogue? Dead. Baseboard management controller borks? Dead. Deploy a bunch of new machines into a cluster with a bad BIOS? This may shock you, but: dead.
Deployments may appear to succeed only to fail hours or days later if you have periodic background jobs or the ability to schedule tasks. The deployment isn’t successful until these jobs and tasks run successfully.
Perhaps you deploy a busted systemd timer which causes all your nodes to self-destruct after 8 hours… and only discover this “fun” fact after you deploy to your first tranche in prod. See also: the dreaded slow memory leak.
Another variant is the odd date/time bug which causes the application to malfunction only on leap years or when daylight savings time occurs. If you’re not swift with your incident response, the incident resolves itself and you’re left scratching your heads until someone realizes it’s because the clocks rolled back.
Do you bother fixing the bug? Or do you hope to find another job before the next orbital period elapses?
Components may have poorly documented or undocumented limitations or may simply become unusably slow when assigned more work than they were designed for. Does your database have a limit on the number of connections? Better not scale the number of clients beyond that number, then!
Is it a deployment failure? Yes, if a deployment pushes the system beyond its limits, which is more likely to happen when you add new v2 replicas before retiring old v1 replicas.
In the microservices world, this can manifest as running so many k8s jobs without deleting them that all the k8s operations on jobs begin taking tens of seconds because the cluster is bogged down with so much cluster metadata. Is the inevitable conclusion of microservices simply more microservice instances and metadata than actual work and user data? Makes u think.
Some well-meaning person may decide that builds for a piece of software always use the latest version of its dependencies. This ensures that whenever you release, you always have the latest security patches.
This sounds wise until one of the dependencies causes a subtle API breakage and your app fails to function. Or, any of your dependencies’ authors could decide “fuck this, I’m not maintaining this open source project anymore and giving corporations free labor” and push a dead version of a package.
Now you’re unable to build new versions of the app until someone resolves the dependency situation. Worse, if that fed-up developer has pulled their old versions out of spite or frustration with the pain of maintaining OSS, and if you haven’t archived builds of old versions, then you may not even be able to deploy at all. And this is how you end up cursing a random dev you hadn’t even heard of until just now when you should be taking your lunch break.
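A simple mitigation is to insist on exact pins plus an archive of everything you depend on. The check below is a hedged sketch (package names are made up, and real requirement syntax has more cases than this):

```python
# Hypothetical lockfile sanity check: flag any dependency declared
# without an exact "==" pin, since floating versions are how a yanked
# or breaking release bricks your builds.
def unpinned(requirements):
    return [req for req in requirements if "==" not in req]
```

Pair the check with an artifact mirror of pinned versions so a rage-quit upstream can't take your deploys down with it.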
Mere mortals cannot maintain accurate mental models of data in distributed systems. Even the divines struggle.
Some irreversible process fails part way through your deploy. Possibly it was a migration or some other critical step during your deployment. For whatever reason, this step didn’t happen when deploying to the other environments; it only happened in the one environment that matters most.
What state is the system actually in? Should you rollback? If you try to roll back, will it even work? You’re in uncharted waters under shrouded stars.
Data migrations are often a one-way process. Have you tried migrating all of your existing data to see what happens? How long does it take? Do you have backups? Could you even use the backups, or would restoring result in yet more downtime?
If you don’t know the answers to these questions, you might find yourself deploying an ORM/data model layer which automatically migrates read-only database values to a new format and somehow corrupts the records, resulting in you frantically trying to patch and deploy a fix before too much of your DB becomes unreadable.
Or perhaps you set --timeout 10 on your ORM migration with the innocent assumption that “10” here refers to seconds. It’s 10 milliseconds. There are no down migrations. And migrations can be arbitrary JS and are therefore not guaranteed to be atomic or idempotent, and now you’ve started a slow-motion train crash that you cannot stop. One hour of scheduled downtime becomes 18 hours. Your youth and zeal are irreversibly drained.
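One defense for your own tooling is to refuse bare numbers entirely. A minimal sketch (the accepted suffixes are an assumption; the return value is in seconds):

```python
# Hypothetical timeout parser: a value without an explicit unit is an
# error, so "10" can never silently mean 10 milliseconds.
UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}

def parse_timeout(value: str) -> float:
    for suffix, scale in UNITS.items():
        if value.endswith(suffix) and value[: -len(suffix)].isdigit():
            return int(value[: -len(suffix)]) * scale
    raise ValueError(f"timeout {value!r} must carry an explicit unit: {list(UNITS)}")
```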
Distributed storage / database systems require careful understanding of their operational characteristics if you are to operate them safely. They can be used to achieve better uptime, reliability, and possibly even lower latency if operated within their safety margins… but they also require more care and feeding than traditional databases with an authoritative primary and can be quite temperamental.
If operated incorrectly, distributed storage can silently lose data or disagree on the data they contain if nodes aren’t retired correctly or if an insufficient number of nodes remain healthy. Do you know enough about your data storage layers to operate them safely? Or when you next roll-reboot your Elasticsearch cluster will it silently eat 30% of your data for seemingly no reason at all? The customers now complaining that all their graphs are 30% too low are certainly not silent.
When you deployed new database nodes to prod, did you assume the cluster would rebalance on them? Oopsies, it didn’t! And thus when you decommissioned the old nodes, you destroyed 99% of your data in the process. There are not enough oofs in this universe to reflect this oofiness.
Caches aren’t healthy, rolling-restart instructions aren’t followed (or are insufficient), and the system fails to start.
Where to begin? Let’s start with why caches exist in the first place: to avoid repeated execution of expensive computations by storing a mapping between inputs and their results in memory (aka “caching them”). Caches will typically discard infrequently-used results automatically to make space for frequently-used results, and can be asked to drop any results that are no longer valid. How can this go wrong? Oh so many ways!
First off, just like the database schema, the format of data in the cache might not be compatible with the new version of the app. Similarly, when there is more than one application instance, the old version of the app will run alongside the new version and could see cache entries from its successor. This can cause problems in both directions: either the old version or the new version of the app could malfunction on the other’s data. Deploying a canary can thus cause all of the instances of either software version to fail.
Have your engineers thought about cross-version compatibility? Do they reject linear notions of spacetime and thus believe compatibility is a blasphemous act against the holographic principle? “Spacetime is just an abstraction,” they tell you coolly while sipping their matcha latte.9 You are tempted to remind them that money is also an abstraction and therefore they should abstain from it, too, but it’s faster if you just fix it yourself.
Second, the keys might change. If version A of an app uses one nomenclature for keys but its successor (version B) uses another, version B will operate as if the cache is empty. The app now must perform much more work to populate the cache in the new format. If both versions of the app are running simultaneously, they will fight for space in the cache – and the cache is limited in how much data it can hold by necessity. Now the cache has a lower hit ratio and more requests must go through the more costly “uncached” path.
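One common mitigation for both the format problem and the key problem is to namespace cache keys by a schema version, so incompatible entries can never collide. A minimal sketch with illustrative names:

```python
# Sketch: namespace cache keys by a schema version so old and new app versions
# never read each other's incompatible entries. Names are illustrative.
CACHE_SCHEMA_VERSION = "v2"  # bump whenever the cached value format changes

def cache_key(kind: str, ident: str) -> str:
    return f"{CACHE_SCHEMA_VERSION}:{kind}:{ident}"

assert cache_key("user", "42") == "v2:user:42"
```

Note the trade-off: bumping the version deliberately swaps corruption risk for the cold-cache cost of running two key namespaces side by side during the rollout.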
Third, a common ReCoMmEnDaTiOn is to flush caches when deploying new versions of software (“it’s a caching issue, clear your browser cache” said the frontend dev to the product manager as the PM rolled their eyes). This can be dangerous when using a shared cache since so much extra work must now be performed with every request.
With healthy cache hit ratios commonly being in the 90% range for some workloads, that means the part of the application beyond the cache must handle ten times the throughput until the cache is rebuilt. Could you handle a sudden 10x increase in your workload?
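The arithmetic is worth doing explicitly before you need it. A back-of-the-envelope sketch with illustrative numbers:

```python
# Back-of-the-envelope: how much traffic actually reaches the origin, before
# and after a cache flush. Numbers are illustrative.
def origin_rps(total_rps: float, hit_ratio: float) -> float:
    return total_rps * (1.0 - hit_ratio)

steady = origin_rps(10_000, 0.90)   # ~1,000 rps reaches the origin normally
flushed = origin_rps(10_000, 0.0)   # 10,000 rps right after a full flush
assert round(flushed / steady) == 10
```

A 90% hit ratio means the origin is sized for a tenth of real demand, which is exactly why flushing it all at once hurts.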
We make piles of thinking sand talk to each other through light and wonder why weird shit happens.
The accidental self-DoS could be due to many reasons. Maybe new versions of the application inhibit the CDN’s ability to cache, but this non-functional requirement wasn’t recorded anywhere. Maybe a new analytics feature inundates the application backend with data collected to appease the whims of product management. Maybe a new retry mechanism is being used for failed requests, causing traffic amplification if the backend becomes even a little sluggish.
The end result is the same: the new version of the app swamps the backend service and causes downtime. Engineers tirelessly work to restore service by standing up more instances or filtering the unnecessary traffic the application created for itself.
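The retry-amplification case is easy to underestimate. A simplified model, assuming independent failures and naive immediate retries where every attempt hits the backend:

```python
# Simplified model of retry amplification: clients retry failed requests
# immediately, failures are independent, and every attempt hits the backend.
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    # 1 + p + p^2 + ... : each extra attempt happens only if the last failed
    return sum(failure_rate ** i for i in range(max_retries + 1))

# A "slightly sluggish" backend failing half its requests, clients retrying 3x:
assert expected_attempts(0.5, 3) == 1.875
# As failures approach 100%, offered load approaches max_retries + 1:
assert expected_attempts(1.0, 3) == 4.0
```

The sluggish backend nearly doubles its own load, which makes it more sluggish, which raises the failure rate, which raises the load. That is the amplification spiral.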
You ask your devs what happened and they said, “Well, it didn’t work with CDN so we added cache-busting headers to make it work.” You nod quietly while gazing into the abyss.
The previous version of the app configured common static assets with a long cache duration. This caches the asset for long periods of time in CDNs and in users’ browsers. Fabulous! The app loads more quickly for users, especially those that visit frequently.
You build a new version of the app with new cached assets. The new version looks great in staging and dev, where testers are unlikely to have stale cached assets. But when you deploy it to production, you receive reports from your most fervent supporters that the app “looks weird.” It’s a Frankenstein’s monster mismatch of static assets from the old and new versions and behaves unpredictably.
Before enough understanding of what has happened filters through to the development team, all of the stale caches expire and the dev marks the JIRA ticket closed. The issue repeats again when you release the next minor redesign.
Due to the nature of CDNs and prod websites, there’s a category of people for whom this is a persistent problem and who should be able to fix it… and yet can’t. The entirely avoidable fuckup is a formidable beast.
You disconnect clients simultaneously during your deployment, leading to them all trying to reconnect simultaneously shortly thereafter. Your system was never designed to handle a flash flood of connections, so it stays down until it’s scaled manually well beyond what it was originally budgeted for.
Someone throws a ticket to add exponential backoff with randomization to the bottom of the client team’s backlog. Years pass and it happens again as their backlog only grows.
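For the record, the fix buried at the bottom of that backlog is small. A sketch of exponential backoff with “full jitter” – parameter values are illustrative:

```python
import random

# Sketch of exponential backoff with "full jitter": each client waits a random
# delay in [0, min(cap, base * 2^attempt)], so reconnects spread out instead of
# arriving as one synchronized wave. Parameter values are illustrative.
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    return random.uniform(0, min(cap, base * (2 ** attempt)))

delays = [backoff_delay(n) for n in range(8)]
assert all(0 <= d <= 60.0 for d in delays)
```

The randomization matters as much as the exponent: without it, every client computes the same delay and the flash flood simply arrives on schedule.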
Purging a CDN with a cache hit ratio of 90% results in an immediate 10x throughput increase to the origin. Did you deploy the required additional capacity?
It’s such an easy button to press, too. Some CDNs don’t put a glass case around the button nor require administrator permission to press it. Pressing it immediately grants you the rank of “rogue developer” and now you’ve given your security team a reason to require ten more hours of annual security awareness training. Your access to the secret cool kids Slack channel is purged, too.
Adjust some network config you read about on Stack Overflow and suddenly the site is down and no one has access to the systems that can bring it back up and ahhhhh. You frantically call your AWS or colo account rep to see what they can do as your mobile device buzzes incessantly.
The essence of this fuckup is that the outage locks you out of the systems which need to be accessed to resolve the outage. This can be something as simple as firewall rules or as complex as unicast BGP configurations across complicated multi-vendor networks that lock everyone out of your data centers.
A core service on which your orchestrator depends is down. You would normally use the orchestrator to deploy the service, but since the service is down, the orchestrator no longer functions. Now someone must dig out the dusty documentation on the old manual way to do this as the clock is ticking. Does the manual way even still work? Who even has access?
Elsewhere, you put Consul into the deploy path six months into its lifespan and it packet-storms itself into oblivion, taking down not only service discovery but also your ability to deploy anything or even log into nodes.
“No plan of operations reaches with any certainty beyond the first encounter with production” – Helmchart von Faultke10
It’s truly shocking how often orgs don’t have a rollback plan. But just like your mom told you about jumping off bridges, just because everyone is doing it doesn’t mean it isn’t dangerous.
There’s more than one way to handle this properly, like CD with canaries, blue / green deploys, full rollback of everything… but to not have a strategy for this at all and YOLO it? If only we gatekept less against liberal arts majors to fill this chasm of critical thinking.
A special mention goes to the untested rollback plan, too. “We have a complicated deployment that went smoothly in staging, pre-prod, and every other environment, so why would we ever need to rollback?” you say. “It can’t possibly fail in production,” you say.
You’d be correct 9 out of every 10 times… but how many times do you deploy a year, again? So, you painstakingly craft a rollback plan for your deployments but never test it, since it’s unlikely to be used. And that lack of confidence in your untested rollback plan leads to the next fuckup.
Something didn’t go as planned, so you decide to roll forward with some new plan you came up with on the spot instead of rolling back – and then something fails in the roll forward.
This is a fuckup sprouting from the “developers are optimistic by nature” problem. A deployment fails on what you believe is some minor technicality. And then you fail to resist the temptation of making a “quick fix” to patch it while on the call and build a new version of the software so your team can ship…
…But it might not be a quick fix and you’re proposing deploying something completely untested straight to production. Somehow the SRE team is okay with this, or maybe they’re hesitant but let it slide since there are already too many hills on which they must die.
Either way, you’re risking your uptime (and everyone’s stress levels) just to deploy a little earlier than you otherwise would. A worthy heuristic for this: because developers appear to be optimistic by nature, assume even the “tiniest” of hotfixes is incomplete and requires more testing.
Your system has a job queue with workers that are carefully tuned not to consume too much money and still complete their work. While the maintenance page is up, the workers are shut off. Deploying the app takes longer than expected and scheduled tasks pile up. The original pool of workers is no longer sufficient to process the backlog of scheduled tasks and people waiting on their results find your team to be insufficient.
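Queue math like this is cheap to do before the maintenance window, not after. A back-of-the-envelope sketch with illustrative rates:

```python
# Back-of-the-envelope queue math: while the maintenance page is up, the
# backlog grows at the arrival rate; afterwards it drains only at the
# surplus rate. Rates are illustrative.
def drain_minutes(arrival_per_min: float, service_per_min: float,
                  downtime_min: float) -> float:
    if service_per_min <= arrival_per_min:
        return float("inf")  # the backlog never drains
    backlog = arrival_per_min * downtime_min
    return backlog / (service_per_min - arrival_per_min)

# 100 tasks/min arriving, 120 tasks/min of worker capacity, 60 min downtime:
assert drain_minutes(100, 120, 60) == 300.0  # five more hours to catch up
```

A worker pool with only 20% headroom turns one hour of downtime into five hours of backlog, which is roughly when people start finding your team insufficient.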
Circular infra dependencies result in a particularly nefarious failure pattern. If anything in the chain ever goes down completely, it’s impossible to stand the system back up without yolo-rushing a new version of a component to break the chain. For instance, perhaps you store the latest deployed revision on your own host, which means you can’t access it when something goes wrong.
You may design your system nicely, but time inexorably marches forward without regard for your intentions. This failure is an emergent property of all the changes people make over time. It’s an iceberg failure that only emerges when another failure has already emerged and is plaguing you. That is to say, circular infra dependencies result in a particularly nefarious failure pattern…
No amount of fancy automation can truly save you from disorganized organizational processes.
Raise your hand if you’ve ever worked at a company with great internal documentation. Try to recall when you’ve ever read truly complete and up-to-date deployment documentation. For many of you (most of you, even), nothing comes to mind, right?
The closest might be a well-commented deployment script and some associated high level description. Perhaps it’s a design doc that you trust to be sort of right but cannot assuage your suspicion that the implemented system has drifted away from it. If you trust your documentation to be 100% accurate when deploying software, you’re going to have a bad time because it’s inevitable that there will be errors in it.
And because no one wants to write docs, numerous fuckups occur. You followed outdated or misleading docs on how to make the release, which fucked up the deploy. You forgot to update customer-facing docs and they configured something incorrectly and now all your other customers are suffering from the outage.
You forgot to send release notes which, wait, how is that a fuckup? Oh right, the account manager for your largest customer added in terms about releasing their requested feature by a certain date (without telling anyone in product or engineering of this, naturally) and now you’re re-negotiating their multi-year contract and giving them a serious discount to stay which is going to be difficult for your CEO to explain on the next earnings call.
People are congratulated for resolving the downtime or for catching a failure as it’s happening, but no one is rewarded for anticipating failures ahead of time.
The CEO wants things to be shipped now so everything is a rush to get half-baked features out the door quickly. But that causes quality problems elsewhere. At least half the deploys have an emergency “oh shit something is borked” follow-up deploy. And either you roll forward or the app limps along and languishes in a janky existence for the next five days until someone builds the fix and ships it.
Whoever ships the fix is lauded for restoring sanity, but it never should have been broken in the first place. And everyone knows if they had chosen to roll back, the CEO would’ve been angry because his little gamification feature wouldn’t have been there for five days. You suffer, your team suffers, your customers suffer, but bossman is happy and the bleary-eyed engineer who spent days on the recovery gets a pat on the back. Well done, naive salaryman.
Then a conference is coming up; your CEO and CMO demand a splashy announcement for it. That means your Q3 deploys are now beginning-of-Q2 deploys… which is in two weeks. You ship a ton of stuff that is half-baked and barely strung together, but the press release goes out (along with the press releases of all your competitors in an unnavigable sea of babblespeak that the market largely ignores).
The team is congratulated while the architect cries in the bathroom grieving their multiple quarters of work of carefully planned releases as support tickets now pile up with customer complaints about how features are broken. By end of year, half the features are still being “stabilized” and the other half are mothballed.
A task is rarely performed, so there’s no documentation on it. Regrettably, someone must perform the task now and today the universe has decided for that person to be you. You go to look for documentation and find nothing. You look at the code for the systems involved and it’s unintelligible. You git log the associated files and discover that everyone involved with the system has already moved on. You wonder if you should move on, too.
When disparate teams try to coordinate on rarely-performed tasks, a special sort of confusion emerges.
It’s deemed necessary for internal data analysts to be able to run queries against production data so they can serve customers and forecast future business (or other such violations of linear time). They’re granted read-only credentials to the production database because that should be sufficient. Later, you are paged because the service is down and the database is wedged.
You discover that one of the data analyst’s queries is taking up way too much memory and has locked a critical table. You kill the query, sever access, and prepare for hell in the morning. In the end, you deploy a replica so the internal teams can query production data without killing the production database. Leaders considered it too expensive to set up originally, but how expensive was the outage and all the effort which went into restoring service?
Once upon a time, you and your team decided to rewrite an app because your company’s business model changed and thus very little of it was still useful. You also didn’t like Ruby, so you decided to rewrite it in Scala because Scala was hot and everyone on the team wanted to learn Scala. Great, let’s trust our important business function to people learning a new language!
The first version of the app was supposed to be deployed alongside the Ruby version and coexist with it. That deployment failed and also caused the Ruby app to fail. Repairing that took 8 hours of downtime. Naturally, the sysadmin didn’t particularly appreciate having to stay for an extra 8 hours on a Friday because your team wanted to deploy outside of business hours.
A month later, you try again. It deployed successfully! …But the migration for the user accounts fucked up. You could use the new app, but no one had accounts for it other than the root account. A week later, you try again with a script to deploy all the user accounts – and that was successful.
Later, your team discovers the v1 of the app is very slow when actual work is done in it. So, you switch to using Cloudsearch to “optimize” part of the app. And it does! …Except Cloudsearch is eventually consistent and now users complain that when they add something to the app and click refresh, it doesn’t show up until 30 seconds later.
Your team rushes a hotfix to undo the Cloudsearch integration and restore the previous functionality. The sysadmin says no. You gave them less than a day’s notice to deploy this new version, even though your team knew about it for a week while you worked on undoing the integration. You will be lucky if you ship anything else the rest of the year now.
tl;dr the sysadmin is fed up and doesn’t trust anything your team deploys now.
Your company prides itself on being a meritocracy with a flat hierarchy, which is why senior leaders (like your boss) can disregard deploy processes – like fixing a bug directly on the production node and recompiling, then re-introducing the bug on the subsequent deploy because they never fixed the issue in tree.
This travesty is an argument in favor of making manual deployments impossible or difficult (see #69), but there’s no guarantee that any proposed safeguards would avoid veto by the Director of YOLO Engineering who is responsible for the fuckup in the first place. Because it’s never their fault, is it?
There’s also a coding variant to this fuckup: someone yolo-typing new code into a live virtual machine. They hot patch at the Erlang console because they relish living in sin. It might be called performance art if it wasn’t fated to desecrate service performance.
That anyone would be allowed to do this assuredly reflects organizational dysfunction. It is so bonkers to be able to just like, write code on a production box and expect that it works. It is a pathological level of optimism. It is suspiciously reminiscent of the Pyro in TF2 who runs around burning everyone to a crisp with a flamethrower while, from their deranged vantage, they are showering the world in glittering rainbows and bubbles and whimsy.
“Well, I’d never do that!” you say, thinking this doesn’t apply to you. And then you’d proceed to attach VisualVM to the JMX port and yolo some gc tuning. Or you’d run some exploratory bash or SQL on the prod instance to get some data without having tested it fully in a test environment. Maybe you aren’t debugging in prod, but using tracing or performance analysis tools in prod to debug problems or tune settings without having tried first in QA at the very least makes you a co-conspirator and likely a Staff YOLO Engineer (maybe even Senior Staff if you continue to do it after reading this! Don’t let your dreams be memes).
You have to scale down really quickly because your cloud credits ran out and you can no longer afford your infra… which means you were spending money you didn’t have for a long time because Papa Bezos was your sugar daddy for a bit. As you scale down in a panic, you fail to load test the new database and regret not just selling out at one of the tech giants. Now your organization has successfully reduced costs… but also revenue.
It’s trivial to mentally model your service in isolation; the rest of the world is immutable and your deployment is the only change in motion. In reality, other teams are hurling themselves at their OKRs, your sales team is onboarding new accounts, and your data integrations are data pipelines haphazardly built with popsicle sticks and glue. Like nature, the system is in constant flux and no matter how confident you are in your deploy, an unexpected shift in the system elsewhere can result in your system failing.
Maybe another team has worse deployment hygiene than you do and they yolo’d a version straight to prod without giving you a chance to integrate with it. Maybe they’re hotfixing an incident themselves and your service is collateral damage. Maybe a data partner changes their data format without announcing it (see #51) and every system in the path falls flat on its face.
It’s not your fault, but it is your problem. Scream into a pillow and sing lamentations to your pet or whatever you need to do to process your grief and move on to acceptance. Because if you want to prevail, you must be nimble and maintain the capacity to recover from unexpected failure.
The deployment may pass your tests but it can still break your business logic.
Your team finally tackles tech debt and deploys the new, shiny, streamlined version of the API. A few hours later, a partner is screaming at your CTO because they were using the API in a way you never fathomed was even possible and their integration no longer works due to your change.
Another time, you’re celebrating the successful update of the auth method in your SaaS app. It passed all tests, got approval from the security team, and nothing broke after deployment… but, as you’ll soon realize upon wading into a shit show the next morning, you forgot to tell customers about the auth method update. Everyone built access using a certain type of token and switching the service to use a new method completely broke customer access. Guess who will be blamed for lower renewal numbers this quarter?
The “funny” thing about breaking API changes is devs will often argue about what is or isn’t breaking. Semver this, semver that. The endpoint still takes the same signature and they only fixed a “bug” in the behavior of the other parameters… but what if other software was relying on that behavior? Now it’s different, and different is bad when customers rely on things staying the same.
Compliance stuff is boring but it matters. Some subtle design, layout, wording, or data retention change in a highly regulated part of the system causes it to no longer be in compliance with one of the onerous compliance regimes it must be a part of for the business to remain viable.
For instance, your payment flow changed and now you’re no longer in compliance with PCI. This remains undiscovered until much later, as most failures of this type are. If you’re unlucky it’s the auditor who discovers it and you’re now buried in paperwork. Or you erode trust by violating user expectations about how you handle their data.
You change something in a way that results in search engines or other traffic sources deranking or delisting you. Maybe it’s as subtle as borking the preview cards; sure, the links still work, but it’s no longer as clickbaity to the ever-shortening attention spans of the plebeian spectators. Congratulations, you just killed your traffic source and meal ticket!
Everyone frantically tries to figure out what is going wrong as bank accounts drain. It might not even be something you changed — sometimes giants simply roll over in their sleep and crush smaller players. But it could also be that you messed up the robots.txt and are now poor.
Deploying the system at scale is different than deploying the little test sandbox version of it.
This fuckup is so, so common. It breaks the simplified, but wrong, mental model that each user talks to one of your servers and only ever that one server. It’s a useful model because it simplifies a bunch of things and is mostly true; when it’s not true, it’s often fine to overlook the effects. But, occasionally, the effects are catastrophic and nothing behaves properly until reality settles.
Canaries and staged multi-region deploys can, by design, take a while – so your upgrade is only partially tested and deployed, resulting in an outage.
Most of the fuckups on this list are due to immature processes. But this one emerges as your processes begin to mature. Observing how your failures transform over time can elucidate your progress, a kind of mindfulness that is admittedly difficult to cultivate when feeling the crushing weight of disappointment.
You’ve had so many deployment failures in the past and every deployment has been painful. Some well-meaning person has decided that deployments need to be surveilled with hawkish intensity. Deployment frequency plummets accordingly and every deployment is a potpourri of changes that various stakeholders demand go live.
Good ol’ batch deploys take forever. People get burned out or fatigued and then naturally make mistakes. Or it’s not their component and they don’t have skin in the game11 and consequently are careless when handling it.
When failure does transpire, everyone’s frustration inflames. It’s either their component that failed and they’re frustrated at the lack of care by their peers, or it’s not their component and they’re frustrated that they have to be on this stupid Zoom call until 04:00.
The answer is probably splitting the deploys out; the only reason not to do separate deploys is likely organizational process or dysfunction (see also: Disorganized Organization).
You’ve put a ton of work into automating your deployments. The automated tooling is effective and deploys exactly what you asked of it – but what you asked of it didn’t match your expectations.
Perhaps you thought you were deploying a branch containing only a hotfix, but it was started from the wrong base branch. Or maybe you thought you were asking it to target only a few canary nodes, but accidentally rolled the whole fleet. Perhaps the automation tries its best to make all of the servers consistent by ensuring changes must be deployed in the same sequence. Whatever it was, automation ruthlessly executed your command and now you’re scrambling to recover.
In many organizations, it’s difficult to justify improving the safety and user experience of internal tools, since a confusing system “just” affects your own engineers rather than customers. The silver lining is this outage will at least make the case that developer experience is important.
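One cheap safety improvement is making the dangerous invocation the hard one. A sketch of a targeting guard for a hypothetical deploy tool – the ~5% canary budget and the confirmation-phrase format are illustrative:

```python
# Sketch of a targeting guard for a hypothetical deploy tool: anything beyond
# a canary-sized rollout (here, ~5% of the fleet) demands a typed confirmation
# phrase. The budget and phrase format are illustrative.
def confirm_targets(requested, fleet_size, confirmed_phrase=None):
    canary_budget = max(1, fleet_size // 20)
    if len(requested) > canary_budget:
        expected = f"deploy {len(requested)} hosts"
        if confirmed_phrase != expected:
            raise RuntimeError(f"refusing rollout; type {expected!r} to confirm")
    return requested

assert confirm_targets(["canary-1"], fleet_size=100) == ["canary-1"]
```

The point is not the exact threshold; it’s that “roll the whole fleet” should never be one typo away from “roll one canary.”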
Your new version operates under the assumption that the fleet is only running the new version and all instances speak the same protocol. But in reality, some hosts came back from the dead (i.e. maintenance) running an old version of the software after the deployment completed.
Now you have a zombie apocalypse on your hands with nothing to defend yourself but your laptop. You now regret choosing the ultraportable version rather than the hefty tank boi. And just like zombies, zombie hosts can sneak up on you when you least expect it, long after your deployment is complete when the post-apocalyptic landscape that is your prod environment seems almost serene.
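One defense is to stop assuming and make version compatibility explicit in the handshake. A minimal sketch – the MIN_SUPPORTED_PROTOCOL constant and handshake shape are illustrative:

```python
# Sketch: check a protocol version during the peer handshake instead of
# assuming the whole fleet runs the new code. The constant is illustrative.
MIN_SUPPORTED_PROTOCOL = 3

def accept_peer(peer_protocol: int) -> bool:
    return peer_protocol >= MIN_SUPPORTED_PROTOCOL

assert accept_peer(3)
assert not accept_peer(2)  # a zombie host back from maintenance with old code
```

Rejecting the zombie loudly at connection time beats corrupting state quietly because it spoke a protocol nobody expected to hear again.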
One fine morning, you discover you’ve run out of the specific instance type your service needs. Like, there are literally no more i3.16xlarge instances that exist for you to purchase in this universe (or possibly just the availability zone).
It turns out you are their largest customer, which, of course, the vendor never made clear for strategic reasons. Scaling beyond the capabilities of a vendor inevitably results in downtime. Either you convince the vendor to git gud or you patch to make the app creak along as you frantically build a migration path to a substitute, disrupting the roadmap in the process.
Or, on a Zoom meeting with a bloated attendee list, a dev notes that the app is slower: “I refactored the code to make it easier to read, but now it’s slower, so we need 3x the servers to run it.” You swallow bile. Lucille Bluth asks in your head, “How much could one server cost, Michael?”
If you have rollbacks, you should be fine. If you have autoscaling, you can just pay to address this problem. But nothing can help you automatically scale your tolerance to bullshit or rollback your life choices.
Scaling one part of the system puts pressure on other parts… and now they’re failing. You now must deal with an outage somewhere you weren’t expecting, all because you were proactive in anticipating capacity you’d need in the future. Worse, if that capacity is required right this millisecond, you face the dilemma of choosing which part of the system to sacrifice temporarily while you figure out how to fix the bottleneck.
Manual deploys are truly terrible. If there is a villain in the story of DevOps, it is manual deploys. They are not the serpent in the garden promising forbidden knowledge. Manual deploys are the Diablo boss that probably smells like rotten onions and toe fungus IRL and whose only purpose is to destroy any and all life.
Not convinced yet? Here are reasons A through Z to stop living in Clown Town. Each should be enough to convince you to automate at least the tedious parts of your deploys. Please, we beg you on behalf of humanity and reason, automate all the repetitive tasks you can, even if your org has an aversion to it. Humans are not meant for executing the same thing the same way every time.
An engineer walks into a bar, has two beers, and now is deploying to the entire cluster as they order a third. The bartender says, “You know, if you used an orchestrator, you could order something stronger.” That bartender’s name? Q. Burr-Netty.
Backups of the database probably don’t work. Every time you take a snapshot, it’s someone reading the docs off a DigitalOcean post on how to back up MySQL.
Copy pasta is always served with failsauce. Copying a config from an existing build to a new one, then forgetting to change the version number. Copying SSH authorized keys between machines… and if you’re managing them like that, it’s probably append-only which means your old ops people still have access to your prod servers.
Disk management as a matryoshka doll of disasters: capacity management, failing to provision enough space12, IOPS management, SAN management and all the babysitting required for distributed disks, we probably don’t need to go on.
Expiration of certificates or domains, the tech tragicomedy. You know this will happen again in a year. You see the rhino charging towards you in the distance but there’s always something more urgent to do until it’s too late.
Forget to smoke test the whole environment. You perform manual tests but they only hit the “good” servers. Luck favors the automated.
GeoDNS routing with manual region switching so you can take down a data center and update it without any traffic… but actually DNS takes a while to propagate, so you still have a trickle of traffic coming in (does anyone care that much about those lost requests?).
Handling hardware failures is nigh on impossible. Are your systems even failing over?
Improper sequence when deploying components. Just like your dance moves, the order of your deploy steps is all wrong.
Jumpbox that people use as a dumping ground for random assets they need in prod, like random JAR files or Debian packages, movies they torrent at the office that they want to get on their home machine, random database dumps that people need for various purposes…
Keen to have the deploy done, you do not wait for changes to propagate, the cache to become warm, nor the system to become healthy. “No, sir, the engineer really worth having won’t wait for anybody.” ~ F. Scoff Gitzgerald13
Lonesome server runs the wrong version because you forgot to update all the servers. Or, you forgot one region when you’re doing multi-region updates.
Mismatched component versions. It’s very easy to do when you’re slinging deploys manually and how many database servers do we have again? Is Tantalum down or decommissioned? This IP naming scheme makes no sense. Is it even a database server?
Not copying code to all of the servers, or not removing the old code from them, leading to conflicts worse than the tantrums on your executive team.
Overlook which environment you’re in. If it happens, it’s probably a process failure: it’s an easy thing to overlook, and there should be far more processes in place to stop someone from accidentally farting about in prod. Ideally, you shouldn’t even be able to make this mistake.
Provision users manually. Not only is it a pain in the ass, it is also fraught with peril.
Quarrels between IP addresses and hostnames that rival a Real Housewives reunion special.
Rotate the password or keys, but forget to update the service config with the new password. Of course you meant to update the config, but there may be numerous configs, and it’s easy to miss one if they’re not documented or automated.
Smoke tests aren’t performed after manual production deploy. If you’re doing deploys the wrong way (i.e. the manual way), smoke tests are a way to mitigate some of the issues – but you must remember to actually conduct them.
Trusting that your on-call team will be paged despite never testing the paging plan.
Updating the monitoring system is overlooked. If you autoscale, the system managing the autoscaling will self-monitor the hosts. If you add a host manually to a system that doesn’t autoscale, you probably want the system to register with the agent that’s supposed to do the monitoring.
VPN that is a single-point-of-failure and held together with duct tape and twine. The VPN is required to access the network to do the deploys but apparently making it not suck is not required.
Wait for DNS propagation? Who has time for that?
X11 and RDP-based deploys where a tired sysadmin remotely logs into the virtual desktop of a system that shouldn’t even have a graphical environment and haphazardly drags files around until the new release is live. The commands can’t even be audited because there were no commands, only mouse movements.
Your sysadmin does maintenance on the database so that it can stay up, but in the morning you discover the settings they’ve changed cause the database to no longer run its background maintenance processes and you’ve just deferred your downtime until later.
ZIP or JAR file is copied from the developer’s laptop and now you have no record of what was deployed.
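Several of the entries above boil down to not waiting for the system to become healthy. A minimal “actually wait” sketch, where the probe command, attempt count, and interval are all hypothetical knobs:

```shell
# wait_healthy PROBE [ATTEMPTS] [INTERVAL]: retry a health probe until it
# succeeds or we give up. Returns non-zero if the system never came up.
wait_healthy() {
  attempts=${2:-60}
  interval=${3:-5}
  i=0
  while ! eval "$1"; do
    i=$((i + 1))
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$interval"
  done
}

# Typical use after a deploy (the endpoint is an assumption):
# wait_healthy 'curl -fsS http://localhost:8080/healthz >/dev/null' 60 5
```

Gating the “deploy done” announcement on something like this beats declaring victory the moment the files land.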
Thank you to the following co-conspirators for their contributions to this list: C. Scott Andreas, Matthew Baltrusitis, Zac Duncan, Dr. Nicole Forsgren, Bea Hughes, Kyle Kingsbury, Toby Kohlenberg, Ben Linsay, Caitie McCaffrey, Mikhail Panchenko, Alex Rasmussen, Leif Walsh, Jordan West, and Vladimir Wolstencroft.
Enjoy this post? You might like my book, Security Chaos Engineering: Sustaining Resilience in Software and Systems (with Aaron Rinehart), available at Amazon, Bookshop, and other major retailers online.
This brings to mind Vonnegut’s advice of “Be a sadist. No matter how sweet and innocent your leading characters, make awful things happen to them—in order that the reader may see what they are made of.” ↩︎
As “Duskin” rightly noted in an investigation of a fire at an ammonia plant back in 1979: “If you depend only on well-trained operators, you may fail.” ↩︎
These sorts of seldom-used libraries are much less likely to be poisoned than the mainstream libraries which occasionally have CVEs, but infosec folks ambulance chase off them until our sanity is flattened and bloodied like roadkill. ↩︎
And why should traffic in pre-prod be as high as prod? Replaying all traffic to pre-production all the time is expensive af! So it’s a reasonable assumption, in isolation. ↩︎
but oh honey why are you performance testing an option that’s faster than what you’ll actually deploy?? ↩︎
The options are even more misleading than you might expect. `--no-cache` only inhibits the cache for layers created by the Dockerfile and does not skip the image cache. You need `--pull` to skip the image cache. ↩︎
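As a concrete invocation of the flags described above (the image tag is hypothetical):

```shell
# Re-runs every Dockerfile step, but may still build FROM a stale local base image:
docker build --no-cache -t myapp:latest .

# Also re-pulls the base image from the registry, so the build truly starts fresh:
docker build --no-cache --pull -t myapp:latest .
```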
Usually linkers order the objects they’re instructed to link by the order they’re presented. If you specify the order, you’ll always get the same order. If you have Make or whatever build system you’re using send the linker all the .o files in the directory, it will send them in the order the filesystem lists them, which can change depending on some internal filesystem properties (usually what order their metadata was last written). Usually it doesn’t matter, but maybe the code has some undefined behavior based on the layout of the code itself. Maybe there are static initializers that get run in a different order and some data structure is corrupted before the program even starts doing anything useful. ↩︎
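A minimal sketch of the difference, assuming a plain `cc` link step and hypothetical object names:

```shell
# Fragile: the linker sees the objects in whatever order find emits them,
# which can vary with internal filesystem properties:
cc -o app $(find . -name '*.o')

# Reproducible: the link order is pinned explicitly in the build rule:
cc -o app main.o parser.o net.o
```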
Action bias is a bitch. See also a recent paper I co-authored: Opportunity Cost of Action Bias in Cybersecurity Incident Response ↩︎
I did not have to come at myself this hard. (That’s what they said). ↩︎
The original quote by Helmuth von Moltke is “No plan of operations reaches with any certainty beyond the first encounter with the enemy’s main force.” from Kriegsgeschichtliche Einzelschriften (1880). It is commonly quoted as “No plan survives first contact with the enemy.” ↩︎
“Skin in the game” is such a strange idiom. It makes me think of skeletonless fleshlings flailing around on a football pitch trying to flop wobbling meatflaps at the ball. Neurotypical lingo never ceases to amaze. ↩︎
You might think that with the growth of data collection, machine learning, and other ~~flagrant privacy violations~~ business intelligence practices, data storage is the primary dimension of capacity planning. This is often not the case. In the last decade or so, capacity has grown phenomenally but throughput and latency have not kept pace. As a result, IOPS and throughput are more commonly the bottleneck that needs planning while storage capacity is overprovisioned. On the cloud, allocated throughput and IOPS are assigned based on volume size, so it’s common to see vast overprovisioning of volume size to realize sufficient IOPS. It also occurs on storage SANs, where the number and capacity of disks are selected to match the required sustained read and write rates. All of this is phenomenally complicated, but as a first approximation, IOPS and throughput matter more than storage capacity for many use cases. ↩︎
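To make the volume-size coupling concrete, here is a sketch using AWS gp2-style numbers (baseline of 3 IOPS per GiB, floored at 100 and capped at 16,000; other providers and volume types differ, so treat the function as illustrative):

```shell
# Baseline IOPS for a hypothetical gp2-style volume of a given size in GiB.
gp2_iops() {
  iops=$(($1 * 3))
  [ "$iops" -lt 100 ] && iops=100
  [ "$iops" -gt 16000 ] && iops=16000
  echo "$iops"
}

gp2_iops 20     # small volume: floored at 100 IOPS
gp2_iops 2000   # 6000 IOPS
gp2_iops 8000   # 24000 computed, capped at 16000
```

This is why you see 8 TiB volumes holding 40 GiB of data: the size buys IOPS, not storage.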
Paraphrased from Chapter 2 of This Side of Paradise by F. Scott Fitzgerald: https://www.bartleby.com/115/22.html ↩︎
I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was especially excited about was the opportunity to see Honeycomb’s incident retrospective process from the inside. I wasn’t disappointed!
The first retrospective I took part in was for our ingestion delays incident on September 8th. Our preliminary report promised that we’d post more about what happened after our investigation concluded, and the retrospective meeting that I attended was part of that work. Later on, we posted our full analysis.
Right at the start of the retrospective meeting, Fred Hebert blew my mind by reading out the Ground Rules, which I’ll paraphrase here:
There’s so much to love in this intro! I’ve been learning about these concepts for years and trying to slowly incorporate them into the incident retrospective culture around me. I was pleasantly surprised to hear that these ideas were already firmly instilled in Honeycomb’s culture.
Let’s look at the ground rules in a little more detail to find out why.
I first came across this concept in the Etsy Debriefing Facilitation Guide, and since its publication, I’ve watched long-standing best practices shift toward an emphasis on learning versus action items. The Howie guide for post-incident analysis by Jeli is another example of an incident analysis framework that embodies this idea.
I have to admit, my thinking on this topic has changed over the past few years. Heck, I co-led an entire conference session on running incident retrospectives that held remediation items as the main goal. However, I now see that we learn so much more when learning is the focus. Searching for remediation items actively gets in the way.
“Why did it make sense to make that decision?“ Ask this question in an incident review and you’ll learn more about your sociotechnical system. This one question sets the tone, making those involved in the incident feel safer because they know that everyone is assuming they made the best choice they could at the time based on the information they had.
It’s worth noting that we don’t say “blameless” directly. Instead, we use “blame-aware.” It’s okay to talk about who did something, provided that the discussion is sanctionless; no one is going to be punished for decisions they made in good faith.
In an incident review, a counterfactual question asks, “What should we have done?” This kind of question is dangerous because it conjures up a reality that did not exist. In the process, it brings undertones of blame that will engender defensiveness and stifle the investigation. By phrasing our questions in the form of how we can act in the future, we acknowledge the reality that everyone did the best they could during the incident.
Finally, the ground rules encourage asking questions, even if the answer seems obvious. An incident review is about finding out where our mental models of the system broke down—and bringing those models closer into alignment with the way the system actually works. Everyone’s model is an approximation, and a different one at that. Your question helps you improve your mental model, and almost certainly will help someone else too. Ask it!
Creating and publishing the ground rules for incident investigations is the first part, but that’s not enough. I experienced firsthand how important it is to read them aloud before every single retrospective meeting.
In any meeting, chances are there’s someone new who hasn’t heard the rules before. For those who’ve heard them before, it provides an important reminder that tone and mindset are critical to promote learning as much as we can from each incident. The end result is to create an inviting learning environment where everyone feels safe contributing and we all get to learn together as a group.
I’ll end this article by telling you that we’re hiring! If the culture at Honeycomb sounds like it’d be a good fit for you, check out our careers page and see if there’s a match for you.
The post The Incident Retrospective Ground Rules appeared first on Honeycomb.
They’re everywhere. In Slack: “hey, can I get a review on this?” In email: “Your review is requested!” In JIRA: “8 user stories In-Progress” (but code-complete). In your repository: 5 open pull requests. They’re slowing your delivery. They’re interrupting your developers.
We could blame the people. We could nag them more. We could even automate the nagging!
Let’s face it: nobody wants to review pull requests. And for good reasons! It takes a lot of time and work. Chelsea Troy describes how to do pull request review right:
In addition to pulling down, running, and modifying the code myself… A maximally effective pull request suggests solutions…in code. It points out what’s working and what’s not, and links to documentation where useful. It highlights laudable work by the original developer, and asks questions before making assumptions or judgments. It explains the reasoning behind suggestions, whether that reasoning affects functionality or adheres to convention. In short, it demands the reviewer’s full participation in finishing the solution that the original developer started. And it prepares the reviewer to take responsibility for this code in the event that the original developer were unable to complete it.

Reviewing Pull Requests – Chelsea Troy
Reviews, done right, have all the painful parts of a software change: understanding what the change is for, loading up the relevant code and tests into working memory, getting my local environment up to see the change, making the tests run. They have none of the fun parts: refactoring to clarity, changing code and seeing a difference. They take hours of time and all my concentration away from whatever it is that I’m personally trying to do.
On top of that, they’re a social interaction minefield! This variable name confused me at first but now I see why they called it that. Should I suggest a change, and require the other developer to do a whole context switch again to improve it? Probably an asshole move. This test doesn’t cover all the cases; I can see one that’s missing. Request another, like the pedant I am? or figure out how to write it myself, adding another hour?
There’s a cost to every comment, a cost to the submitter’s sense of belonging. A responsible reviewer looks at consequences far beyond the code.
Of course I never want to review pull requests. It’s mentally taxing, takes a lot of time, might damage relationships, and gets me nowhere on the task that has my name on it.
So the twitterverse is asking, how do we get people to do it anyway?
If this is what we’re asking, maybe something is wrong with our priorities.
What does it say about us that no one wants to review pull requests?
Maybe it says that we trust each other.
Maybe it says that our team has too many concurrent tasks. And by “too many” I mean “more than one”!
What is our goal with this pull request process? There are several, but I think the primary one is: safe, understandable code. It looks safe to deploy, and it is clear enough to be understood by the rest of the team. Tests can give us confidence in safety, but only a person can evaluate “understandable.”
To change code, a developer first has to understand the code, and understand the change. If the developer was the last person to change this code, then they just have to load it into memory. They’ve understood it before. This should also be true if they reviewed that last change — pull request review spreads that understanding a bit.
A developer gathers this knowledge, then uses it to make decisions about the code. They probably iterate on it a few times, and then they submit something they consider safe and understandable.
But is that code really safe and understandable?!? We must ensure it! Let’s add this whole process again, except the decisions are approval instead of what to change. We’ll make this asynchronous, yeah, so the submitter can start a whole different task. And if the decision is “no” then we’ll make another asynchronous task and everybody can context switch again!
This defies everything we know about product development flow. We just increased WIP and slowed our response time by adding a wait into the process (at least one wait, really an indeterminate number).
Like Patricia said, maybe this process developed for open-source projects isn’t the best for our in-house teams. Maybe there are better ways to work together.
The pull request process results in code that two people understand. What if we aimed higher?
How about: the team makes all code changes as a unit. Ensemble working (the practice formerly known as mob programming), with one shared work product and all the shared knowledge. It will be as safe as everyone can make it, and more than understandable: it’ll be understood by the whole team.
Not every team member will be present every day. Let’s take a page from distributed systems and require a quorum of team members present when we make code changes. At least 2 developers on a team of 3, at least 3 on a team of 5, etc. That way, whenever it’s time to change that code again, someone present was involved in the most recent change.
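The quorum rule above is just a majority, floor(n/2) + 1. A one-liner makes it explicit (the function name is mine):

```shell
# Smallest number of team members that still constitutes a majority.
quorum() { echo $(($1 / 2 + 1)); }

quorum 3   # 2
quorum 5   # 3
```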
Then there are no queues or waiting, only collaborating on getting the best name, the complete-enough test suite. Every refactor increases the whole team’s understanding of the code. The team develops a common understanding of the code and where it is going, so they can do gradual improvements in a consistent direction.
Does that sound inefficient? Consider the inefficiencies in the queuing for pull requests, the task switches. Not to mention the merge conflicts we get after the pull request sits open for days.
Does it sound wasteful? All that programmer-time dedicated to just one task, when we could be doing three! Well, ask: which of those three is the most important? Why not get that out as quickly as possible and then work on the others? And it is faster, when you never have to ask permission or wait for answers because all the relevant knowledge is right there. (It helps to bring in other people too, when you need knowledge from an adjacent team or specialist.)
Does it sound miserable? Many people hate pair programming; this sounds even worse. Strangely though, it’s better. When there are three or more in a session, there’s less pressure to stare at the screen every second. One person’s attention can wander while the group attention stays. A person can go to the bathroom or answer an urgent question on Slack, while the ensemble remains an ensemble. Pair programming is more exhausting.
Does this seem like an all-day meeting? No, only when we’re changing code. There’s a lot more we do in a day. There’s still email! Each of us has knowledge to acquire and knowledge to share with other teams. I only have six hours of focused brainpower in me each day. I’d aim for five hours of direct collaboration, and not change production code outside of it.
Does this seem impossible when the team is remote? It is harder. Set up a shared development environment that everyone can connect to for switching drivers. Or start a branch and use git to move code around. Turn your video on, but set up a screen and camera over to the side, so that looking at each other is different from looking at the code. Staring at each other is draining. Working alongside each other is invigorating.
Is your team too large for this? It does get ridiculous with 8-12 people in one meeting. That’s a smell: either your application is too big (it takes that much knowledge); can you split it? Or, someone thought adding people would speed the work. This is a classic Mythical Man-Month problem.
When working together eliminates all the coordination work and merge pain, the team can be smaller and more responsive.
When we divide tasks among people, we can say “we’re working on it” about several things at once. Is that something your organization wants? If so, that expectation is holding your team back from focus. If this is the organizational API you need to meet, try marking five tasks “in progress” in JIRA, then working one at a time together.
The team works most smoothly as a unit. Production software needs a team behind it because so much knowledge is required: the purpose of the software, its customers, its interfaces, all the tech it runs on and the data it stores and all the changes in the world (such as vulnerabilities) that it needs to respond to. It takes several people to hold all this, with redundancy. To change software safely, combine all that knowledge. We can do this efficiently together, or painfully alone: asynchronous, with a lot of coordination and unpredictably stepping on each other.
We know that code review improves outcomes – compared to coding alone without any review. Don’t do that. Do code together, with constant, live review and growing understanding between the team members and the code, between the team members and each other.
Leave the pull requests for collections of individuals sharing a codebase. Give me direct collaboration on my team.