Recovering from Failure
“Anyone who has never made a mistake has never tried anything new” — Albert Einstein
It’s within all of us to epically fail at least once in our lives. For most of us, failure is how we learn and adapt to the world around us. It’s what teaches us how to grow and become better versions of ourselves. Unfortunately, for technology companies failure has become a bit of a bad word. It’s not so much the failure itself that frightens us; it’s the negative consequences associated with failure that terrify us. The obvious response to a failure in the technology arena is to erect barriers designed to save us from failing again. However, those barriers eventually hinder our ability to innovate.
Some of the more notable business failures that have caught my attention in recent years include:
- The Obamacare website (HealthCare.gov)
- Apple Pencil (Seriously? This was a silly idea from the beginning)
- Amazon’s AWS outages
In our personal lives we try to save money for rainy days, we purchase auto insurance to cover us “just in case”, we have homeowners insurance, health insurance, life insurance and insurance to insure the insurance… all to make sure we are protected. It would seem we work very hard to limit the blast radius of failure and catastrophe within our personal lives. So… how can we do this in software development and technology?
This got me thinking about how to mitigate the risk of catastrophic failure in the software systems we create. How can we reduce the impact of a given failure? So I asked myself: Are there ways we can fail strategically instead of clumsily? Can we limit the blast radius of a given failure? Can we lower the cost associated with failure?
Limiting the blast radius of failure within the software stack and live environments:
In this section, I want to explore how we can mitigate failure and lower the risk of catastrophic failures within a software solution. What tactics can we employ preemptively to make our software solutions more resilient?
1. Identify and add redundancy around Single Points of Failure [SPOF]:
This tactic has been around for quite some time. It’s reasonably effective and simply involves a bit of up-front planning. Basically, you work to identify components and devices that exist as a single instance, with nothing to take over if they go down, and add redundancy to them (duplicate them). This ensures that if one of those components were to fail, there would be additional ones available to take the place of the failed one.
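To make the idea concrete, here is a minimal sketch of failing over between redundant copies of a component. The names (`call_with_failover`, `broken_replica`, `healthy_replica`) are made up for illustration, and a real setup would more likely rely on a load balancer or cluster manager than on hand-rolled retry logic.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica failed to handle the call."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant replica in order and return the first successful reply."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is down; remember the error and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed for request {request!r}")


# Hypothetical replicas: the first simulates a failed component,
# the second is the duplicate that takes its place.
def broken_replica(request: str) -> str:
    raise ConnectionError("primary is unreachable")


def healthy_replica(request: str) -> str:
    return f"handled {request}"


print(call_with_failover([broken_replica, healthy_replica], "ping"))  # prints "handled ping"
```

With only `broken_replica` in the list, the call raises `AllReplicasFailed`, which is exactly the single point of failure this tactic is meant to remove.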
2. Modularize the architecture of the software and add standards to how those modules are created and communicate with one another.
Back in the day, software architecture was considered a highly complex art that only a few could do properly. Components were intertwined, communication between the components created highly complex dependency trees, and the end result was a HUGE ball of mud that couldn’t be untangled. Enter modularization:
The technology industry (for the most part) has since begun modularizing components and defining architectures that are more scalable, interchangeable, and resilient. A HUGE step in the right direction!
Now we have micro-service architectures, containerized infrastructure, and container orchestration solutions with service discovery that provide highly scalable ways to develop and release software.
Developing micro-service standards (similar to SOA standards) within an organization helps define how services should communicate with each other. Such standards also ensure that each service’s datastore schema is respected and not directly depended upon by other services. Amazon put forth this requirement when initially rolling out their Service Oriented Architecture proposals back in 2005. It’s worked fairly well for their organization ever since. Let’s look at some micro-services oriented best practices:
– Each service should have a public or external facing API and that API should be versioned (public doesn’t always mean on the public internet)
– Each service should have its own database and NOT rely on a central database that other services use UNLESS an API layer is created to abstract the data setting and retrieval process (look at Netflix’s API and RailCar solution for Cassandra)
– Communication between services should be done via the API
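As a rough illustration of the practices above, the sketch below shows one service exposing a versioned API and another service talking to it only through that API. `UserService`, `BillingService`, the `"v1"`/`"v2"` labels and the response shapes are all hypothetical, and plain method calls stand in for what would normally be HTTP or RPC requests.

```python
from typing import Callable, Dict


class UserService:
    """Hypothetical service that owns its data and exposes a versioned API."""

    def __init__(self) -> None:
        # Private datastore: no other service reads this directly.
        self._users = {1: {"first": "Ada", "last": "Lovelace"}}
        self._handlers: Dict[str, Callable[[int], dict]] = {
            "v1": self._get_user_v1,
            "v2": self._get_user_v2,
        }

    def _get_user_v1(self, user_id: int) -> dict:
        user = self._users[user_id]
        return {"name": f"{user['first']} {user['last']}"}

    def _get_user_v2(self, user_id: int) -> dict:
        # v2 changes the response shape; v1 keeps working for older callers.
        user = self._users[user_id]
        return {"first_name": user["first"], "last_name": user["last"]}

    def get_user(self, version: str, user_id: int) -> dict:
        """Public entry point: dispatch to the handler for the requested version."""
        if version not in self._handlers:
            raise ValueError(f"unsupported API version: {version}")
        return self._handlers[version](user_id)


class BillingService:
    """Hypothetical consumer that only uses the public API of UserService."""

    def __init__(self, users: UserService) -> None:
        self._users = users

    def invoice_header(self, user_id: int) -> str:
        return "Invoice for " + self._users.get_user("v1", user_id)["name"]


print(BillingService(UserService()).invoice_header(1))  # prints "Invoice for Ada Lovelace"
```

Because `BillingService` pins itself to `"v1"`, the user service can ship `"v2"` whenever it likes without forcing a coordinated release.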
Below is an example of this type of architecture (monolith vs micro-service):
3. Remove the database schema as a hinge to development work and releases:
One of the big mistakes I’ve seen in software architecture is a dependency on other components’ database schemas. Here is how this problem occurs. Component A and Component B rely on the same database (say a central Cassandra instance). Each relies on a specific table and set of columns in that table. As these components evolve, they both want to add, remove, or change their use of the tables, BUT since they rely directly on each other’s tables, every schema change has to be coordinated across both components. That makes releasing an update to the system more monolithic than modularized.
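Here is a small sketch of what removing that hinge can look like, using an in-memory SQLite database purely for convenience. `InventoryService` plays the role of Component A and `can_ship` the role of Component B; both names, the table layout and the migration are invented for the example.

```python
import sqlite3


class InventoryService:
    """Hypothetical Component A: the only code allowed to touch its tables."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE items (sku TEXT PRIMARY KEY, qty INTEGER)")
        self._db.execute("INSERT INTO items VALUES ('widget', 7)")
        self._stock_query = "SELECT qty FROM items WHERE sku = ?"

    def stock_level(self, sku: str) -> int:
        """Public API: callers never see table or column names."""
        row = self._db.execute(self._stock_query, (sku,)).fetchone()
        return row[0] if row else 0

    def migrate_schema(self) -> None:
        """Internal schema change: new table and column names, same API."""
        self._db.execute(
            "CREATE TABLE stock (sku TEXT PRIMARY KEY, quantity_on_hand INTEGER)"
        )
        self._db.execute("INSERT INTO stock SELECT sku, qty FROM items")
        self._db.execute("DROP TABLE items")
        self._stock_query = "SELECT quantity_on_hand FROM stock WHERE sku = ?"


def can_ship(inventory: InventoryService, sku: str, wanted: int) -> bool:
    """Hypothetical Component B: depends on the API, not on the schema."""
    return inventory.stock_level(sku) >= wanted


inventory = InventoryService()
before = can_ship(inventory, "widget", 5)
inventory.migrate_schema()
after = can_ship(inventory, "widget", 5)
print(before, after)  # prints "True True"
```

Component A renamed its table and column, and Component B never noticed, so the two can be released independently.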
Below is a graphic of how to avoid schema dependencies:
4. Deploy small easily debuggable chunks of code at a time:
Another tip within the realm of mitigating failure is to make small changes to the software system and release those small changes frequently. In many circles this is known as Continuous Deployment. It makes tracing what changed when something goes wrong SIGNIFICANTLY easier. It also requires more communication and collaboration when developing and releasing code, because it makes branches virtually useless. If you deploy 100 times a day in tiny 10-line increments, you can easily figure out which of those deployments and code lines broke something. Conversely, if you deploy once every 6 months, who knows what actually broke? Let’s face it: you deployed a million lines of code.
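One way to picture why small deployments are easier to debug: with an ordered log of tiny releases, the breaking one can be found with a simple binary search. The sketch below assumes that once a deployment breaks something, every later deployment is also unhealthy until a fix ships; `first_bad_deployment` and the `deploy-N` labels are hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_bad_deployment(
    deployments: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Binary-search an ordered deployment log for the first unhealthy release.

    Assumes every deployment after the breaking one is also unhealthy.
    Returns None when all deployments are healthy.
    """
    lo, hi = 0, len(deployments)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(deployments[mid]):
            lo = mid + 1  # the break happened later
        else:
            hi = mid  # the break is here or earlier
    return deployments[lo] if lo < len(deployments) else None


# 100 small deployments in one day; pretend number 63 introduced the bug.
log = [f"deploy-{i}" for i in range(100)]
culprit = first_bad_deployment(log, lambda d: int(d.split("-")[1]) < 63)
print(culprit)  # prints "deploy-63"
```

Seven or so health checks narrow 100 deployments down to one, and that one contains only a handful of changed lines. With a single million-line release there is nothing to search through.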
5. Leverage A/B testing techniques and Canary Deployments
A/B testing is an excellent way to determine whether the target audience enjoys a feature or implementation. It basically revolves around showing different users different variants of a feature, collecting metrics on how each variant is used, and letting the business pivot development toward the more popular one. Here is a VERY simplistic example of A/B testing.
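A bare-bones sketch, assuming users are split 50/50 by hashing their ID and that "conversion" is whatever metric the business cares about. `assign_variant`, `ExperimentMetrics` and the experiment name are illustrative, not taken from any particular A/B testing product.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically place a user in variant A or B for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


class ExperimentMetrics:
    """Counts how often each variant was shown and how often it converted."""

    def __init__(self) -> None:
        self.exposures = Counter()
        self.conversions = Counter()

    def record(self, variant: str, converted: bool) -> None:
        self.exposures[variant] += 1
        if converted:
            self.conversions[variant] += 1

    def conversion_rate(self, variant: str) -> float:
        shown = self.exposures[variant]
        return self.conversions[variant] / shown if shown else 0.0


metrics = ExperimentMetrics()
variant = assign_variant("user-42", "new-checkout-button")
metrics.record(variant, converted=True)
print(variant, metrics.conversion_rate(variant))
```

Hashing the user ID (instead of picking randomly on every request) keeps each user in the same variant across visits, which keeps the collected metrics comparable.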
Canary deployments are very similar in nature. A canary deployment simply means rolling a feature out gradually to a subset of users. This can be coupled with A/B testing to ensure a broken feature ONLY impacts that subset. AND because of the routing logic and metrics gathering, the company will know that the failure is present. Nice, right?
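The gradual rollout can be sketched roughly like this. The bucket scheme, the 1% error threshold and the 10-point step are arbitrary assumptions for illustration; `in_canary` and `next_rollout_percent` are hypothetical helpers.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Place each user in a stable bucket 0-99; low buckets get the new feature."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_percent


def next_rollout_percent(
    current: int,
    canary_error_rate: float,
    max_error_rate: float = 0.01,
    step: int = 10,
) -> int:
    """Widen the rollout while the canary looks healthy; pull it back otherwise."""
    if canary_error_rate > max_error_rate:
        return 0  # failure detected: stop exposing users to the feature
    return min(100, current + step)


percent = 10
percent = next_rollout_percent(percent, canary_error_rate=0.002)
print(percent)  # prints "20"
percent = next_rollout_percent(percent, canary_error_rate=0.2)
print(percent)  # prints "0"
```

Since buckets are stable, a user who is in the canary at 10% is still in it at 20%, so widening the rollout never flips anyone back and forth.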
6. Blue / Green or Red / Black deployments:
This technique is really about how to deploy upgrades. It involves rolling out a new copy of a component alongside the one already in production, while leaving the current live one in place. Once the release is deemed viable, traffic is flipped from the old instance to the new one. If things break down, traffic can then be re-routed back to the old instance. Below is an example:
The above list isn’t by any means 100% complete, BUT it should help provide a cursory overview of how to lower the risk of failure. The key takeaway is that instead of running from failure and hiding, we should learn from it and do things in a more intelligent way the next time. Break things into smaller chunks, bring the pain points forward, and deploy more frequently. By doing this we can increase resiliency and get to a mundane release process without major hiccups.
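A toy sketch of the traffic flip, assuming two environments named "blue" and "green" and a health check supplied by the caller. `BlueGreenRouter` and `release` are hypothetical; in practice the flip usually happens at a load balancer or DNS level.

```python
from typing import Callable


class BlueGreenRouter:
    """Routes all traffic to the live environment; the idle one holds the other release."""

    def __init__(self, live: str, idle: str) -> None:
        self.live = live
        self.idle = idle

    def switch(self) -> None:
        """Flip traffic between the two environments (also used to roll back)."""
        self.live, self.idle = self.idle, self.live

    def route(self, request: str) -> str:
        return f"{self.live} handled {request}"


def release(router: BlueGreenRouter, is_healthy: Callable[[str], bool]) -> bool:
    """Flip traffic to the idle environment only if it is deemed viable."""
    if not is_healthy(router.idle):
        return False  # new release is not viable; the old one stays live
    router.switch()
    return True


router = BlueGreenRouter(live="blue", idle="green")
release(router, is_healthy=lambda env: True)
print(router.route("GET /"))  # prints "green handled GET /"
router.switch()  # things broke down: re-route to the old instance
print(router.route("GET /"))  # prints "blue handled GET /"
```

The old environment is never torn down during the release, which is what makes the rollback a one-line operation.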