Modern Rollback Strategies

Bob Walker • October 2, 2023 • 18 min read

Rollbacks are the escape hatch everyone wants.

Harmful code made it to production. The website refuses to start. Users get error messages when browsing. It’s time to roll back. Yet, the rollback scripts weren’t tested. You last rolled back 4 years ago. The last 20 production deployments took less than 30 minutes. Will this rollback work? You don’t know.

In this post, I walk through 3 modern rollback strategies based on recovery time. Recovery time is how long it takes to go from “problem found” to “restored.”

A brief primer on rollbacks

Before getting started, let’s define rollbacks.

A rollback is reverting to a previous version of the application.

There is nuance in reverting to an application’s previous version. The reversion could involve redeploying the previous version. Or, the reversion could update a load balancer. The strategies presented in this post will cover both.

Why roll back?

There are many reasons to roll back.

The application won’t start
Users can’t sign in or use the application
Incorrect username, password, or token for database authentication
Expired credentials
Missing or incorrect configuration entries
During verification, testers find a “showstopping” bug
You need to deploy one or more external services on schedule
The database migration scripts failed

That is not an exhaustive list. I’m sure you have many more reasons why you’ve rolled back.

Rollbacks can be a symptom of a bigger problem

Many of the reasons listed above are a result of human error. That occurs when the software delivery pipeline relies on too much manual intervention. Manually building code, deploying build artifacts, and verifying changes is a recipe for disaster. I can’t count how often a configuration problem turned out to be the result of human error.

I recommend you follow the principles in Dave Farley and Jez Humble’s book Continuous Delivery. Doing so will force you to automate many manual tasks. Also, you’ll follow the same deployment process for all environments. That will prevent many of the reasons to roll back.

Rollback risks

No rollback is risk-free. Part of the rollback process is deciding to roll back. Often, it’s a question of which option is riskier. Sometimes, a roll-forward is less risky than a rollback. Other times, a rollback is less risky than a roll-forward.

Even if your deployment schedule is once a week, each deployment will have a mix of fixes and features. The fixes could squash a critical bug or close a security vulnerability. You cannot choose to roll back a specific change when rolling back. All the changes roll back, or nothing rolls back.

There’s no magic bullet or one-size-fits-all solution

I often talk to prospective customers looking for an “easy” rollback solution. They want a tool or process that will immediately roll back their application. A process that covers all their rollback scenarios with a minimal amount of work.

Such a solution does not exist.

The 10-minute strategy detailed below should take less than a day to configure. It covers the majority of rollback scenarios. Having an automated deployment process removes many of the reasons rollbacks occur. But that strategy doesn’t cover every possible rollback scenario.

Implementing an immediate rollback strategy for all scenarios requires architectural and process changes. It goes beyond the deployment tool.

I’m not referring to deployment process changes. An immediate rollback strategy changes how you make code and database changes in your application. It requires a robust automated suite of tests. That all takes time and money to create.

Don’t roll back the database

The database is a critical component of your application. It stores all your user data. Unless a catastrophic event happens, data loss is unacceptable. Data loss has real-world impacts. Imagine losing a lifesaving prescription from a patient database in a hospital. Or a loan entry to save a family farm.

There are 2 ways to roll back a database—both of which will likely result in data loss:

Run rollback scripts
Restore a database backup

Rollback script pitfalls

Missing or corrupted data impacts your users. Relational databases are responsible for referential integrity. The code base handles business rules.

Despite that, we write rollback scripts to remove or manipulate data. For example, you add a column to an existing table. You need to roll back that change. The rollback script removes that added column. Simple. But, that script may not take into account nuances.

Was there any data in that column? If so, who added it, the users or the migration script?
Will users have to re-key the data after that column is re-added?
Is that column used as a foreign key to other tables?
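To make the first two questions concrete, here is a minimal sketch using Python’s built-in sqlite3 module and an in-memory database. The `customers` table, the `phone` column, and the sample data are all hypothetical, and the rollback removes the column by rebuilding the table, which is only one of several ways a rollback script might do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers (name) VALUES ('Ada')")

# Migration: add a column. A user then enters data into it.
conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")
conn.execute("UPDATE customers SET phone = '555-0100' WHERE name = 'Ada'")

# Rollback script: remove the column by rebuilding the table without it.
conn.executescript("""
    CREATE TABLE customers_old (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO customers_old (id, name) SELECT id, name FROM customers;
    DROP TABLE customers;
    ALTER TABLE customers_old RENAME TO customers;
""")

# Later, the fixed release re-adds the column.
conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")

phone_after = conn.execute(
    "SELECT phone FROM customers WHERE name = 'Ada'"
).fetchone()[0]
print(phone_after)  # None: the value the user entered is gone
```

The script did exactly what it was written to do, and the user still has to re-key the data.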

Restore database backup pitfalls

A database backup has a finite useful lifespan. A database backup from 15 minutes ago is much more useful than one 24 hours old.

Restoring a database backup will result in data loss. The restoration will remove any data changed since that backup. If you’re lucky, no users used the application since the database backup creation. But luck isn’t a strategy.

Restore databases as a last resort

Restoring database backups should be a last resort. It’s not the default option. Only do it when a catastrophic error occurs and there’s no other option. For example, when:

Several disks on a SAN fail
A hurricane hits the data center hosting the database
The database file gets corrupted due to a bug

Take the database out of the rollback process.

The 3 rollback strategies below rely on never rolling back a database. They rely on every database change being a roll forward.

The risk of data loss or something going wrong is too high. With enough time and effort, you can mitigate those risks. But how often do you plan on rolling back? It should be an exception, not the norm. You’ll end up spending a tremendous amount of time and money working on a script that will never run. That time and effort could be better spent on improving the rollout process.

Taking the database out of the rollback process makes rollbacks more workable. You’ll still have challenges, but they won’t be “data lost” challenges.

10-minute recovery rollback strategy

The 10-minute recovery rollback strategy redeploys the previous application version. But during the redeployment, it skips any database steps.

Skipping steps with the 10 minute overview strategy

How it works

The post-deployment verification discovers a catastrophic problem. Should you roll back? Should you roll forward? Before deciding, review the database migration scripts.

No schema changes = Safe to roll back.
Existing column added to a select stored procedure or view = Safe to roll back.
New nullable columns added = Developers must review before rolling back.
Significant schema changes = Unsafe to roll back.
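The checklist above can be sketched as a small lookup that returns the most cautious verdict across all changes in a release. This is an illustration only: the change labels, the `Verdict` enum, and the choice to treat unrecognized changes as unsafe are assumptions, and someone still has to read the migration scripts to classify each change.

```python
from enum import IntEnum


class Verdict(IntEnum):
    SAFE = 0
    NEEDS_REVIEW = 1
    UNSAFE = 2


# Hypothetical labels a reviewer assigns to each migration script.
RULES = {
    "existing_column_added_to_proc_or_view": Verdict.SAFE,
    "new_nullable_column": Verdict.NEEDS_REVIEW,
    "significant_schema_change": Verdict.UNSAFE,
}


def rollback_verdict(change_kinds):
    """Return the most cautious verdict for a release's database changes.

    An empty list means no schema changes, which is safe.
    Unrecognized labels are treated as unsafe (an assumption).
    """
    verdicts = (RULES.get(kind, Verdict.UNSAFE) for kind in change_kinds)
    return max(verdicts, default=Verdict.SAFE)


print(rollback_verdict([]).name)
print(rollback_verdict(["existing_column_added_to_proc_or_view",
                        "new_nullable_column"]).name)
```

Taking the maximum means a single risky change decides the outcome for the whole release, which matches the all-or-nothing nature of a rollback.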

The next decision is how long it will take to fix the issue. Is it a configuration issue? Is it an undiscovered bug?

Assuming it’s safe to roll back, you redeploy the previous code version. The deployment process detects it’s a rollback and skips the database steps.
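As a rough illustration of what “detects it’s a rollback” can mean, the sketch below compares the version being deployed with the version currently in the environment. It assumes simple dot-separated numeric versions; the function names and the step list are hypothetical and don’t reflect how any particular deployment tool implements this.

```python
def parse_version(version):
    """Turn '2023.10.2' into (2023, 10, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def deployment_mode(new_version, current_version):
    """Classify a deployment by comparing it to what is already deployed."""
    if current_version is None:
        return "Deploy"  # nothing deployed yet
    new, current = parse_version(new_version), parse_version(current_version)
    if new > current:
        return "Deploy"
    if new < current:
        return "Rollback"
    return "Redeploy"


# Hypothetical process: (step name, whether it is a database step).
steps = [
    ("Run database migrations", True),
    ("Deploy website", False),
    ("Verify deployment", False),
]

mode = deployment_mode("1.4.0", "1.5.0")
steps_to_run = [name for name, is_db in steps if not (is_db and mode == "Rollback")]
print(mode, steps_to_run)
```

Deploying 1.4.0 over 1.5.0 is classified as a rollback, so only the non-database steps remain.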

Octopus Deploy configuration

Of the 3 rollback strategies, this requires the most configuration.

Step 1. Add the community library step template Calculate Deployment Mode to your process.

Step 2. Move that step to the start of the deployment process.

Adding the calculate deployment mode step.

Step 3. Set the run condition to Variable for each step to skip during rollback. Set the value to: #{unless Octopus.Deployment.Error}#{Octopus.Action[Calculate Deployment Mode].Output.RunOnDeploy}#{/unless}. I’m skipping steps 2, 3, 4, 5, and 6 for my example application.
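The expression in Step 3 reads as: unless the deployment has an error, output the RunOnDeploy value from the Calculate Deployment Mode step; otherwise output nothing. The plain-Python sketch below mimics that logic so you can reason about it. It is not Octopus Deploy’s evaluator, and treating only the text “True” as “run the step” is an assumption made for the illustration.

```python
def render_run_condition(deployment_error, run_on_deploy):
    """Mimic: #{unless Octopus.Deployment.Error}<RunOnDeploy output>#{/unless}"""
    if deployment_error:
        return ""  # an earlier step failed, so the block renders nothing
    return run_on_deploy


def step_runs(rendered):
    """Assumption for this sketch: only the text 'True' means 'run the step'."""
    return rendered.strip().lower() == "true"


# Normal deployment, no errors: the database step runs.
print(step_runs(render_run_condition("", "True")))
# Rollback: the mode step reports RunOnDeploy as False, so the step is skipped.
print(step_runs(render_run_condition("", "False")))
# An earlier step failed: the step is skipped regardless of mode.
print(step_runs(render_run_condition("Step 1 failed", "True")))
```

The `unless` wrapper matters because a variable run condition replaces the default “only run if previous steps succeeded” behavior, so the error check has to be part of the expression.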

Adding the variable run condition to each database step

Step 4. Test the changes. Create a release and deploy it to a test environment. Because this is a normal deployment, you’ll see the steps you configured to skip on rollback run as usual.

Step 5. Create another release and deploy it to the same environment.

Step 6. Redeploy the first release.

Redeploying the previous release via the overflow menu.

Step 7. In the deployment, you’ll see all the database steps skipped.

All the database steps are skipped on rollback.

Recovery time explained

The redeployment will not take long unless you have a complex process. That’s because you are skipping the most time-consuming steps. It’s common to see a redeployment take one-third to half of the time of a normal deployment.

You spend the majority of the recovery time reviewing database migration scripts. Is it safe to roll back? Would a rollback introduce too much risk and too many unknowns?

Who the strategy is for

Anyone not doing rollbacks today should consider this strategy first.

You only need to change the deployment process, not how you make database changes.
With minimal effort, you have covered most use cases.

After I implemented automated deployments, the vast majority of rollback requests came from QA. Development and test environments are unstable during normal software development. An unexpected bug can slip through despite pull requests, code reviews, and automated tests.

There are far more code changes than database changes. Especially in development and testing environments. It’s common to see 20 code check-ins for every database check-in. There’s a high likelihood you can roll back a showstopping bug.

Downsides and pitfalls

The primary downside of this strategy is you can only roll back in some situations. A breaking schema change prevents rollbacks. The only course of recovery is to roll forward. That could take hours or days.

Deployment automation removes a lot of reasons to roll back

Don’t dismiss this strategy because of the potential for multi-hour recovery. That’s looking at this strategy through the lens of manual deployments. Look at it through the lens of automated deployments. Many scenarios requiring a rollback don’t apply with automated deployments.

Source control houses all the application configurations. The deployment tool automatically applies those configurations.
Code is automatically built following the same process. No steps are accidentally skipped.
The same build artifact gets deployed to all environments. What passed testing, automated or manual, goes to production.
The build artifacts are automatically deployed using the same process for all environments. The process gets tested at least once before production.
Most automated deployments take 10–15 minutes to complete. Pushing a fix through all the required environments can take less than an hour.
Most failures requiring a rollback will now occur in testing environments, not production.

This strategy requires minimal effort but has a significant pay-off.

3-minute recovery strategy

The 3-minute recovery strategy decouples the database changes from the code changes. You need to deploy the database changes hours or days before code changes. The previous application version will work with the current database changes. That removes a significant risk when rolling back.

Screenshot of the deployment processes for 3-minute recovery.
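To illustrate what “the previous application version will work with the current database changes” means in practice, here is a minimal sketch using Python’s built-in sqlite3 module. The `customers` table, the `email` column, and the two `previous_version_*` functions are hypothetical stand-ins for the old application’s data access code.

```python
import sqlite3


def previous_version_save(conn, name):
    """Data access code from the previous release: it only knows about 'name'."""
    conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))


def previous_version_list(conn):
    return [row[0] for row in conn.execute("SELECT name FROM customers ORDER BY id")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
previous_version_save(conn, "Ada")

# The database change ships ahead of the code that uses it.
# The new column is nullable, so existing inserts remain valid.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")

# The previous application version keeps working against the new schema.
previous_version_save(conn, "Grace")
names = previous_version_list(conn)
print(names)  # ['Ada', 'Grace']
```

Because the old code names its columns explicitly and the new column accepts NULL, rolling the code back never requires touching the database.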