Smoke Testing Your Infrastructure With Runbooks

Matthew Casperson • May 11, 2022 • 6 min read

The internet is broken!

Anyone who has spent some time on a help desk has heard this, and other equally vague descriptions of issues customers run into. Getting actionable information is half the battle when diagnosing an issue.

When you support complex infrastructure, though, it can be hard to know how the system was designed, which makes it hard to know what questions to ask and where to find the information needed to resolve an issue. Custom applications in enterprise environments are typically either an evolution of the last one or written by an entirely different team using a unique approach each time. This means business knowledge about how an application should be supported is often found only in the heads of a few employees.

Runbooks provide a way to capture this business knowledge in an automated and testable way, ensuring the support team can quickly diagnose high-level issues and efficiently respond to customer requests.

In this post, I provide an example runbook aimed at the level 1 support team, designed to smoke test a typical microservice application in AWS.

Prerequisites

This post assumes that the runbook steps are run on a Linux Worker. They use dig for DNS lookups, hey for load testing, curl for interacting with HTTP endpoints, and the mysql client.

To install the tools in Ubuntu, run the following command:

apt-get install curl dnsutils mysql-client

To install the tools in Fedora, RHEL, CentOS, or Amazon Linux, run:

yum install curl mysql bind-utils

Then manually download hey with the command:

curl -o /usr/local/bin/hey https://hey-release.s3.us-east-2.amazonaws.com/hey_linux_amd64
chmod +x /usr/local/bin/hey
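Before relying on the runbook, it can help to confirm that the Worker actually has these tools on its PATH. The following is a minimal sketch, not part of the original runbook; missing_tools is a hypothetical helper name made up for illustration.

```shell
#!/usr/bin/env bash

# Print the name of every tool passed as an argument that is not on the PATH.
# missing_tools is a hypothetical helper, not part of Octopus.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# The four tools the smoke tests in this post depend on.
MISSING=$(missing_tools dig hey curl mysql)
if [ -n "$MISSING" ]; then
  echo "The following tools are not installed:"
  echo "$MISSING"
else
  echo "All smoke test tools are installed."
fi
```

In a real runbook step you would probably exit with a non-zero code when the list is not empty, so the step fails before any smoke test runs.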

We created a public Octopus instance with this runbook defined against a live microservice. Log in with the guest account to view the runbook steps and list the results of previous executions.

Smoke testing DNS

DNS lets you map friendly names, like development.octopus.pub, to IP addresses, like 52.52.151.20.

DNS is usually a stable service, but when it fails it’s likely no other networked service will function correctly. For this reason, the first smoke test you want to perform on services exposed to the internet is to verify that the DNS name can be resolved.

The script below executes dig to inspect the DNS records associated with a domain name:

dig "#{Octopus.Environment.Name | ToLower}.octopus.pub"
# Capture the return code of the previous command. This will be used as the exit code of the entire script once we print out
# any further instructions.
RETURN_CODE=$?
echo "============================"
echo "HOW TO INTERPRET THE RESULTS"
echo "============================"
echo "The dig command returns a lot of technical details, most of which are not important from a level 1 support point of view."
echo "As long as the command passes, you can assume the DNS is correctly configured."
echo "If the command fails, escalate this issue to level 2 support."
# Exit the script with the return code from the smoke test
exit $RETURN_CODE

The output of this script looks like this:

Tools like dig tend to be quite technical, and the output requires some experience to interpret. However, your level 1 support team doesn’t usually need a deep understanding of networking issues like DNS, so the script must explain the results and any further actions. This is an example of capturing business knowledge in a runbook, and it means even new starters can run these runbooks and be confident in responding to the results.
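One caveat is worth knowing: according to the dig man page, dig exits with 0 whenever it receives a reply, including an NXDOMAIN reply for a name that does not exist. If you want the step to fail when no record comes back, one option is to inspect the output of dig +short instead of relying on the exit code alone. The following is a minimal sketch under that assumption; has_dns_answer is a hypothetical helper, and the sample address is the example used earlier in this post.

```shell
#!/usr/bin/env bash

# Succeed when the supplied "dig +short" output contains at least one record.
# Lines starting with ";" are dig diagnostics (for example timeouts), not records.
# has_dns_answer is a hypothetical helper, not part of dig or Octopus.
has_dns_answer() {
  printf '%s\n' "$1" | grep -v '^;' | grep -q '[^[:space:]]'
}

# In a runbook step, the answer would come from dig, for example:
#   ANSWER=$(dig +short "development.octopus.pub")
# A sample value is used here so the sketch runs without network access.
ANSWER="52.52.151.20"

if has_dns_answer "$ANSWER"; then
  echo "DNS lookup returned at least one record."
else
  echo "DNS lookup returned no records. Escalate this issue to level 2 support."
fi
```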

Smoke testing MySQL

In this example, our application uses a MySQL database for persistence. If the database isn’t accessible, the service will fail, so the next step is to write a script to smoke test the database.

The script below uses the mysql command-line tool to attempt to query a known database table. Note the results are redirected to /dev/null, as we don’t want to populate the logs with actual database records:

DATABASE_HOST=$(get_octopusvariable "Database.Host")
USERNAME=$(get_octopusvariable "Database.AuditsUsername")
PASSWORD=$(get_octopusvariable "Database.AuditsPassword")

echo "Database Host: $DATABASE_HOST"
echo "Database Username: $USERNAME"

mysql --host="$DATABASE_HOST" --user="$USERNAME" --password="$PASSWORD" audits -e "SELECT * FROM audits" > /dev/null
# Capture the return code of the previous command. This will be used as the exit code of the entire script once we print out
# any further instructions.
RETURN_CODE=$?

echo "============================"
echo "HOW TO INTERPRET THE RESULTS"
echo "============================"
echo "This test attempts to query the audits database."
echo "If this step fails, escalate the issue to level 2 support."

# Exit the script with the return code from the smoke test
exit $RETURN_CODE
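Passing the password with --password puts it on the command line, and some versions of the mysql client print a warning about this that ends up in the runbook log. As a hedged alternative, the connection details can be written to a temporary option file and passed with --defaults-extra-file. The sketch below is illustrative only: write_mysql_option_file and smoke_test_audits are hypothetical helper names, and passwords containing double quotes or backslashes would need extra escaping.

```shell
#!/usr/bin/env bash

# Write a temporary MySQL option file with the connection details and print
# its path. mktemp creates the file so that only the current user can read it.
# write_mysql_option_file is a hypothetical helper, not part of MySQL or Octopus.
write_mysql_option_file() {
  local host="$1" user="$2" password="$3" file
  file=$(mktemp) || return 1
  {
    printf '[client]\n'
    printf 'host=%s\n' "$host"
    printf 'user=%s\n' "$user"
    # Quoting the value keeps characters such as # from starting a comment.
    printf 'password="%s"\n' "$password"
  } > "$file"
  echo "$file"
}

# Query the audits table using the option file passed as the first argument.
# --defaults-extra-file must be the first option on the command line.
# LIMIT 1 keeps the query cheap; the result is discarded as before.
smoke_test_audits() {
  mysql --defaults-extra-file="$1" audits -e "SELECT 1 FROM audits LIMIT 1" > /dev/null
}

# Example usage inside a runbook step (left commented out so the sketch
# does not try to reach a database):
#   OPTION_FILE=$(write_mysql_option_file "$DATABASE_HOST" "$USERNAME" "$PASSWORD")
#   smoke_test_audits "$OPTION_FILE"
#   RETURN_CODE=$?
#   rm -f "$OPTION_FILE"
```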

Smoke testing HTTP services

The next test verifies that public HTTP endpoints respond with the expected status code. Web services always return a status code with any response, and typically you can assume that a public URL will return a code of 200, which indicates a successful response.

For a complete list of HTTP response codes, refer to the MDN documentation.

For this test, we make use of a step in the community step template library called HTTP - Test URL (Bash). This step defines a Bash script that calls curl against the supplied URL and verifies the HTTP status code.
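The step template's source is not reproduced here, but the core idea can be sketched in a few lines. This is a minimal illustration, not the template's actual script; fetch_status and status_matches are hypothetical helper names, and the URL in the comment is assembled from the example hostname and API path used elsewhere in this post.

```shell
#!/usr/bin/env bash

# Print only the HTTP status code for a URL. --write-out '%{http_code}' prints
# the code, --output /dev/null discards the body, and --max-time bounds the wait.
# curl prints 000 when it gets no HTTP response at all.
fetch_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Succeed when the actual status code equals the expected one.
status_matches() {
  local expected="$1" actual="$2"
  [ "$actual" = "$expected" ]
}

# In a runbook step, the code would come from curl, for example:
#   ACTUAL=$(fetch_status "https://development.octopus.pub/api/audits")
# A sample value is used here so the sketch runs without network access.
ACTUAL="200"

if status_matches 200 "$ACTUAL"; then
  echo "The endpoint returned the expected status code $ACTUAL."
else
  echo "The endpoint returned $ACTUAL instead of 200. Escalate this issue to level 2 support."
fi
```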

Load testing

The previous three smoke tests verify the fundamental layers of our application’s infrastructure. You can expect that if any of them fail there’s a serious issue.

However, it’s still possible the application is working, but is unusable because it’s slow or randomly fails to respond to requests. Your final smoke test performs a quick load test using hey to verify that the application responds consistently to multiple requests. The script below calls hey against the microservice API:

# Warm up
hey "https://#{Octopus.Environment.Name | ToLower}.octopus.pub/api/audits" > /dev/null

# Real test
hey "https://#{Octopus.Environment.Name | ToLower}.octopus.pub/api/audits"
# Capture the return code of the previous command. This will be used as the exit code of the entire script once we print out
# any further instructions.
RETURN_CODE=$?

echo "============================"
echo "HOW TO INTERPRET THE RESULTS"
echo "============================"
echo "It is expected that the majority of requests complete in under a second."
echo "If the chart above shows the majority of requests taking longer than a second, please escalate this issue to level 2 support."
# Exit the script with the return code from the smoke test
exit $RETURN_CODE

The output from this script is shown in the screenshot below:

This output requires some interpretation to decide what further action to take. The histogram shows the response time for each of the requests, and in this example you’d expect the majority of requests to be completed in less than a second. The instructions guide support team members running this script to make the appropriate decision based on the output.
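The script above leaves the pass or fail decision to the person reading the histogram. If you would rather have the step make that call itself, one possible approach is to collect response times and check whether most of them exceed the one-second threshold. The sketch below is not part of the original runbook and does not parse hey's output; majority_slower_than is a hypothetical helper, and the timings are sample values of the kind curl reports with --write-out '%{time_total}'.

```shell
#!/usr/bin/env bash

# Succeed when more than half of the supplied response times (in seconds)
# are slower than the threshold given as the first argument.
# majority_slower_than is a hypothetical helper, not part of hey or curl.
majority_slower_than() {
  local threshold="$1"
  shift
  printf '%s\n' "$@" | awk -v t="$threshold" '
    $1 + 0 > t + 0 { slow++ }
    END { if (slow * 2 > NR) exit 0; exit 1 }
  '
}

# Sample timings in seconds. In a runbook step each value could come from:
#   curl --silent --output /dev/null --write-out '%{time_total}\n' "$URL"
TIMINGS="0.21 0.35 0.18 1.40 0.27"

# $TIMINGS is intentionally unquoted so each value becomes its own argument.
if majority_slower_than 1 $TIMINGS; then
  echo "The majority of requests took longer than a second. Escalate this issue to level 2 support."
else
  echo "The majority of requests completed within a second."
fi
```

Keeping the decision in a small function like this makes the threshold easy to adjust per service without touching the instructions printed for the support team.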

Conclusion

Every application you encounter in an enterprise environment requires a multitude of underlying services and infrastructure to be operating correctly. By writing smoke tests that probe and verify those layers, support teams can quickly confirm issues and respond to support requests efficiently and confidently.

In this post, you saw examples of smoke tests that verified DNS services, HTTP endpoints, and MySQL databases. You also saw a simple load test that provided insight into the performance of a service when responding to multiple requests.

Read the rest of our Runbooks series.

Happy deployments!
