Kubernetes has a gift for making simple problems look complicated, and complicated problems look simple. When something breaks, you often see symptoms completely unrelated to the real cause. This leads to a pattern I like to call “blaming the network team”, where issues end up being diagnosed by the wrong engineers entirely.
This is a real problem across the industry - not only for large enterprises, but also for smaller organizations transitioning away from “Steve-based deployments”, where teams are still learning how to work independently while staying coordinated.
I’ve personally experienced this dichotomy during my time as an engineer, working on both software and infrastructure, so I’m going to tell a story from two perspectives: the infrastructure engineer who initially looked at an issue, and the software engineer who was able to help solve the problem. It’s based on a real-world issue I experienced at a previous company I worked at, though I had much less fun solving it in real-time than writing this blog post.
I’ll discuss the tools used, the rabbit holes I dove into, and how I (eventually) learnt from our mistakes. Hopefully, sharing the story will help others avoid the same traps and enable teams to work together more effectively.
The backstory
The application that started it all is a GraphQL “gateway”. Written in Node.js, its job is to receive requests from clients, then connect to various backend services to retrieve data, stitch it together, and send it back. To say it’s in the critical path is an understatement - if it’s unhappy, so are customers.
One Thursday morning, it became very unhappy.
We started to receive reports from customers about timeouts and error messages. Our support team had a look at the reports and immediately escalated to the infrastructure team because of the strange errors they were seeing:
Error: getaddrinfo EAI_AGAIN accounts.production.svc.cluster.local
The support team saw getaddrinfo, googled it, and quickly decided it was a DNS problem - time to page the infrastructure engineer!
Part 1: The Infrastructure Investigation
When you’re the infrastructure engineer and DNS errors appear, you need to move quickly. DNS is fundamental, especially in Kubernetes, where it doubles as the service discovery mechanism.
Starting With What You Know
First things first: are the CoreDNS pods actually running?
❯ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-7c65d6cfc9-grq72   1/1     Running   0          2d20h
coredns-7c65d6cfc9-z5svs   1/1     Running   0          2d20h
All pods are running, with no restarts or changes since the issue started. Sweet - let’s check the logs.
❯ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.3
linux/arm64, go1.21.11, a6338e9
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.3
linux/arm64, go1.21.11, a6338e9
The logs only contain startup messages, but at least there are no errors!
Expanding the Search
If CoreDNS is healthy, it could be a general networking issue. We should quickly test from a fresh pod:
❯ kubectl run debug-dns --rm -it --image=busybox --restart=Never -- nslookup accounts.production.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10:53
Name: accounts.production.svc.cluster.local
Address: 10.96.80.99
pod "debug-dns" deleted from production namespace
It works! DNS resolution is working, at least from a debug pod, but the error we’re seeing is definitely pointing to DNS timeouts.
Checking the Metrics
Perhaps it’s a capacity issue that only becomes apparent under load. Looking at the CoreDNS metrics, everything seems normal, but just in case, let’s try scaling the CoreDNS deployment:
❯ kubectl -n kube-system scale deployment coredns --replicas 3
deployment.apps/coredns scaled
After a few minutes, nothing has changed - the gateway is still throwing errors!
Let’s try with load
Everything indicates that DNS is working fine, but the gateway is still failing. Perhaps it’s the load on the DNS servers. To test the hypothesis, we load test them:
❯ kubectl run debug-dns --rm -it --image=busybox --restart=Never -- sh -c 'while true; do nslookup accounts.production.svc.cluster.local > /dev/null || (echo "failed" && break); done'
This pod simply ran nslookup in a loop until DNS failed - except it never did.
The Dead End
At this point, we’ve spent the entire morning verifying that the DNS infrastructure is completely healthy. CoreDNS is working, the network is working, and other pods are resolving DNS without any issues. However, the gateway continues to throw EAI_AGAIN errors.
The error message said DNS was broken. Every check said DNS was fine. At this point, we’re certain that something has changed in the software itself; it’s time to throw this issue over the wall to the software team responsible for the gateway.
Part 2: The Developer Investigation
Getting a visit from the infrastructure guy is rarely good news. “DNS is broken for your app, but I’ve investigated DNS, and it’s working fine!” is particularly bad news, though.
Checking the deployment history
Since DNS is working fine for everything else in the cluster, something must have changed to trigger this issue - perhaps there was a recent deployment? Checking the deployment history in Octopus Deploy, it looks like your recent change made it through to production yesterday.
You’re familiar with this change - it’s a small performance optimization that adds a local cache. Instead of fetching and parsing schemas from backend services on every request, the gateway caches them to disk and reads them back when a request comes through. This shouldn’t impact DNS at all! In rough strokes, the change looks something like the sketch below.
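This is a hypothetical reconstruction rather than the real gateway code - names like SCHEMA_CACHE_PATH and fetchAndStitchSchemas() are illustrative only:
// schema-cache.js - hypothetical sketch of a disk-backed schema cache.
// The cache file lives on a mounted volume (which, in production, turns out
// to be shared between replicas).
const fs = require('node:fs/promises');

const SCHEMA_CACHE_PATH = process.env.SCHEMA_CACHE_PATH ?? '/cache/schema.json';

async function getSchemas() {
  try {
    // Fast path: read the previously cached, stitched schemas from disk.
    return JSON.parse(await fs.readFile(SCHEMA_CACHE_PATH, 'utf8'));
  } catch {
    // Cache miss: fetch from the backend services, then persist for next time.
    const schemas = await fetchAndStitchSchemas();
    await fs.writeFile(SCHEMA_CACHE_PATH, JSON.stringify(schemas));
    return schemas;
  }
}

async function fetchAndStitchSchemas() {
  // Stand-in for fetching each backend's schema and stitching them together.
  return { accounts: 'type Query { account(id: ID!): Account }' };
}

module.exports = { getSchemas };
With that in mind, let’s verify the manifest that was deployed.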

It’s only an image change - is it in sync in the cluster?

The resources are all in sync, so nothing in the cluster has been changed out-of-band.
Looking for patterns
You can see the error logs with DNS issues in production, but when you examine your dev and test environments, neither of them shows the issue. You also notice that the DNS errors only appear when there are lots of requests coming in at once - maybe that’s why the issue isn’t showing up in the test environment!
Replication
To try and replicate the load of the production environment in the test environment, you boot up k6 and start generating GraphQL requests. As the load goes up, you wait to see DNS errors, but … nothing! All that happens is that the service slows down slightly, even at over 10 times the load that production is experiencing.
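The load test itself is nothing fancy - something along the lines of this k6 script, where the endpoint and query are placeholders rather than the gateway’s real schema:
// load-test.js - run with: k6 run --vus 200 --duration 10m load-test.js
import http from 'k6/http';
import { check } from 'k6';

const GATEWAY_URL = 'https://gateway.test.example.com/graphql'; // placeholder URL

export default function () {
  const res = http.post(
    GATEWAY_URL,
    JSON.stringify({ query: '{ accounts { id name } }' }), // placeholder query
    { headers: { 'Content-Type': 'application/json' } }
  );
  check(res, { 'status is 200': (r) => r.status === 200 });
}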
You’ve spent most of the afternoon trying to replicate the issue in the test environment, and still, nothing - it’s looking like you might need to start the complicated rollback process before the end of the day.
Part 3: Bridging the Gap
Before finishing up for the day, everyone catches up and discusses next steps. They go over the facts first:
- The errors received are definitely related to DNS
- No other pods are receiving DNS failures
- There don’t seem to be any networking issues
- The DNS servers don’t seem to be having issues
- The DNS failures seem to start when there’s lots of load
- The same build of the software in test doesn’t have the problem, even when load tested
- The only change made was to add caching
This leads the infrastructure engineer down the path of the differences between environments:
- In test, there’s only 1 replica
- In production, there’s 3 replicas
- In test, the single replica uses an SSD volume
- In production, the replicas use a shared disk
Maybe this is somehow related to the shared disk?
Following the Thread
A search for “DNS errors Node.js Kubernetes file locks” turned up a Medium article, along with a suggestion to bump UV_THREADPOOL_SIZE to help fix DNS resolution problems. After more research, it turns out that Node.js runs filesystem calls - even “asynchronous” ones - on libuv’s threadpool, which has only 4 threads by default. When a file is locked and you attempt to read it, one of those threads sits there waiting for the call to return. Another job of those same threads is DNS resolution via dns.lookup()!
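You can see the effect in isolation with a small sketch - this isn’t the gateway code, and the FIFO is just a stand-in for a read that’s stuck behind a file lock (create it first with mkfifo /tmp/stuck.pipe):
// threadpool-starvation.js - run with: node threadpool-starvation.js
// fs operations and dns.lookup() share libuv's threadpool (4 threads by default).
const fs = require('node:fs');
const dns = require('node:dns');

// Pin all 4 default threads: opening a FIFO with no writer blocks indefinitely,
// much like a read waiting on a file lock.
for (let i = 0; i < 4; i++) {
  fs.readFile('/tmp/stuck.pipe', () => {});
}

// This lookup now has to queue behind the stuck filesystem work.
let resolved = false;
const start = Date.now();
dns.lookup('example.com', (err, address) => {
  resolved = true;
  console.log(`lookup finished after ${Date.now() - start}ms:`, err ?? address);
});

setTimeout(() => {
  if (!resolved) console.log('5 seconds in and the DNS lookup is still queued');
}, 5000);
Running it again with UV_THREADPOOL_SIZE=64 gives libuv enough spare threads for the lookup to complete straight away, which is exactly why bumping that variable works as a stop-gap.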
The developer quickly made a change to move the cache into memory and re-hydrate it on launch. In the meantime, the infrastructure engineer bumped the UV_THREADPOOL_SIZE environment variable to buy some headroom while the fix rolled out, and the problem began to ease immediately.
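The in-memory version of the cache is roughly this shape - again a sketch with illustrative names rather than the real gateway code:
// In-memory schema cache, hydrated once at startup. The request hot path
// never touches the filesystem, so it can't tie up the libuv threadpool.
const schemaCache = new Map();

async function hydrateSchemaCache() {
  const schemas = await fetchAndStitchSchemas();
  for (const [service, schema] of Object.entries(schemas)) {
    schemaCache.set(service, schema);
  }
}

function getSchema(service) {
  return schemaCache.get(service); // plain Map lookup, no I/O on the hot path
}

async function fetchAndStitchSchemas() {
  // Stand-in for fetching each backend's schema and stitching them together.
  return { accounts: 'type Query { account(id: ID!): Account }' };
}

module.exports = { hydrateSchemaCache, getSchema };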
The post-mortem
During the incident post-mortem, we ended up writing up what went wrong:
- The schema cache on one pod writes files to a shared volume
- Another pod attempts to read/write the same schema files
- Under high traffic, these operations deadlock
- The cache operations saturate the threadpool while waiting on locks
- dns.lookup() calls queue up waiting for a free thread
- Queued DNS lookups eventually time out
- The application logs EAI_AGAIN
In the end, even though the error was related to DNS, the actual problem was filesystem contention that only occurred when multiple replicas shared the same volume.
Part 4: Looking to the future
Though finding this bug was painful for me at the time, it helped me to bring a new approach to troubleshooting failures, especially around distributed systems like Kubernetes with multiple layers of abstraction.
Involve developers early
Developers have the most context about what’s happening inside their app - they know the recent changes and usage patterns far more intimately than the infrastructure team does. Ensure developers have the access they need to investigate independently.
Make sure deployment history is visible
When something breaks, “What changed?” should be the first question. Making it easy for both infrastructure and software teams to see the deployment history to environments is one of the best things you can do.

Document dependencies
Software tends to have the most problems in the seam between software and hardware - IO. When using networking, storage, or other IO, it’s worth documenting why and how you’re doing so. It helps infrastructure teams understand requirements, especially in distributed systems.
Automate rollbacks
A significant reason this bug affected people for longer than we had hoped was the complicated, manual rollback process. One of the first things we did afterwards was create a runbook to automate rolling back, so we could mitigate problems faster and buy time to investigate.

Conclusion
When you’re troubleshooting in Kubernetes, you’re troubleshooting a system that spans multiple layers of abstraction, multiple teams, and multiple mental models. No matter what tools or processes you use, when you’re trying to solve a problem, allowing as many people as possible to contribute their understanding of the systems involved is the most important thing to do.
The next time you’re debugging an incident and everything points at DNS, I hope you’ll remember this story about one of the rare times it wasn’t. Remember the lessons we learnt, so that if your next issue isn’t really DNS either, you can dig deeper and solve it with far less stress than we did.


