Outage on octopus.com - report and learnings

Alix Klingenberg

From 3:37pm UTC Wednesday, January 25, 2023 to 3:44pm UTC Thursday, January 26, 2023, customers in some regions couldn’t access octopus.com/signin, the blog, the docs, and other URLs, as they returned HTTP 404 (page not found) responses. This mostly affected the US and EU, with some intermittent impact in regions in Asia.

In this post, we detail what happened, our response, and our learnings.

How the incident started

On January 25, we undertook scheduled maintenance on the octopus.com and octopusdeploy.com domains as part of a planned DNS and routing provider migration. In this step of the migration, the routing setup in one Azure Front Door (AFD) profile was transferred to another AFD profile. The new AFD profile didn’t behave as expected.

We identified that a subset of requests in certain regions were incorrectly routed by the new AFD profile, resulting in HTTP 404 responses. We publicized the service interruption for the scheduled maintenance on our Status Page.

We treated this as an incident because the maintenance didn’t go as planned and there were complicating factors:

  • A network outage from Azure
  • An internal routing issue in AFD
  • The new WAF created significant noise in our applications
  • The issue occurred during a public holiday in Australia, where many of our engineers are based

The changes to the AFD profiles couldn’t be safely rolled back.

Key timings

Event                          Time period
Time to detection              10hrs 21mins
Time to incident declaration   13hrs 27mins
Time to resolution             37hrs 44mins
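
These durations line up with measuring from the 02:00 UTC change deployed on January 25; a quick check with Python's standard library, using timestamps from the timeline below, reproduces them:

    # Sanity check of the key timings, assuming they're measured from the
    # 02:00 UTC routing change on January 25 (timestamps from the timeline below).
    from datetime import datetime

    change_deployed = datetime(2023, 1, 25, 2, 0)    # routing change deployed
    first_report    = datetime(2023, 1, 25, 12, 21)  # intermittent 404s reported internally
    declared        = datetime(2023, 1, 25, 15, 27)  # incident declared on the Status Page
    resolved        = datetime(2023, 1, 26, 15, 44)  # Status Page: incident resolved

    print(first_report - change_deployed)  # 10:21:00        -> 10hrs 21mins
    print(declared - change_deployed)      # 13:27:00        -> 13hrs 27mins
    print(resolved - change_deployed)      # 1 day, 13:44:00 -> 37hrs 44mins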

Incident timeline

(All dates and times below are shown in UTC)

Wednesday, January 25, 2023

02:00 An Octopus engineer deployed a change to the octopus.com and octopusdeploy.com domains, moving the associated routing between Azure Front Door profiles as part of a migration project.

02:39 The engineer switched the AFD WAF from prevention mode to detection mode because it was blocking valid traffic.

07:05 A network outage in Azure obscured the issue - see our Octopus Cloud connectivity disruption report.

12:21 An application support engineer in the EU region reported experiencing intermittent 404s on the octopus.com domain.

12:43 Azure’s network outage was fully resolved.

14:06 The first customer-facing issue was reported.

14:06 - 15:27 Application support engineers investigated the issue and determined it was intermittent and affecting customers in the EU and US regions.

15:27 Status Page updated: An incident was declared.

15:27 - 17:06 Investigations showed that the web app and database for the affected routes weren't under any significant load. The new AFD profile reported a warning on the octopus.com CNAME record.

17:06 The on-call engineer determined the issue was in AFD: an endpoint dynamically resolved to 2 IP addresses, and one of them consistently returned a 404.

17:22 Status Page updated: We are continuing to investigate this issue.

17:35 Azure support ticket raised with "Severity: B".

18:32 Status Page updated: Identified affected regions.

20:32 The incident was escalated to the engineer responsible for the AFD migration project.

20:32 - 21:42 We investigated the warning about CNAME records on the 2 domains in AFD.

21:36 Status Page updated: Internal escalation.

21:42 We had a call with an Azure support rep, who advised the warning about CNAME records was likely the root issue.

21:42 - 22:26 Attempted mitigation steps based on the assumption that CNAME flattening was the root cause.

22:26 CNAME flattening was ruled out as the cause based on successful requests to one of the 2 IP addresses resolved from the affected endpoint. CNAME flattening had also worked successfully on the previous AFD profile for approximately a year. Our engineers focused on the failing route in the AFD profile.

22:37 The support ticket with Azure was escalated to "Severity: A".

22:37 - 02:01 Our engineers worked closely with the Azure support reps to provide enough information for Azure to identify and reproduce the issue.

22:59 Status Page updated: Cloud provider escalation.

Thursday, January 26, 2023

02:01 A new potential root cause was identified, leading to a support request being raised with the original DNS provider.

02:22 An Azure application support engineer noted a high volume of 404 requests across the AFD profile, a result of the WAF being set to detection mode instead of prevention mode.

03:12 The incident was escalated internally to bring in senior engineering management staff due to the length of time without a clear path to resolution.

04:15 A problem with the routing was identified: renewed focus on the AFD profile.

04:19 Status Page updated: Incident status changed to identified.

05:42 The previous cloud provider advised that their hosts were resolving to the correct IP addresses and recommended asking public resolvers to flush their DNS caches.

05:43 A potential mitigation was suggested: use a www subdomain to control request resolution in AFD.

07:06 Azure advised that AFD could not support CNAME flattening.

07:13 Internal escalation to involve Trust & Safety and SecOps.

07:31 Azure confirmed the only path to resolution they saw was moving all DNS records to Azure DNS. Azure confirmed they'd sync the correct configuration to all their edge nodes.

08:49 We created a new AFD endpoint for www.octopus.com and added a redirect from the CNAME with the existing DNS provider.

08:53 Updated app services to accept the new origin header.

09:26 Disabled app-level redirect from www.octopus.com to octopus.com.

10:05 Updated config for affected SSO services.

10:07 Changed the root octopus.com to be a CNAME and created a 302 redirect from octopus.com to www.octopus.com.

10:18 Testing showed previously broken routes working again.

11:07 Status Page updated: Incident status changed to monitoring, and advice was issued on how to resolve the issue.

11:10 Internal reports showed that affected EU regions flipped from “mostly not working” to “mostly working”.

11:49 Internal reports showed that octopus.com was still returning 404s.

12:19 Confirmed that the routing issue inside AFD was still happening despite a catch-all rule that should have been applied.

12:36 Attempted to force AFD to route certain URLs by creating an explicit set of rules.

12:38 Status Page updated: Incident status changed back to identified.

13:23 Azure confirmed they had reproduced the issue and engaged a Product Group for a fix.

14:30 Availability tests started reporting success.

14:53 Azure confirmed that a fix had been applied an hour earlier.

15:07 Status Page updated: Incident status changed back to monitoring.

15:44 Status Page updated: Incident resolved.

Technical details

The 404s were occurring because customers' requests were being routed to the incorrect web app. The 404 responses were region specific and intermittent. Working out the exact cause of the incorrect routing was complex. We needed to dive into the relationship between the DNS provider and the endpoint routing in Azure Front Door (AFD).

What we observed

A specific AFD endpoint, octo-public-prod-endpoint-f2e5gmecbdf3bfg9.z01.azurefd.net, for the octopus.com domain was dynamically resolving to one of 2 IP addresses. One of the 2 IP addresses returned a 404 on all the associated routes, /signin, /docs, and /blog, while the other IP address worked.

This indicated that the CNAME warning was unlikely to be the cause, as some traffic was being successfully routed.

We discovered the 2 IP addresses that the AFD route resolved to were different from the IP addresses of the A-B pair of the web apps expecting to receive this traffic.
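
The sketch below shows roughly how this can be checked from the outside with Python's standard library: resolve the AFD endpoint, then send the same requests to each returned IP address while presenting octopus.com as the TLS server name and Host header. The endpoint and paths are the ones named above; everything else is illustrative rather than the exact commands we ran.

    # Probe each IP address behind the AFD endpoint directly, presenting
    # octopus.com for SNI and the Host header, and report the HTTP status.
    import socket
    import ssl
    import http.client

    ENDPOINT = "octo-public-prod-endpoint-f2e5gmecbdf3bfg9.z01.azurefd.net"
    HOST = "octopus.com"
    PATHS = ["/signin", "/docs", "/blog"]

    def status_via_ip(ip: str, host: str, path: str) -> int:
        """GET `path` from a specific IP while presenting `host` for SNI and Host."""
        ctx = ssl.create_default_context()
        raw = socket.create_connection((ip, 443), timeout=10)
        tls = ctx.wrap_socket(raw, server_hostname=host)
        conn = http.client.HTTPSConnection(host)
        conn.sock = tls  # reuse the connection we already made to the chosen IP
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status

    ips = sorted({info[4][0] for info in
                  socket.getaddrinfo(ENDPOINT, 443, type=socket.SOCK_STREAM)})
    for ip in ips:
        for path in PATHS:
            print(ip, path, status_via_ip(ip, HOST, path))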

What happened

AFD doesn’t update all its nodes when a routing change happens, so some AFD nodes still directed traffic to the incorrect web application. The affected endpoint was resolving to the incorrect application before the migration. Without control of a CNAME record, AFD couldn’t authoritatively update the flattened A records.

We set up a CNAME record on AFD to give them authoritative control of all requests on that subdomain. The incorrect IP address resolution behavior on the affected endpoint continued.

The incorrect routing inside AFD stopped after Azure advised us they’d made internal changes.

Remediation

There were 2 red herrings that complicated the investigation but weren’t contributing factors:

  • The 404 errors from the WAF
  • The CNAME flattening on the apex domain

In practice, removing the proxy and switching to AFD’s WAF were unrelated to this issue. However, they created significant noise which obscured the real problem.

Similarly, using CNAME flattening on the apex domain made the causes unclear. CNAME flattening is where the DNS provider creates a series of A records, instead of a single CNAME record, for dynamic resolution of the underlying IP address. You can learn more in this explainer by Cloudflare.
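
As a rough illustration of what flattening looks like from the outside (this uses the third-party dnspython package and isn't part of our tooling): the apex name answers with A records and no CNAME, while a conventional subdomain answers with a CNAME pointing at its target.

    # Illustrative only: report whether a name is served via a CNAME record
    # or via A records (as with CNAME flattening). Requires dnspython.
    import dns.resolver

    def describe(name: str) -> None:
        try:
            answer = dns.resolver.resolve(name, "CNAME")
            print(f"{name}: CNAME -> {answer[0].target}")
        except dns.resolver.NoAnswer:
            # No CNAME at this name: the provider answers with A records directly,
            # which is what CNAME flattening looks like to a resolver.
            a_records = dns.resolver.resolve(name, "A")
            print(f"{name}: A -> {[r.address for r in a_records]}")

    describe("octopus.com")      # apex: flattened A records (pre-mitigation setup)
    describe("www.octopus.com")  # subdomain: a regular CNAME to the AFD endpoint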

Based on our best understanding of the problem, we created a www subdomain on the affected domains. We created a CNAME record on this subdomain to give AFD authoritative control over resolving all requests to that domain. We then redirected traffic from the affected domains to the appropriate www subdomains.
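
The sketch below is illustrative only (a Flask-style app rather than our actual edge configuration), but it shows the shape of that redirect: any request arriving for the apex domain is answered with a 302 to the same path on the www subdomain, where AFD holds authoritative control through the CNAME record.

    # Illustrative apex -> www redirect (not our production setup).
    from flask import Flask, redirect, request

    app = Flask(__name__)

    APEX = "octopus.com"
    WWW = "www.octopus.com"

    @app.before_request
    def apex_to_www():
        host = request.host.split(":")[0]
        if host == APEX:
            # Preserve the path and query string; 302 keeps the redirect temporary.
            suffix = request.full_path if request.query_string else request.path
            return redirect(f"https://{WWW}{suffix}", code=302)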

This mitigation helped initially, but the incorrect IP address resolution in the AFD endpoint reappeared. The IP issue resolved after Azure advised they'd made changes to our AFD profile.

We understand Azure moved the affected AFD profile “from a legacy platform to the most updated platform,” but we haven’t had official confirmation at the time of writing.

We considered rolling back the AFD profile changes after we understood the problem. We decided against this because we knew only a subset of the AFD nodes would update, leaving some regions in a broken state.

Next steps

To help us understand this incident in more depth, we've engaged with our Azure account manager on this issue.

Immediately following resolution, we addressed the impact of the www subdomain mitigation, particularly around CORS, static assets, and SEO.
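
On the CORS side, the change is simple but easy to miss; a hedged sketch (not our actual configuration) of the kind of allowlist update the www mitigation requires:

    # Hypothetical CORS allowlist update after the www mitigation.
    ALLOWED_ORIGINS = {
        "https://octopus.com",
        "https://octopusdeploy.com",
        "https://www.octopus.com",        # added after the www mitigation
        "https://www.octopusdeploy.com",  # added after the www mitigation
    }

    def cors_headers(request_origin: str) -> dict:
        """Return CORS response headers only for origins on the allowlist."""
        if request_origin in ALLOWED_ORIGINS:
            return {"Access-Control-Allow-Origin": request_origin, "Vary": "Origin"}
        return {}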

After we were back to a stable state, we continued with the migration project. We adapted the project to work closely with Azure to ensure that our changes work as expected. We continue to have scheduled downtime as we progress our migration. Based on our learnings, we can design rollback-safe migration steps. This helps us minimize the impact of scheduled downtime by rolling back immediately when there's unexpected behavior.

We ran an incident review and identified areas of improvement:

  • Development process: We are reviewing the quality controls in our development process to identify where we could have detected the problem before applying the changes to our production environment.

  • Incident response process for long-running incidents: We haven't experienced an incident over this length of time before. We started to run out of rested on-call engineers with appropriate context to manage the incident. We're investigating how other internal and external groups manage these situations and are updating our on-call rosters and guidance accordingly.

Conclusion

Octopus Deploy takes service availability seriously. The time to resolution on this incident was longer than we aim for. We apologize for the disruption to our customers due to this outage.
