
Platform Engineering at scale: Lessons from the conference circuit

Tony Kelly

Q4 is conference season in the US, with GitHub Universe, KubeCon (including ArgoCon and Platform Engineering Day), and AWS re:Invent all taking place in November and December.

In our latest episode of CD Office Hours, Bob Walker and Steve Fenton shared some of the topics that came up at these events. You can get some interesting insights into where the industry is right now with Platform Engineering. The TL;DR is that most organizations are on version 2.0 or 3.0 of their platforms, and there are some hard-won lessons to share along the way.

Differing audiences for GitHub Universe and KubeCon

It’s worth calling out an interesting observation Bob made about the two audiences. GitHub Universe draws a much higher ratio of organization leadership to individual contributors than KubeCon, which leads to more strategic conversations at GitHub Universe and more tactical conversations at KubeCon.

However, the same themes resonated at both events: Platform Hub, templating, and blueprints and guardrails. A key initiative at many organizations is finding ways to standardize their deployment processes while maintaining flexibility.

KubeCon + GitHub Universe intro

Platform 2.0 and 3.0: Learning the hard way

Platform Engineering Day takes place on Day 0 of KubeCon week, and Octopus Deploy is one of its sponsors.

It was at Platform Engineering Day that the theme of teams working on the second or third iteration of their internal platforms really came to light.

Bob describes the typical theme that came up in conversation: “They’ve kind of like, oh, we’ve tried this and we’ve run into this issue. It’s like, okay, now we need to kind of pull this back and rebuild it a bit because it’s just not working out for what we have.”

It’s essential not to frame this as a failure story, but rather as a maturity story. The first iteration of the platform works well for the initial teams that adopt it. However, when reality hits and you need to scale to hundreds or thousands of teams, new requirements and considerations come into play. That’s when version 2.0, 3.0, and beyond are born.

Challenges and insights from early adopters

The early adopters of a new platform usually fall into a few categories:

  • They’re more forward-thinking and cutting-edge
  • They’re often working on greenfield applications
  • They can get away with shortcuts that don’t scale

Bob explains: “When you start rolling it out at scale to hundreds or thousands of teams, you’re probably just not going to be able to do those shortcuts. And so that’s where people are starting to see some of these challenges at scale.”

Scaling agile: Challenges and insights from early teams

With one of the major events of the last few weeks being KubeCon, it’s no surprise that many Platform Engineering conversations centered around Kubernetes and Argo CD. Some organizations are opting to standardize on GitOps solutions, but this presents some challenges.

While Argo CD is built around the concept of declarative GitOps deployments, there are some gaps that enterprises often bump into:

  • Limited environment concept: You cannot natively model dev, staging, and production in Argo CD; they often end up as separate applications (see the sketch after this list).
  • No concept of environment progression: Because of the point above, you cannot push immutable, versioned releases through your environment lifecycle.
  • Limited orchestration: There is limited support for pre- and post-deploy steps, such as ServiceNow integration or database deployments.
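
To make the first gap concrete, here’s a minimal sketch of the common “one Application per environment” pattern, assuming a Kustomize-style repository layout. The repository URL, paths, and cluster addresses are placeholders; real setups vary, but the shape is the same.

```yaml
# A minimal sketch: each environment is modeled as its own Argo CD Application.
# Repository URL, paths, and cluster addresses are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/my-service-config.git
    targetRevision: main
    path: overlays/dev            # dev configuration lives in one path...
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-dev
  syncPolicy:
    automated: {}
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/my-service-config.git
    targetRevision: main
    path: overlays/production     # ...and production in another
  destination:
    server: https://prod-cluster.example.com
    namespace: my-service
```

Nothing in these manifests links the two environments or moves a tested, versioned release from one to the other; that promotion logic has to live somewhere outside Argo CD, which is exactly the second gap above.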

All of these gaps become more apparent when teams move beyond the early adopters’ ‘happy path’ deployments.

The ‘How many Argos?’ question

How do these Argo pains show up when you talk to attendees at an event like KubeCon?

There’s usually a sign in how they answer the ‘How many Argos?’ question.

Steve explains: “If you have one Argo, it is actually superbly simple and you just don’t have many problems. But as soon as you have an answer that’s greater than one, you start bumping into those things where it gets a bit more tricky.”

The answer also isn’t limited to the number of Argo instances in play. The number of applications within a single Argo instance can also throw a few curveballs, as Bob Walker discovered: “We have one Argo instance, but we have 30,000 applications in that Argo instance.”

Argo CD: Scaling challenges and best practices explained

Bob goes on to explain that the team in question has adopted a classic hub-and-spoke model: one Argo CD instance talking to multiple Kubernetes clusters.
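
For context, in a hub-and-spoke setup each spoke cluster is typically registered with the hub declaratively as a Kubernetes Secret in the argocd namespace, labeled so Argo CD treats it as a cluster. Here’s a rough sketch with placeholder names, addresses, and credentials:

```yaml
# A rough sketch of registering a spoke cluster with the hub Argo CD instance.
# The cluster name, API server URL, and credentials are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: team-a-prod-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # marks this Secret as a cluster definition
type: Opaque
stringData:
  name: team-a-prod
  server: https://team-a-prod.example.com:6443
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```

Each of these Secrets hands the hub credentials for another cluster, which is where the security trade-off Steve describes next comes from.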

This approach simplifies where you look for things, as everything is contained within one Argo CD instance, but it does create security concerns, as Steve explains: “You’re effectively giving people this one room. If you can get into this one room, all the vaults come off of that one room, and you just have access to everything.”

Bob’s response brings us back to the questions behind any platform architecture decision: “Does it work for you? Does it satisfy your company’s policies and requirements?”

Which, as Steve reinforces, often leads to discovering that “people don’t know what their organizational policy and requirements are. So it’s kind of back to the drawing board.”

Argo CD: Hub-and-spoke model risks

Make policies mandatory, not platforms

That segues us nicely across the Atlantic Ocean to a London DevOps meetup where the topic of Platform Engineering was still high on the agenda.

Steve walked us through the conversation around his London DevOps presentation, which challenges the typical approach to platform adoption.

  • The typical approach: Make the platform mandatory
  • The better approach: Make policies mandatory; the platform is optional

The compliance disconnect

Steve describes the typical pattern that happens within an organization: “Teams that are on the platform have got all of this compliance and security stuff, all of these requirements, and they’re hitting them because they’re on the platform. And if you’re a team that’s not on the platform, you kind of get away with it. You just don’t do them.”

That leads to teams avoiding the platform so they can dodge compliance requirements and take shortcuts. If the goal instead becomes making sure everyone meets the compliance requirements, the platform naturally becomes the more appealing option.

Creating demand instead of mandates

Steve explains how this mindset shift can make an impact: “Instead of making your platform mandatory, you need to make your policies mandatory. Every pipeline should include security scanning and artifact attestation or whatever it is. That’s what should be mandatory. But if the team solves it without the platform, you’re happy.”
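
As a rough sketch of what a mandatory policy can look like for a team that isn’t on the platform, here’s a pipeline job that fails the build if a container image has high-severity vulnerabilities. This assumes GitHub Actions and Trivy, and the image name and registry are placeholders; any tooling that enforces the same check satisfies the policy.

```yaml
# A minimal sketch of a mandatory policy check, assuming GitHub Actions and Trivy.
# The image name and registry are placeholders.
name: policy-checks
on: [push]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t registry.example.com/my-service:${{ github.sha }} .

      - name: Scan image for high-severity vulnerabilities
        run: |
          # Fail the pipeline if HIGH or CRITICAL issues are found
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            aquasec/trivy:latest image \
            --exit-code 1 --severity HIGH,CRITICAL \
            registry.example.com/my-service:${{ github.sha }}
```

Artifact attestation, license checks, and so on would be further steps in the same spirit. The check is the mandatory part; the platform’s value is that teams get those checks for free.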

Mandatory policies over platform: A smart strategy?

This approach creates natural demand for the platform. Teams realize: “Yeah, we can do CI/CD. But then there’s all these other things that we don’t really want to do, but if we use the platform, we’ll get those for free.”

AI is now everywhere, so where does it fit in the conversation with platform teams?

The bridge between AI and platform teams is a common pain point: platform teams are dealing with an increasing number of ‘support tickets’ for the platform.

Bob describes the problem: “When I talked to some Platform Engineers at Platform Engineering Day, they’re like, yes, that happens a lot more than we think because we’ve effectively become a ticketing system. We have all these templates, and we’re the ones supplying them to the developers. And if something goes wrong immediately, they just turn around and say, this isn’t working, fix it.”

AI-powered triage and remediation

The solution focuses on enabling developers to self-serve and resolve common issues:

  1. AI interprets deployment logs to explain why a step failed
  2. AI provides recommendations for fixing the issue
  3. AI identifies when the failure is a transient error and suggests retrying
  4. AI escalates only when it’s a template bug requiring platform team intervention

Bob explains: “Providing that self-service remediation like, yes, this was caused because there was a network issue. Go ahead and retry it. I retried it, it worked. Happy days. I didn’t have to bug anyone about that.”

This use of AI fits perfectly in the non-deterministic failure resolution space. Steve notes: “That’s effectively what you would be going and searching this stuff up online and trawling through it to find answers. So it can shortcut that process.”

Self-service remediation: Empowering developers

The deterministic versus non-deterministic line

When it comes to CI/CD and Platform Engineering, there’s a clear boundary between where AI can help and where AI really doesn’t belong.

Many vendors are falling over themselves with ‘AI-powered everything’ messaging, but with little substance underneath the buzzwords. Our conversation about the usefulness of AI centered on the difference between non-deterministic and deterministic tasks.

Non-deterministic tasks (AI helps):

  • Failure analysis and remediation recommendations
  • Prospecting and research
  • Getting started on scripts or configurations
  • Documentation summaries

Deterministic tasks (AI doesn’t belong):

  • Deployment execution
  • Build processes
  • Compliance attestation
  • Anything requiring audit trails

Bob emphasizes: “When it comes back to CI/CD, I’m not going to use it to generate a complete CI/CD pipeline, or have AI make the determination as to what steps to run. I want that to be deterministic. I want it to be consistent every single time. Also, it has to be deterministic if you have to have any sort of compliance like SOX compliance, Dodd-Frank, PCI, HIPAA, because you have to attest to those things.”

The realistic expectation

Bob summarizes the practical approach: “Go into it going, I think what it’s going to produce is 90% there, 80% correct, but I still need to check the other 20%. I think you’re okay with doing something like that. Use AI to speed up the non-thinking parts of your day - the repetitive, all that extra stuff - but learn how things work.”

Catch the full episode in the video below.

Full episode

Happy deployments!

Tony Kelly
