
Deploying LLMs with Octopus

Matthew Casperson

Large Language Models (LLMs) are becoming an increasingly common component in modern applications, powering everything from chatbots to code generation tools. However, DevOps teams must address several challenges to ensure LLMs are deployed reliably and efficiently. In this post, we’ll explore some key considerations for deploying LLMs, how Docker helps address these challenges, and how Octopus empowers DevOps teams to deploy LLMs with confidence.

Challenges of deploying LLMs

In the video Prebuilt Docker Images for Inference in Azure Machine Learning, Shivani Sambare noted that:

We have interacted with many of our customers and data scientists and what we know [is] once you have trained a machine learning model getting that model to production is key, especially having the right environments.

Environments encapsulates all your python packages and software setting around your scoring script and hence having the right dependency when you deploy your machine learning model is very important.

Prebuilt Docker images provide a solution to these requirements by offering a consistent environment, resulting in images that can be run locally, deployed to Azure Container Instances, or run in Kubernetes.

AWS uses Docker to execute and deploy LLMs via their Deep Learning Containers. Emily Webber highlighted the benefits these images provide in her talk AWS Deep Learning Containers Deep Dive, including:

  • Avoiding time-consuming builds of base images
  • The ability to run containers anywhere
  • The standardization of your machine learning development, training, tuning, and deployment environments
  • Taking advantage of the latest optimizations and performance improvements

NVIDIA supplies Docker images for hosting LLMs with NVIDIA Inference Microservices (NIM), noting that these containers provide:

  • No external dependencies: You’re in control of the model and its execution environment
  • Data privacy: Your sensitive data never leaves your infrastructure
  • Validated models: You get these models as intended by their authors
  • Optimized runtimes: Accelerated, optimized, and trusted container runtimes

The Voice of Kubernetes Experts Report 2024 adds further weight, finding that 54% of companies are running AI/ML workloads on Kubernetes.

These industry examples demonstrate how Docker images provide a solid foundation for deploying LLMs, ensuring that the environment is consistent and reproducible across different development and deployment stages.

But building a Docker image is just the first step. DevOps teams are then responsible for deploying and maintaining these images and their containers in production environments. This is where Octopus Deploy comes in.

Repeatable deployments

To understand why repeatable deployments are important, consider the nim-deploy repository from NVIDIA, which:

Is designed to provide reference architectures and best practices for production-grade deployments and product integrations

The repo contains instructions and sample configuration files to deploy NIMs to multiple platforms. The instructions involve dozens of steps requiring multiple CLI tools, environment variables, configuration files, and credentials.

It is not feasible to expect every DevOps engineer to remember all of these steps, and even if they could, deployments would be error-prone. This is where Octopus shines. By creating repeatable deployment processes, Octopus allows LLMs to be deployed with the click of a button or automatically in response to an external trigger.

Google’s AI and ML perspective: Operational excellence documentation notes that:

Automation enables seamless, repeatable, and error-free model development and deployment.

With Octopus, everything required to deploy an LLM image is defined in a repeatable deployment process that references Kubernetes clusters defined as deployment targets, uses credentials stored as accounts, and consumes images from a container registry via feeds. The deployment process is then repeated across environments, ensuring internal environments such as Development or QA provide the same experience as the Production environment.

This ensures not only that the LLMs run with a consistent set of libraries and dependencies (which is how Azure defines an environment for its prebuilt Docker images), but also that the resulting Kubernetes deployments are consistent across the infrastructure used for internal development, testing, and production (which is how Octopus defines an environment).
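
To make this concrete, the sketch below shows how a deployment process like this could be triggered programmatically via the Octopus REST API, creating a release and then deploying it to each environment in turn. This is a minimal sketch: the server URL, space, project, and environment IDs are placeholders, and in practice you would resolve these IDs by name through the API or use the Octopus CLI.

```python
# Minimal sketch: trigger the same Octopus deployment process across
# environments via the Octopus REST API. The URL, space, and IDs are
# illustrative placeholders.
import os
import requests

OCTOPUS_URL = "https://octopus.example.com"              # assumption: your Octopus Server URL
SPACE_ID = "Spaces-1"                                    # assumption: default space
PROJECT_ID = "Projects-42"                               # assumption: the LLM project's ID
RELEASE_VERSION = "1.0.3"                                # the release version to create
ENVIRONMENT_IDS = ["Environments-1", "Environments-2"]   # e.g. Development, then QA

headers = {"X-Octopus-ApiKey": os.environ["OCTOPUS_API_KEY"]}

# Create a release, which snapshots the deployment process and package versions
release = requests.post(
    f"{OCTOPUS_URL}/api/{SPACE_ID}/releases",
    headers=headers,
    json={"ProjectId": PROJECT_ID, "Version": RELEASE_VERSION},
).json()

# Deploy the same release to each environment in turn
for env_id in ENVIRONMENT_IDS:
    deployment = requests.post(
        f"{OCTOPUS_URL}/api/{SPACE_ID}/deployments",
        headers=headers,
        json={"ReleaseId": release["Id"], "EnvironmentId": env_id},
    ).json()
    print(f"Deployment {deployment['Id']} queued for {env_id}")
```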

A screenshot of the deployment process

Visibility and monitoring

What is the state of the Production environment?

This is a critical question for DevOps teams, but one that can be difficult to answer. When deployments are done manually, the only way to know what version of your LLM is providing responses to your customers is to manually inspect the Kubernetes cluster or ask the individual who performed the deployment.
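
As an illustration of what that manual inspection involves, here is a minimal sketch using the official Kubernetes Python client to list every deployment and the container images it is currently running; the llm namespace is an illustrative name, and it assumes a local kubeconfig with access to the cluster.

```python
# Minimal sketch: manually inspecting a cluster to find out which LLM image
# is actually serving traffic. Assumes a local kubeconfig and that the LLM
# workloads live in an "llm" namespace (an illustrative name).
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for dep in apps.list_namespaced_deployment(namespace="llm").items:
    images = [c.image for c in dep.spec.template.spec.containers]
    ready = dep.status.ready_replicas or 0
    print(f"{dep.metadata.name}: images={images} ready={ready}/{dep.spec.replicas}")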

Octopus makes it trivial to understand the state of production by providing a centralized dashboard showing the current version of each LLM deployed to each environment.

A screenshot of the dashboard showing deployments

The Kubernetes Live Object Status (KLOS) feature goes a step further, displaying the current state of the resources in the cluster. KLOS gives DevOps teams confidence that their LLMs are running as expected, and if there are any issues, they can quickly identify which resources are affected.

A screenshot of the dashboard showing the Kubernetes live object status

Testing and validation

One of the challenges with LLMs is that their behavior and performance are often subjective. Safety and QA teams may need to validate LLMs by running several sample prompts and reviewing the responses. This kind of testing requires an LLM to be deployed and available in a test environment.
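
As a rough illustration, the sketch below sends a handful of sample prompts to a model deployed in a test environment and collects the responses for human review. It assumes the deployment exposes an OpenAI-compatible /v1/chat/completions endpoint, as NIM containers do; the endpoint URL, model name, and prompts are placeholders.

```python
# Minimal sketch: send sample prompts to an LLM deployed in a test environment
# so reviewers can evaluate the responses. Assumes an OpenAI-compatible
# /v1/chat/completions endpoint; the URL and model name are placeholders.
import json
import requests

ENDPOINT = "http://llm.qa.internal:8000/v1/chat/completions"  # assumption
MODEL = "meta/llama-3.1-8b-instruct"                          # assumption

sample_prompts = [
    "Summarize our refund policy in two sentences.",
    "Write a polite response to a customer reporting a late delivery.",
]

results = []
for prompt in sample_prompts:
    response = requests.post(
        ENDPOINT,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # keep responses repeatable for review
        },
        timeout=60,
    )
    response.raise_for_status()
    answer = response.json()["choices"][0]["message"]["content"]
    results.append({"prompt": prompt, "response": answer})

# Write the prompt/response pairs out for the safety or QA team to review
print(json.dumps(results, indent=2))
```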

It is also essential to understand how the LLM interacts with other system components, such as databases, APIs, and other services. This kind of integration testing is best performed in a private environment that closely resembles production.

Google calls this out with “Test, Test, Test” as part of their responsible AI practices, saying developers should:

Conduct integration tests to understand how individual ML components interact with other parts of the overall system.

Octopus supports LLM testing by ensuring teams can deploy to private production-like environments, such as a staging or QA environment, where the LLM can be tested in conjunction with other system components.

Auditing and compliance

AI has been embraced by many organizations subject to regulatory requirements. From banks using GenAI to improve their architectures to healthcare providers integrating LLMs into their call centers, GenAI and LLMs are being embedded into critical systems of some of the most important industries in the world.

It is, therefore, critical that organizations can track the changes to their production LLMs, including when they were deployed, what version was deployed, and who performed the deployment. CSIRO calls out “supply chain accountability” as a core part of their responsible AI research.

Manual deployments make this kind of auditing all but impossible. Even if it were possible to reverse engineer the changes made to a production system, it would be a time-consuming and error-prone process.

Octopus provides auditing out-of-the-box, allowing teams to see the history of deployments, including who performed the deployment and when. Auditing is treated as a cross-cutting concern applied to every action taken in Octopus, giving teams confidence that they can track changes to their LLMs and ensure compliance with regulatory requirements.
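
As a rough sketch of what this looks like programmatically, the script below pulls recent audit events from the Octopus events API and prints the deployment-related ones. The server URL and space ID are placeholders, and the filtering shown is deliberately simple; the audit screens in the Octopus web portal offer far richer filtering.

```python
# Minimal sketch: pull recent audit events from Octopus and keep the
# deployment-related ones. The field names follow the Octopus event resource;
# the URL and space ID are illustrative placeholders.
import os
import requests

OCTOPUS_URL = "https://octopus.example.com"  # assumption
SPACE_ID = "Spaces-1"                        # assumption
headers = {"X-Octopus-ApiKey": os.environ["OCTOPUS_API_KEY"]}

events = requests.get(
    f"{OCTOPUS_URL}/api/{SPACE_ID}/events", headers=headers
).json()["Items"]

for event in events:
    if "Deployment" in event["Category"]:
        print(f'{event["Occurred"]} {event["Username"]}: {event["Message"]}')
```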

A screenshot of the audit log showing deployments

Incremental deployments, rollbacks, and recovery

Even with the best planning and testing, things can go wrong when deploying LLMs. A new version of an LLM may not perform as expected, or it may introduce regressions that affect the system’s performance.

The post Achieve operational excellence with well-architected generative AI solutions using Amazon Bedrock notes that:

Automated deployment techniques together with smaller, incremental changes reduces the blast radius and allows for faster reversal when failures occur.

There are several deployment strategies that can be used to mitigate these risks, including:

  • Blue/green deployments, which involve deploying a new version of an LLM alongside the existing version and then switching traffic to the new version once it has been validated (a minimal sketch of this switch follows the list).
  • Canary releases, which involve deploying a new version of an LLM to a small subset of users and then gradually increasing the number of users as the new version is validated.
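
One common way to implement the blue/green switch on Kubernetes is to repoint a Service selector from the old version to the new one once it has been validated. The sketch below shows this using the Kubernetes Python client, assuming the two versions are distinguished by a track label and that the Service and namespace are both named llm (illustrative names).

```python
# Minimal sketch: switch traffic from the "blue" to the "green" LLM deployment
# by repointing the Service selector. Assumes a Service named "llm" in an "llm"
# namespace and Deployments labelled with a "track" label (illustrative names).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Patch only the selector; the existing pods keep running, so rolling back is
# just a matter of patching the selector back to "blue".
core.patch_namespaced_service(
    name="llm",
    namespace="llm",
    body={"spec": {"selector": {"app": "llm", "track": "green"}}},
)
print("Service 'llm' now routes traffic to the green deployment")
```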

In the event of a failed deployment or a regression, it is important to quickly roll back to a previous version of the LLM. By capturing all the values used to deploy the LLM as a release, Octopus makes it easy to redeploy a previous version of the LLM.

Managing infrastructure and supporting services

LLMs are just one part of a larger system that includes API gateways for traffic management and security, file storage for model files, identity providers to manage authentication, and, of course, the Kubernetes clusters that run the LLMs. All these components require the same repeatable deployments, visibility, monitoring, auditing, testing, and recovery capabilities that are so important for LLMs.

Google’s AI and ML perspective: Operational excellence documentation advises teams to:

Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.

Octopus provides a unified platform for managing all of these components, supporting Infrastructure as Code (IaC) tools such as Terraform, ARM templates, and CloudFormation to deploy and manage the underlying infrastructure.

Orchestrating and approving deployments

The decision to deploy an LLM to production is often not made by the DevOps team alone. It may require approval from stakeholders such as product owners, security teams, or compliance officers. This requires approval workflows that ensure the right people are involved in the decision-making process.

Microsoft’s Responsible AI principles and approach calls out accountability as a key requirement, asking:

How can we create oversight so that humans can be accountable and in control?

Octopus provides manual intervention steps that pause a deployment until it is approved by the appropriate stakeholders. And for teams that use ITSM tools such as ServiceNow or Jira Service Management, Octopus can block deployments until a ticket or change request is approved as part of a larger change management process.

A screenshot of the Octopus manual intervention step

Day 2 operations and maintenance

Getting an LLM into production is just the beginning. Once deployed, ongoing maintenance and support are required to ensure it performs as expected. This can include tasks such as:

  • Restarting services if they become unresponsive
  • Applying security patches to the underlying infrastructure
  • Collecting and analyzing logs to identify performance issues
  • Backup and recovery in the event of a failure

Octopus Runbooks let teams define and execute these maintenance tasks in a consistent and repeatable manner. Runbooks can be configured to run the same steps available in the deployment process, access all the same credentials, and interact with the same infrastructure. Runbooks can then be run in any environment to perform maintenance and ad-hoc tasks. This ensures that the same steps are followed every time, reducing the risk of human error and ensuring that the LLM and any supporting infrastructure remain healthy.
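
For example, the service-restart task listed above could be implemented as a runbook script step. The sketch below performs a rolling restart of an LLM deployment (the same effect as kubectl rollout restart) using the Kubernetes Python client; the deployment and namespace names are illustrative assumptions.

```python
# Minimal sketch: a runbook-style script that performs a rolling restart of an
# unresponsive LLM deployment, equivalent to `kubectl rollout restart`.
# The deployment and namespace names are illustrative.
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Updating this annotation changes the pod template, which triggers a rolling
# restart without downtime for a multi-replica deployment.
apps.patch_namespaced_deployment(
    name="llm-inference",
    namespace="llm",
    body={
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": datetime.now(
                            timezone.utc
                        ).isoformat()
                    }
                }
            }
        }
    },
)
print("Rolling restart of llm-inference triggered")
```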

A screenshot of Octopus runbooks

Conclusion

Docker has emerged as a key enabler for deploying LLMs, providing a consistent and reproducible environment for running these complex models. However, creating a Docker image is just the first step. DevOps teams must also consider how to deploy, monitor, and maintain these images in production environments.

Octopus provides a comprehensive platform for deploying LLMs, addressing the challenges of repeatable deployments, visibility and monitoring, testing and validation, auditing and compliance, incremental deployments, and day 2 operations. This allows teams to automate and scale the entire lifecycle of LLMs, ensuring that they can deliver reliable and performant AI solutions to their customers.
