
Resilient AI agents with MCP: Timeout and retry strategies

Matthew Casperson

MCP reduces the barrier to entry for developers and organizations looking to automate workflows across multiple systems. But while it is possible to build a functional AI agent with just a few lines of code, production-grade systems need to be resilient and handle failures gracefully.

In this post we’ll explore the options available in Langchain and Python to add timeout, retry, and circuit breaker strategies to your AI agents using MCP.

Why add timeout and retry strategies?

By their very nature, MCP clients (and the AI agents that implement them) add the most value when they are orchestrating multiple platforms. However, as the number of external systems increases, so does the likelihood of failure. External systems may be unavailable, slow to respond, or return errors. Our AI agent must be resilient and handle these failures gracefully.

Distributed systems have adopted a common set of patterns to handle failures, including timeouts, retries, and circuit breakers. These patterns help to ensure that our AI agents can continue to function, or fail gracefully, even when external systems are experiencing issues.
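As a minimal, framework-independent sketch of the timeout pattern, Python's asyncio.wait_for bounds how long we wait on any coroutine. The slow_operation coroutine here is a hypothetical stand-in for a slow external system:

```python
import asyncio


async def slow_operation():
    # Stand-in for a slow external system, such as an unresponsive MCP server
    await asyncio.sleep(10)
    return "done"


async def main():
    try:
        # Give up if the operation takes longer than 100 milliseconds
        return await asyncio.wait_for(slow_operation(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"


print(asyncio.run(main()))  # → timed out
```

The rest of this post shows how Langchain exposes these same patterns for MCP tools.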

Retry strategies in Langchain

Langchain interacts with MCP servers via tools. More specifically, tools are instances of the StructuredTool class. Langchain builds instances of StructuredTool classes for you when you call the get_tools function on the MultiServerMCPClient class.

In theory, Langchain has the ability to define retry strategies for tools. Specifically, the Runnable class has a with_retry function that adds retry logic to any Runnable. However, I was unable to take the StructuredTool instances returned by get_tools and add retry logic to them via the with_retry function, and there is no built-in support for retry strategies on the tools generated by the MultiServerMCPClient class. The inability to customize the generated tools is reflected in this issue, which documents the limitation around error handling and MCP tools.

To work around this limitation, we will instead use the Gang of Four proxy pattern to create a wrapper around the StructuredTool instances returned by the get_tools function. It is inside this wrapper that we will implement our retry logic.

Fortunately we do not have to implement the proxy or retry logic from scratch. The wrapt and tenacity libraries make implementing the proxy and retry logic straightforward:

```python
import wrapt
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed


@wrapt.patch_function_wrapper("langchain_core.tools", "StructuredTool.ainvoke")
@retry(
    stop=stop_after_attempt(3),
    wait=wait_fixed(1),
    retry=retry_if_exception_type(Exception),
)
async def structuredtool_ainvoke(wrapped, instance, args, kwargs):
    print("StructuredTool.ainvoke called")
    return await wrapped(*args, **kwargs)
```

We intercept calls to the ainvoke function of the StructuredTool class by defining a function with the @wrapt.patch_function_wrapper annotation. This annotation takes two arguments: the module name and the function name.

The intercepted calls are then retried with the @retry annotation. This annotation takes several arguments to define the retry strategy. In this example, we will retry up to three times, waiting one second between each attempt, and retry on all exceptions.

Inside the function we add some logging to confirm that our proxy is being called. We then call the wrapped function, which is the original ainvoke function of the StructuredTool class.

And that is it! The wrapt library will intercept all calls to the ainvoke function of any StructuredTool object generated by Langchain, and our retry logic will be applied via the tenacity library.
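To see the interception mechanism in isolation, here is a self-contained sketch that wraps a toy Tool class (a hypothetical stand-in for StructuredTool). It uses wrap_function_wrapper, the imperative counterpart to the patch_function_wrapper decorator used above:

```python
import asyncio

import wrapt


class Tool:
    """A toy stand-in for Langchain's StructuredTool."""

    async def ainvoke(self, value):
        return value * 2


calls = []


async def logging_wrapper(wrapped, instance, args, kwargs):
    # Record the interception, then delegate to the original method
    calls.append("ainvoke")
    return await wrapped(*args, **kwargs)


# Patch the method in place, just as patch_function_wrapper does by module name
wrapt.wrap_function_wrapper(Tool, "ainvoke", logging_wrapper)

print(asyncio.run(Tool().ainvoke(21)))  # → 42
print(calls)  # → ['ainvoke']
```

Every caller of Tool.ainvoke now passes through logging_wrapper without any changes to the call sites, which is exactly the property we rely on for tools generated by Langchain.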

:::div{.hint} One thing to watch out for when using the proxy strategy is that we are wrapping an async function. Not all retry and circuit breaker libraries support async functions. You’ll need to keep this in mind if you want to use other resilience libraries. :::

Timeouts in Langchain

Timeouts can be defined through a parameter passed to the ClientSession constructor. The MultiServerMCPClient constructor exposes the session_kwargs argument whose values are passed to the ClientSession constructor.

This example demonstrates how to set a read timeout of 60 seconds for a specific MCP server:

```python
import os
from datetime import timedelta

from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient(
    {
        "octopus": {
            "command": "npx",
            "args": [
                "-y",
                "@octopusdeploy/mcp-server",
                "--api-key",
                os.getenv("PROD_OCTOPUS_APIKEY"),
                "--server-url",
                os.getenv("PROD_OCTOPUS_URL"),
            ],
            "transport": "stdio",
            "session_kwargs": {"read_timeout_seconds": timedelta(seconds=60)},
        },
        "zendesk": {
            "command": "uv",
            "args": [
                "--directory",
                "/home/matthew/Code/zendesk-mcp-server",
                "run",
                "zendesk",
            ],
            "transport": "stdio",
        },
    }
)
```

Circuit breakers in Langchain

Circuit breakers are used to prevent an application from repeatedly trying to execute an operation that is likely to fail. This prevents downstream services that are already struggling from being overwhelmed with requests.

We’ll make use of the purgatory library to implement a circuit breaker for our MCP tools.

The first step is to create an AsyncCircuitBreakerFactory instance. This instance must be long-lived and shared between requests:

```python
from purgatory import AsyncCircuitBreakerFactory

circuitbreaker = AsyncCircuitBreakerFactory(default_threshold=3)
```

Circuit breakers are only useful in long-lived applications, for example, a web server or a microservice. This is because the circuit breaker logic needs to maintain state about the number of recent failures. A short-lived application, such as a script that runs once and exits, will not benefit from a circuit breaker.

Similar to the retry logic, we’ll use the wrapt library to create a proxy around the ainvoke function of the StructuredTool class, and use the @circuitbreaker annotation to apply the circuit breaker logic:

```python
import wrapt


# circuitbreaker is the AsyncCircuitBreakerFactory instance created earlier
@wrapt.patch_function_wrapper("langchain_core.tools", "StructuredTool.ainvoke")
@circuitbreaker("StructuredTool.ainvoke")
async def structuredtool_ainvoke(wrapped, instance, args, kwargs):
    print("StructuredTool.ainvoke called")
    return await wrapped(*args, **kwargs)
```

Simulating failures

To simulate failures, we can create a proxy around the ainvoke function of the BaseTool class. The StructuredTool class inherits from the BaseTool class, and the ainvoke function of the BaseTool class is called by the ainvoke function of the StructuredTool class. This gives us a convenient place to simulate failures for all tools.

Here we randomly raise an exception two-thirds of the time to simulate a transient error:

```python
import random

import wrapt


@wrapt.patch_function_wrapper("langchain_core.tools", "BaseTool.ainvoke")
async def basetool_ainvoke(wrapped, instance, args, kwargs):
    print("BaseTool.ainvoke called")
    if random.randint(1, 3) != 3:
        print("Simulated transient error")
        raise RuntimeError("Simulated transient error")
    return await wrapped(*args, **kwargs)
```

If you have implemented a circuit breaker strategy, you should see that the MCP client stops calling the MCP server after a few failures. If you have implemented a retry strategy with enough retry attempts, you should see the prompt succeed as the retry library catches the exceptions and retries the request.

Conclusion

Production-grade AI agents need to handle failures gracefully. By implementing timeout, retry, and circuit breaker strategies, we can ensure that our AI agents are resilient and can continue to function even when external systems are experiencing issues.

Langchain has some built-in support for timeouts, but implementing retry and circuit breaker strategies requires some additional work. By using the wrapt, tenacity, and purgatory libraries, we can easily add these strategies to our MCP tools via the proxy pattern.

Happy deployments!

