
Resilient AI agents with MCP: Timeout and retry strategies

Matthew Casperson

MCP reduces the barrier to entry for developers and organizations looking to automate workflows across multiple systems. But while it is possible to build a functional AI agent with just a few lines of code, production-grade systems need to be resilient and handle failures gracefully.

In this post we’ll explore the options available in Langchain and Python to add timeout, retry, and circuit breaker strategies to your AI agents using MCP.

Why add timeout and retry strategies?

By their very nature, MCP clients (and the AI agents that implement them) add the most value when they are orchestrating multiple platforms. However, as the number of external systems increases, so does the likelihood of failure. External systems may be unavailable, slow to respond, or return errors. Our AI agent must be resilient and handle these failures gracefully.

Distributed systems have adopted a common set of patterns to handle failures, including timeouts, retries, and circuit breakers. These patterns help to ensure that our AI agents can continue to function, or fail gracefully, even when external systems are experiencing issues.
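As a minimal, framework-independent sketch of the timeout pattern, Python's asyncio.wait_for bounds how long we wait on any coroutine. The slow_operation coroutine here is a hypothetical stand-in for a slow external system:

```python
import asyncio


async def slow_operation():
    # Stand-in for a slow external system, such as an unresponsive MCP server
    await asyncio.sleep(10)
    return "done"


async def main():
    try:
        # Give up if the operation takes longer than 100 milliseconds
        return await asyncio.wait_for(slow_operation(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"


print(asyncio.run(main()))  # → timed out
```

The rest of this post shows how Langchain exposes these same patterns for MCP tools.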

Retry strategies in Langchain

Langchain interacts with MCP servers via tools. More specifically, tools are instances of the StructuredTool class. Langchain builds instances of StructuredTool classes for you when you call the get_tools function on the MultiServerMCPClient class.

In theory, Langchain has the ability to define retry strategies for tools. Specifically, the Runnable class has a with_retry function that adds retry logic to any Runnable. However, I was unable to take the StructuredTool instances returned by get_tools and add retry logic to them via the with_retry function, and there is no built-in support for retry strategies on the tools generated by the MultiServerMCPClient class. The inability to customize the generated tools is reflected in this issue, which documents the limitation around error handling and MCP tools.

To work around this limitation, we will instead use the Gang of Four proxy pattern to create a wrapper around the StructuredTool instances returned by the get_tools function. It is inside this wrapper that we will implement our retry logic.

Fortunately we do not have to implement the proxy or retry logic from scratch. The wrapt and tenacity libraries make implementing the proxy and retry logic straightforward:

```python
import wrapt
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed


@wrapt.patch_function_wrapper("langchain_core.tools", "StructuredTool.ainvoke")
@retry(
    stop=stop_after_attempt(3),
    wait=wait_fixed(1),
    retry=retry_if_exception_type(Exception),
)
async def structuredtool_ainvoke(wrapped, instance, args, kwargs):
    print("StructuredTool.ainvoke called")
    return await wrapped(*args, **kwargs)
```

We intercept calls to the ainvoke function of the StructuredTool class by defining a function with the @wrapt.patch_function_wrapper annotation. This annotation takes two arguments: the module name and the function name.

The intercepted calls are then retried with the @retry annotation. This annotation takes several arguments to define the retry strategy. In this example, we will retry up to three times, waiting one second between each attempt, and retry on all exceptions.

Inside the function we add some logging to confirm that our proxy is being called. We then call the wrapped function, which is the original ainvoke function of the StructuredTool class.

And that is it! The wrapt library will intercept all calls to the ainvoke function of any StructuredTool object generated by Langchain, and our retry logic will be applied via the tenacity library.
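To see the interception mechanism in isolation, here is a self-contained sketch that wraps a toy Tool class (a hypothetical stand-in for StructuredTool). It uses wrap_function_wrapper, the imperative counterpart to the patch_function_wrapper decorator used above:

```python
import asyncio

import wrapt


class Tool:
    """A toy stand-in for Langchain's StructuredTool."""

    async def ainvoke(self, value):
        return value * 2


calls = []


async def logging_wrapper(wrapped, instance, args, kwargs):
    # Record the interception, then delegate to the original method
    calls.append("ainvoke")
    return await wrapped(*args, **kwargs)


# Patch the method in place, just as patch_function_wrapper does by module name
wrapt.wrap_function_wrapper(Tool, "ainvoke", logging_wrapper)

print(asyncio.run(Tool().ainvoke(21)))  # → 42
print(calls)  # → ['ainvoke']
```

Every caller of Tool.ainvoke now passes through logging_wrapper without any changes to the call sites, which is exactly the property we rely on for tools generated by Langchain.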

:::div{.hint} One thing to watch out for when using the proxy strategy is that we are wrapping an async function. Not all retry and circuit breaker libraries support async functions. You’ll need to keep this in mind if you want to use other resilience libraries. :::

Timeouts in Langchain

Timeouts can be defined through a parameter passed to the ClientSession constructor. The MultiServerMCPClient constructor exposes the session_kwargs argument whose values are passed to the ClientSession constructor.

This example demonstrates how to set a read timeout of 60 seconds for a specific MCP server:

```python
import os
from datetime import timedelta

from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient(
    {
        "octopus": {
            "command": "npx",
            "args": [
                "-y",
                "@octopusdeploy/mcp-server",
                "--api-key",
                os.getenv("PROD_OCTOPUS_APIKEY"),
                "--server-url",
                os.getenv("PROD_OCTOPUS_URL"),
            ],
            "transport": "stdio",
            "session_kwargs": {"read_timeout_seconds": timedelta(seconds=60)},
        },
        "zendesk": {
            "command": "uv",
            "args": [
                "--directory",
                "/home/matthew/Code/zendesk-mcp-server",
                "run",
                "zendesk",
            ],
            "transport": "stdio",
        },
    }
)
```

Circuit breakers in Langchain

Circuit breakers are used to prevent an application from repeatedly trying to execute an operation that is likely to fail. This prevents downstream services that are already struggling from being overwhelmed with requests.

We’ll make use of the purgatory library to implement a circuit breaker for our MCP tools.

The first step is to create an AsyncCircuitBreakerFactory instance. This instance must be long-lived and shared between requests:

```python
from purgatory import AsyncCircuitBreakerFactory

circuitbreaker = AsyncCircuitBreakerFactory(default_threshold=3)
```

Circuit breakers are only useful in long-lived applications, for example, a web server or a microservice. This is because the circuit breaker logic needs to maintain state about the number of recent failures. A short-lived application, such as a script that runs once and exits, will not benefit from a circuit breaker.

Similar to the retry logic, we’ll use the wrapt library to create a proxy around the ainvoke function of the StructuredTool class, and use the @circuitbreaker annotation to apply the circuit breaker logic:

```python
import wrapt


# circuitbreaker is the AsyncCircuitBreakerFactory instance created earlier
@wrapt.patch_function_wrapper("langchain_core.tools", "StructuredTool.ainvoke")
@circuitbreaker("StructuredTool.ainvoke")
async def structuredtool_ainvoke(wrapped, instance, args, kwargs):
    print("StructuredTool.ainvoke called")
    return await wrapped(*args, **kwargs)
```

Simulating failures

To simulate failures, we can create a proxy around the ainvoke function of the BaseTool class. The StructuredTool class inherits from the BaseTool class, and the ainvoke function of the BaseTool class is called by the ainvoke function of the StructuredTool class. This gives us a convenient place to simulate failures for all tools.

Here we randomly raise an exception two-thirds of the time to simulate a transient error:

```python
import random

import wrapt


@wrapt.patch_function_wrapper("langchain_core.tools", "BaseTool.ainvoke")
async def basetool_ainvoke(wrapped, instance, args, kwargs):
    print("BaseTool.ainvoke called")
    if random.randint(1, 3) != 3:
        print("Simulated transient error")
        raise RuntimeError("Simulated transient error")
    return await wrapped(*args, **kwargs)
```

If you have implemented a circuit breaker strategy, you should see that the MCP client stops calling the MCP server after a few failures. If you have implemented a retry strategy with enough retry attempts, you should see the prompt succeed as the retry library catches the exceptions and retries the request.

Conclusion

Production-grade AI agents need to handle failures gracefully. By implementing timeout, retry, and circuit breaker strategies, we can ensure that our AI agents are resilient and can continue to function even when external systems are experiencing issues.

Langchain has some built-in support for timeouts, but implementing retry and circuit breaker strategies requires some additional work. By using the wrapt, tenacity, and purgatory libraries, we can easily add these strategies to our MCP tools via the proxy pattern.

Happy deployments!

