MCP reduces the barrier to entry for developers and organizations looking to automate workflows across multiple systems. But while it is possible to build a functional AI agent with just a few lines of code, production-grade systems need to be resilient and handle failures gracefully.
In this post we’ll explore the options available in Langchain and Python to add timeout, retry, and circuit breaker strategies to your AI agents using MCP.
## Why add timeout and retry strategies?
By their very nature, MCP clients (and the AI agents that implement them) add the most value when they are orchestrating multiple platforms. However, as the number of external systems increases, so does the likelihood of failure. External systems may be unavailable, slow to respond, or return errors. Our AI agent must be resilient and handle these failures gracefully.
Distributed systems have adopted a common set of patterns to handle failures, including timeouts, retries, and circuit breakers. These patterns help to ensure that our AI agents can continue to function, or fail gracefully, even when external systems are experiencing issues.
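To make these patterns concrete before wiring them into Langchain, here is a minimal plain-asyncio sketch of timeouts and retries. The `make_flaky` helper is a hypothetical stand-in for an unreliable external system, not part of any library:

```python
import asyncio

# make_flaky is a hypothetical stand-in for an unreliable external system:
# it fails a fixed number of times before succeeding.
def make_flaky(failures: int):
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] <= failures:
            raise RuntimeError("transient failure")
        return "ok"

    return flaky

async def call_with_retry(fn, attempts: int, timeout: float):
    last_exc = None
    for _ in range(attempts):
        try:
            # Timeout: abandon any single attempt that runs too long.
            return await asyncio.wait_for(fn(), timeout=timeout)
        except Exception as exc:
            # Retry: record the failure and try again.
            last_exc = exc
    raise last_exc

# Fails twice, then succeeds on the third attempt.
print(asyncio.run(call_with_retry(make_flaky(2), attempts=3, timeout=1.0)))
```

Real resilience libraries add backoff, jitter, and exception filtering on top of this basic loop, which is why we reach for them below rather than hand-rolling the logic.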
## Retry strategies in Langchain
Langchain interacts with MCP servers via tools. More specifically, tools are instances of the `StructuredTool` class. Langchain builds `StructuredTool` instances for you when you call the `get_tools` function on the `MultiServerMCPClient` class.
In theory, Langchain can add retry strategies to tools. Specifically, the `Runnable` class has a `with_retry` function that adds retry logic to any `Runnable`. However, I was unable to take the `StructuredTool` instances returned by `get_tools` and add retry logic to them via `with_retry`, and there is no built-in support for adding retry strategies to the tools generated by the `MultiServerMCPClient` class. The inability to customize the generated tools is reflected in this issue, which documents the limitation around error handling and MCP tools.
To work around this limitation, we will instead use the Gang of Four proxy pattern to create a wrapper around the `StructuredTool` instances returned by the `get_tools` function. It is inside this wrapper that we will implement our retry logic.
Fortunately, we do not have to implement the proxy or retry logic from scratch. The `wrapt` and `tenacity` libraries make both straightforward:
```python
import wrapt
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

@wrapt.patch_function_wrapper("langchain_core.tools", "StructuredTool.ainvoke")
@retry(
    stop=stop_after_attempt(3),
    wait=wait_fixed(1),
    retry=retry_if_exception_type(Exception),
)
async def structuredtool_ainvoke(wrapped, instance, args, kwargs):
    print("StructuredTool.ainvoke called")
    return await wrapped(*args, **kwargs)
```
We intercept calls to the `ainvoke` function of the `StructuredTool` class by defining a function with the `@wrapt.patch_function_wrapper` decorator. This decorator takes two arguments: the module name and the function name.

The intercepted calls are then retried with the `@retry` decorator, which takes several arguments that define the retry strategy. In this example, we retry up to three times, wait one second between attempts, and retry on all exceptions.

Inside the function we add some logging to confirm that our proxy is being called. We then call the wrapped function, which is the original `ainvoke` function of the `StructuredTool` class.
And that is it! The `wrapt` library will intercept all calls to the `ainvoke` function of any `StructuredTool` object generated by Langchain, and our retry logic will be applied via the `tenacity` library.
:::div{.hint}
One thing to watch out for with the proxy strategy is that we are wrapping an async function. Not all retry and circuit breaker libraries support async functions. You'll need to keep this in mind if you want to use other resilience libraries.
:::
## Timeouts in Langchain
Timeouts can be defined through a parameter passed to the `ClientSession` constructor. The `MultiServerMCPClient` constructor exposes a `session_kwargs` argument whose values are passed through to the `ClientSession` constructor.
This example demonstrates how to set a read timeout of 60 seconds for a specific MCP server:
```python
import os
from datetime import timedelta

from langchain_mcp_adapters.client import MultiServerMCPClient

client = MultiServerMCPClient(
    {
        "octopus": {
            "command": "npx",
            "args": [
                "-y",
                "@octopusdeploy/mcp-server",
                "--api-key",
                os.getenv("PROD_OCTOPUS_APIKEY"),
                "--server-url",
                os.getenv("PROD_OCTOPUS_URL"),
            ],
            "transport": "stdio",
            # Fail any read that takes longer than 60 seconds
            "session_kwargs": {"read_timeout_seconds": timedelta(seconds=60)},
        },
        "zendesk": {
            "command": "uv",
            "args": [
                "--directory",
                "/home/matthew/Code/zendesk-mcp-server",
                "run",
                "zendesk",
            ],
            "transport": "stdio",
        },
    }
)
```
## Circuit breakers in Langchain
Circuit breakers are used to prevent an application from repeatedly trying to execute an operation that is likely to fail. This prevents downstream services that are already struggling from being overwhelmed with requests.
We’ll make use of the `purgatory` library to implement a circuit breaker for our MCP tools.

The first step is to create an `AsyncCircuitBreakerFactory` instance. This instance must be long-lived and shared between requests:

```python
from purgatory import AsyncCircuitBreakerFactory

circuitbreaker = AsyncCircuitBreakerFactory(default_threshold=3)
```
Circuit breakers are only useful in long-lived applications, for example, a web server or a microservice. This is because the circuit breaker logic needs to maintain state about the number of recent failures. A short-lived application, such as a script that runs once and exits, will not benefit from a circuit breaker.
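To see why that state matters, here is a deliberately simplified, synchronous circuit breaker. This is an illustration only, not how `purgatory` is implemented:

```python
import time

# A simplified circuit breaker: after `threshold` consecutive failures it
# "opens" and fails fast, then allows a probe call through once
# `reset_after` seconds have elapsed (the "half-open" state).
class SimpleBreaker:
    def __init__(self, threshold: int, reset_after: float):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the downstream system.
                raise RuntimeError("circuit open")
            # Half-open: allow one probe request through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Because the failure count and the "opened at" timestamp live on the instance, a breaker that is recreated on every request would never accumulate enough failures to open. That is why the `AsyncCircuitBreakerFactory` instance must be shared across requests.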
Similar to the retry logic, we’ll use the `wrapt` library to create a proxy around the `ainvoke` function of the `StructuredTool` class, and use the `@circuitbreaker` decorator to apply the circuit breaker logic:
```python
@wrapt.patch_function_wrapper("langchain_core.tools", "StructuredTool.ainvoke")
@circuitbreaker("StructuredTool.ainvoke")
async def structuredtool_ainvoke(wrapped, instance, args, kwargs):
    print("StructuredTool.ainvoke called")
    return await wrapped(*args, **kwargs)
```
## Simulating failures
To simulate failures, we can create a proxy around the `ainvoke` function of the `BaseTool` class. The `StructuredTool` class inherits from the `BaseTool` class, and the `ainvoke` function of the `BaseTool` class is called by the `ainvoke` function of the `StructuredTool` class. This gives us a convenient place to simulate failures for all tools.
Here we randomly raise an exception two-thirds of the time to simulate a transient error:
```python
import random

import wrapt

@wrapt.patch_function_wrapper("langchain_core.tools", "BaseTool.ainvoke")
async def basetool_ainvoke(wrapped, instance, args, kwargs):
    print("BaseTool.ainvoke called")
    # Fail two out of every three calls on average
    if random.randint(1, 3) != 3:
        print("Simulated transient error")
        raise RuntimeError("Simulated transient error")
    return await wrapped(*args, **kwargs)
```
If you have implemented a circuit breaker strategy, you should see that the MCP client eventually stops calling the MCP server after a few failures. If you have implemented a retry strategy with a high level of retries, you should see the prompt succeed as the retry library intercepts the exceptions and retries the request.
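As a rough sanity check on that claim (assuming independent attempts and the three-attempt retry strategy shown earlier):

```python
# Each simulated call fails with probability 2/3, so a prompt only fails
# outright when all three retry attempts fail.
p_fail = 2 / 3
p_all_fail = p_fail ** 3          # 8/27, about 0.30
p_success = 1 - p_all_fail        # 19/27, about 0.70
print(f"success probability with 3 attempts: {p_success:.0%}")
```

So even with a two-thirds failure rate per call, roughly seven in ten prompts should succeed under the three-attempt retry strategy.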
## Conclusion
Production-grade AI agents need to handle failures gracefully. By implementing timeout, retry, and circuit breaker strategies, we can ensure that our AI agents are resilient and can continue to function even when external systems are experiencing issues.
Langchain has some built-in support for timeouts, but implementing retry and circuit breaker strategies requires some additional work. By using the `wrapt`, `tenacity`, and `purgatory` libraries, we can easily add these strategies to our MCP tools via the proxy pattern.
Happy deployments!