The promise of MCP is to expose many platforms and services to AI models, enabling complex queries and workflows to be executed with natural language prompts.
While it is tempting to believe that MCP clients can define arbitrarily complex workflows in a single prompt, in practice, the limitations of the current generation of LLMs present challenges that must be overcome. Specifically, the context window size of LLMs defines an upper limit on how much data a single MCP prompt can consume as part of a request.
In this post we’ll explore strategies for managing context window size limitations when working with AI agents and Octopus Deploy.
What is context window size?
You can think of the context window as the amount of information an LLM can process in a single request, covering both the input it is given and the output it generates.
Context windows are measured in tokens, which are chunks of text that can be as short as one character or as long as one word. For example, the word “chat” could be a single token, while the word “chatting” could be two tokens (“chatt” and “ing”). There is no exact ratio between tokens and words or characters (and the ratio varies between LLMs), but a rough approximation is that one token equals four characters of English text. The Amazon documentation for Amazon Titan models notes that:
The characters to token ratio in English is 4.7 characters per token, on average.
The context window size is dependent on the specific LLM being used. For example, OpenAI’s gpt-5 model has a context window size of 400,000 tokens (272,000 tokens for input and 128,000 tokens for output), while some variations of the gpt-4 models have a context window size of 32,000 tokens.
These sound like large numbers, but it doesn’t take long to exhaust the context window when working with large blobs of text or API results. JSON in particular consumes a lot of tokens. In this screenshot from the OpenAI Tokenizer tool, you can see that quotes and braces are each represented as individual tokens:
Prompts that exhaust the context window
Consider the following prompt:
In Octopus, get the last 10 releases deployed to the "Production" environment in the "Octopus Server" space.
Get the releases from the deployments.
In ZenDesk, get the last 100 tickets and their comments.
Create a report summarizing the issues reported by customers in the tickets.
You must only consider tickets that mention the Octopus release versions.
You must only consider support tickets raised by customers.
You must use your best judgment to identify support tickets.
The intention here is to write a report that summarizes customer issues based on the last 10 releases of an application deployed to production. It is simple enough to write this prompt, but behind the scenes, the LLM must execute multiple API calls:
- Convert the space name to a space ID
- Convert the environment name to an environment ID
- Get the last 10 deployments to the environment
- Get the details of the releases from the deployments
- Get the last 100 tickets from ZenDesk
Each of these API calls returns token-gobbling JSON results that are collected and passed to the LLM to generate the report. The JSON blobs returned by Octopus can be quite verbose, and it is not hard to see how long support tickets can exhaust the context window, especially given the tendency of email clients to include the entirety of a previous email chain in each reply.
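You can get a feel for how quickly JSON eats into the token budget by counting tokens locally. The following is a rough illustration, not part of the agent code in this post: it assumes the tiktoken package is installed, and the deployment fields shown are simplified, hypothetical stand-ins rather than an actual Octopus API response:

import json
import tiktoken

# cl100k_base is the encoding used by many recent OpenAI models.
encoding = tiktoken.get_encoding("cl100k_base")

# A simplified, hypothetical deployment record for illustration only.
deployment = {
    "Id": "Deployments-123",
    "ReleaseId": "Releases-456",
    "EnvironmentId": "Environments-789",
    "TaskId": "ServerTasks-1011",
    "Links": {"Self": "/api/Spaces-1/deployments/Deployments-123"},
}

as_json = json.dumps(deployment, indent=2)

# Compare the character count with the token count, which includes the
# quotes, braces, and commas that structured JSON is full of.
print(f"{len(as_json)} characters, {len(encoding.encode(as_json))} tokens")

Multiply that by dozens of deployments, releases, and tickets and the budget disappears quickly.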
Even if we don’t exhaust the context window size, we may still benefit from reducing the amount of data passed to the LLM, as this post from Anthropic notes:
Studies on needle-in-a-haystack style benchmarking have uncovered the concept of context rot: as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases.
When prompts like this work, they seem almost magical. But when they fail due to context window size limitations, we need to implement some advanced strategies to help manage the context window size.
Strategies for managing context window size
As we saw in the previous post, LangChain provides the ability to extract tools from MCP servers and use them in agentic workflows. We can add additional custom tools to this collection to perform operations that help manage context window size.
Consider this tool definition:
@tool
def discard_deployments(
    tool_call_id: Annotated[str, InjectedToolCallId],
    state: Annotated[dict, InjectedState],
) -> Command:
    """Discards the list of deployments."""

    def trim_release(release):
        if isinstance(release, ToolMessage) and release.name == "list_deployments":
            release.name = "trimmed_list_deployments"
            release.content = ""
        return release

    trim_messages = [trim_release(msg) for msg in state["messages"]]

    return Command(
        update={
            "messages": [
                RemoveMessage(id=REMOVE_ALL_MESSAGES),
                *trim_messages,
                ToolMessage(
                    "Discarded list of deployments", tool_call_id=tool_call_id
                ),
            ],
        }
    )
This tool takes advantage of a number of advanced features of LangChain:
- The InjectedToolCallId annotation to inject the unique ID of the tool call
- The InjectedState annotation to inject the current state of the agent
- Returning a Command object to update the state of the agent
- Using the RemoveMessage class to remove all messages from the agent’s state
Let’s go through this function line by line.
We start by defining a tool. A tool is simply a function decorated with the @tool decorator. The function docstring is used to describe the tool to the LLM, and is how the LLM knows when to call the tool based on the plain text instructions in the prompt.
This tool has two parameters: tool_call_id and state.
The tool_call_id parameter is annotated with the InjectedToolCallId annotation, which tells LangChain to inject the unique ID of the tool call into this parameter.
The state parameter is annotated with the InjectedState annotation, which tells LangChain to inject the current state of the agent into this parameter. It is this state that we want to modify:
@tool
def discard_deployments(
    tool_call_id: Annotated[str, InjectedToolCallId],
    state: Annotated[dict, InjectedState],
) -> Command:
    """Discards the list of deployments."""
A nested function called trim_release is defined to process each message in the agent’s state. If the message is a ToolMessage with the name list_deployments (this is the name of the tool exposed by the Octopus MCP server), it changes the name to trimmed_list_deployments and clears the content. This effectively removes the verbose JSON content from the message while retaining a record that the tool was called:
def trim_release(release):
    if isinstance(release, ToolMessage) and release.name == "list_deployments":
        release.name = "trimmed_list_deployments"
        release.content = ""
    return release
We then use a list comprehension to apply the trim_release function to each message in the agent’s state, producing a new list of messages with the deployment data trimmed:
trim_messages = [trim_release(msg) for msg in state["messages"]]
We then return a Command object, which is how a tool updates the agent’s state. It is the messages in the state that are placed in the context window, so by modifying these messages, we can manage the context window size.
By default, messages returned from a tool are appended to the existing messages in the state. However, in this case, we want to remove all existing messages and replace them with our trimmed messages. We do this by including a RemoveMessage object with the special ID REMOVE_ALL_MESSAGES, which tells LangChain to remove all existing messages from the state.
Finally, we include our trimmed messages and a new ToolMessage indicating that the deployments have been discarded. This message includes the tool_call_id so that it can be traced back to the specific tool call:
return Command(
    update={
        "messages": [
            RemoveMessage(id=REMOVE_ALL_MESSAGES),
            *trim_messages,
            ToolMessage(
                "Discarded list of deployments", tool_call_id=tool_call_id
            ),
        ],
    }
)
Our custom tool is added to the collection of tools exported from the MCP servers:
tools = await client.get_tools()
tools.append(discard_deployments)
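The combined list of MCP and custom tools is then handed to the agent in the usual way. Here is a brief sketch, where llm is assumed to be whichever LangChain chat model you have configured (the full source below uses an Azure AI model):

from langgraph.prebuilt import create_react_agent

# The custom tool sits alongside the tools exposed by the MCP servers.
# The LLM decides to call it by matching the prompt instruction against
# the tool's docstring.
agent = create_react_agent(llm, tools)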
And we can call the new tool from our prompt with the instruction Discard the list of deployments:
In Octopus, get the last 10 releases deployed to the "Production" environment in the "Octopus Server" space.
Get the releases from the deployments.
Discard the list of deployments.
In ZenDesk, get the last 100 tickets and their comments.
Create a report summarizing the issues reported by customers in the tickets.
You must only consider tickets that mention the Octopus release versions.
You must only consider support tickets raised by customers.
You must use your best judgment to identify support tickets.
Now, once the LLM has called the Octopus MCP server to get the list of deployments and retrieved the releases from those deployments, it calls our custom tool to discard the deployments JSON blob from the state, which means those messages are no longer passed to the LLM as part of the context window. The deployments were never needed for the final report, so we have reduced the amount of data passed to the LLM without losing any important information.
There are a number of other opportunities to reduce the size of the messages passed to the LLM. The JSON blobs related to releases can be replaced by the release versions, and the ZenDesk tickets can be trimmed.
Full source code
This is the complete source code, including the additional custom tools used to trim the release details to just the version (trim_releases_to_version) and trim the ticket descriptions to 1000 characters (trim_ticket_descriptions), and the additional instructions in the prompt to call these tools:
import asyncio
import json
import os
import re
from typing import Annotated
from langchain_core.messages import RemoveMessage, ToolMessage, trim_messages
from langchain_core.tools import tool, InjectedToolCallId
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain_azure_ai.chat_models import AzureAIChatCompletionsModel
from langgraph.graph.message import REMOVE_ALL_MESSAGES
from langgraph.prebuilt import create_react_agent, InjectedState
from langgraph.types import Command
def remove_line_padding(text):
    """
    Remove leading and trailing whitespace from each line in the text.
    :param text: The text to process.
    :return: The text with leading and trailing whitespace removed from each line.
    """
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def remove_thinking(text):
    """
    Remove <think>...</think> tags and their content from the text.
    :param text: The text to process.
    :return: The text with <think>...</think> tags and their content removed.
    """
    stripped_text = text.strip()
    if stripped_text.startswith("<think>") and "</think>" in stripped_text:
        return re.sub(r"<think>.*?</think>", "", stripped_text, flags=re.DOTALL)
    return stripped_text


def response_to_text(response):
    """
    Extract the content from the last message in the response.
    :param response: The response dictionary containing messages.
    :return: The content of the last message, or an empty string if no messages are present.
    """
    messages = response.get("messages", [])
    if not messages or len(messages) == 0:
        return ""
    return messages.pop().content
@tool
def trim_ticket_descriptions(
    tool_call_id: Annotated[str, InjectedToolCallId],
    state: Annotated[dict, InjectedState],
) -> Command:
    """Trims the description of the ZenDesk tickets."""

    def trim_description(ticket):
        ticket["description"] = (
            ticket["description"][:1000] + "..."
            if len(ticket["description"]) > 1000
            else ticket["description"]
        )
        return ticket

    def trim_description_list(message):
        if isinstance(message, ToolMessage) and message.name == "get_tickets":
            ticket_data = json.loads(message.content)
            trimmed_ticket_data = [
                trim_description(ticket) for ticket in ticket_data["tickets"]
            ]
            message.content = json.dumps(trimmed_ticket_data)
        return message

    trim_messages = [trim_description_list(msg) for msg in state["messages"]]

    return Command(
        update={
            "messages": [
                RemoveMessage(id=REMOVE_ALL_MESSAGES),
                *trim_messages,
                ToolMessage("Trimmed ticket descriptions", tool_call_id=tool_call_id),
            ],
        }
    )
@tool
def discard_deployments(
    tool_call_id: Annotated[str, InjectedToolCallId],
    state: Annotated[dict, InjectedState],
) -> Command:
    """Discards the list of deployments."""

    def trim_release(release):
        if isinstance(release, ToolMessage) and release.name == "list_deployments":
            release.name = "trimmed_list_deployments"
            release.content = ""
        return release

    trim_messages = [trim_release(msg) for msg in state["messages"]]

    return Command(
        update={
            "messages": [
                RemoveMessage(id=REMOVE_ALL_MESSAGES),
                *trim_messages,
                ToolMessage("Discarded list of deployments", tool_call_id=tool_call_id),
            ],
        }
    )


@tool
def trim_releases_to_version(
    tool_call_id: Annotated[str, InjectedToolCallId],
    state: Annotated[dict, InjectedState],
) -> Command:
    """Trims the details of Octopus releases to their version."""

    def trim_release(release):
        if isinstance(release, ToolMessage) and release.name == "get_release_by_id":
            release_data = json.loads(release.content)
            release.name = "trimmed_release"
            release.content = release_data.get("version", "Unknown Version")
        return release

    trim_messages = [trim_release(msg) for msg in state["messages"]]

    return Command(
        update={
            "messages": [
                RemoveMessage(id=REMOVE_ALL_MESSAGES),
                *trim_messages,
                ToolMessage("Trimmed releases to version", tool_call_id=tool_call_id),
            ],
        }
    )
async def main():
    """
    The entrypoint to our AI agent.
    """
    client = MultiServerMCPClient(
        {
            "octopus": {
                "command": "npx",
                "args": [
                    "-y",
                    "@octopusdeploy/mcp-server",
                    "--api-key",
                    os.getenv("PROD_OCTOPUS_APIKEY"),
                    "--server-url",
                    os.getenv("PROD_OCTOPUS_URL"),
                ],
                "transport": "stdio",
            },
            "zendesk": {
                "command": "uv",
                "args": [
                    "--directory",
                    "/home/matthew/Code/zendesk-mcp-server",
                    "run",
                    "zendesk",
                ],
                "transport": "stdio",
            },
        }
    )

    # Use an Azure AI model
    llm = AzureAIChatCompletionsModel(
        endpoint=os.getenv("AZURE_AI_URL"),
        credential=os.getenv("AZURE_AI_APIKEY"),
        model="gpt-5-mini",
    )

    tools = await client.get_tools()
    tools.append(discard_deployments)
    tools.append(trim_releases_to_version)
    tools.append(trim_ticket_descriptions)

    agent = create_react_agent(llm, tools)

    response = await agent.ainvoke(
        {
            "messages": remove_line_padding(
                """
                In Octopus, get the last 10 releases deployed to the "Production" environment in the "Octopus Server" space.
                Get the releases from the deployments.
                Trim the details of Octopus releases to their version.
                Discard the list of deployments.
                In ZenDesk, get the last 100 tickets and their comments.
                Trim the description of the ZenDesk tickets.
                Create a report summarizing the issues reported by customers in the tickets.
                You must only consider tickets that mention the Octopus release versions.
                You must only consider support tickets raised by customers.
                You must use your best judgment to identify support tickets.
                """
            )
        }
    )

    print(remove_thinking(response_to_text(response)))


asyncio.run(main())
Alternative strategies
LangGraph’s create_react_agent also exposes pre-model and post-model hooks that allow you to manipulate the state of the AI agent at various points while processing a request. The pre-model hook in particular is designed to support message trimming and summarization as a way to manage context window size.
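As a rough sketch of this approach, the example below assumes a recent LangGraph release where create_react_agent accepts a pre_model_hook argument, and reuses the llm and tools objects from the listing above. It trims the message history before every model call using the trim_messages helper from langchain_core:

from langchain_core.messages import trim_messages
from langgraph.prebuilt import create_react_agent


def keep_recent_messages(state):
    # Trim the conversation before each LLM call. Using len as the token
    # counter treats every message as one unit, so max_tokens=30 simply keeps
    # the 30 most recent messages; swap in a model-aware counter for a real
    # token budget.
    trimmed = trim_messages(
        state["messages"],
        strategy="last",
        token_counter=len,
        max_tokens=30,
        start_on="human",
        include_system=True,
    )
    # Returning llm_input_messages changes what the model sees for this call
    # without rewriting the agent's stored state.
    return {"llm_input_messages": trimmed}


agent = create_react_agent(llm, tools, pre_model_hook=keep_recent_messages)

Unlike the custom tools shown earlier, this trimming is applied automatically on every model call rather than being triggered by an instruction in the prompt.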
Conclusion
The ability of MCP to define complex, multi-system workflows in natural language is almost magical. By hiding the complexity of API calls behind simple prompts, MCP empowers users to automate tasks that would otherwise require significant custom code.
However, as you work with larger datasets and more complex workflows, you will run into the limits of the LLM context window, and at that point you will need strategies to manage it.
Fortunately, LangChain exposes a number of advanced features that provide a deep level of control over the state of the agent, which in turn allows you to manage the context window size effectively.
This post provided examples of custom tools that manipulate the agent’s state to trim or discard unnecessary data, letting you work with data more efficiently and allowing your prompts to scale across more systems and larger datasets.
Happy deployments!