The Hidden Cost of MCPs and Custom Instructions on Your Context Window

Large context windows sound limitless—200K, 400K, even a million tokens. But once you bolt on a few MCP servers, dump in a giant CLAUDE.md, and drag a long chat history behind you, you can easily burn over 50% of that window before you paste a single line of code.
This post is about that hidden tax—and how to stop paying it.
Where This Started
This exploration started when I came across a LinkedIn post by Johnny Winter featuring a YouTube video about terminal-based AI tools and context management. The video demonstrates how tools like Claude Code, Gemini CLI, and others leverage project-aware context files—which got me thinking about what’s actually consuming all that context space.
ℹ️ Note: While this post uses Claude Code for examples, these concepts apply to any AI coding agent—GitHub Copilot, Cursor, Windsurf, Gemini CLI, and others.
The Problem: You’re Already at 50% Before You Start
Think of a context window as working memory. Modern AI models have impressive limits (as of 2025):
- Claude Sonnet 4.5: 200K tokens (1M beta for tier 4+)
- GPT-5: 400K tokens via API
- Gemini 3 Pro: 1M input tokens
A token is roughly 3-4 characters, or about three-quarters of an average English word, so a 200K-token window works out to roughly 150,000 words. That sounds like plenty, right?
Here’s what actually consumes it:
- System prompt and system tools
- MCP server tool definitions
- Memory files (CLAUDE.md, .cursorrules)
- Autocompact buffer (reserved for conversation management)
- Conversation history
- Your code and the response being generated
By the time you add a few MCPs and memory files, a large chunk of your context window is already gone—before you’ve written a single line of code.
Real Numbers: The MCP Tax
Model Context Protocol (MCP) servers make it easier to connect AI agents to external tools and data. But each server you add costs tokens.
Here’s what my actual setup looked like (from Claude Code’s /context command):

MCP tools alone consume 16.3% of the context window—before I’ve even started a conversation. Combined with system overhead, I’m already at 51% usage with essentially zero messages.
The Compounding Effect
The real problem emerges when overhead compounds. Here’s my actual breakdown:
| Category | Tokens | % of Window |
|---|---|---|
| System prompt | 3.0k | 1.5% |
| System tools | 14.8k | 7.4% |
| MCP tools | 32.6k | 16.3% |
| Custom agents | 794 | 0.4% |
| Memory files | 5.4k | 2.7% |
| Messages | 8 | 0.0% |
| Autocompact buffer | 45.0k | 22.5% |
| Free space | 99k | 49.3% |
Total: 101k/200k tokens used (51%)
You’re working with less than half your theoretical capacity—and that’s with essentially zero conversation history. Once you start coding, the available space shrinks even further.
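To see how these numbers combine, here is a quick sketch that reproduces the same arithmetic. It is nothing Claude Code exposes programmatically, just the figures from the table above plugged into a small script:

```typescript
// Back-of-the-envelope context budget, using the figures from the table above.
const WINDOW_TOKENS = 200_000;

const overhead: Record<string, number> = {
  "System prompt": 3_000,
  "System tools": 14_800,
  "MCP tools": 32_600,
  "Custom agents": 794,
  "Memory files": 5_400,
  "Messages": 8,
  "Autocompact buffer": 45_000,
};

const pct = (tokens: number) => ((tokens / WINDOW_TOKENS) * 100).toFixed(1) + "%";

const used = Object.values(overhead).reduce((sum, t) => sum + t, 0);
const baseline = used - overhead["Autocompact buffer"];

for (const [category, tokens] of Object.entries(overhead)) {
  console.log(`${category.padEnd(20)} ${tokens.toLocaleString().padStart(8)}  ${pct(tokens)}`);
}
console.log(`Used: ${used.toLocaleString()} / ${WINDOW_TOKENS.toLocaleString()} (${pct(used)})`);
console.log(`Free: ${(WINDOW_TOKENS - used).toLocaleString()} (${pct(WINDOW_TOKENS - used)})`);
console.log(`Baseline overhead excluding autocompact: ${pct(baseline)}`);
```

Rounding aside, this reproduces the 51% used / 49% free split, and shows baseline overhead (excluding the autocompact buffer) sitting around 28%, which is why the takeaways later in this post suggest keeping that figure under 30%.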

Why This Matters: Performance and Quality
Context consumption affects more than just space:
Processing Latency: Empirical testing with GPT-4 Turbo shows that time to first token increases by approximately 0.24ms per input token. That means every additional 10,000 tokens adds roughly 2.4 seconds of latency to initial response time. (Source: Glean’s research on input token impact)
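The arithmetic behind that estimate is simple enough to sanity-check yourself. A minimal sketch, assuming the 0.24 ms/token figure holds and no prompt caching is in play:

```typescript
// Rough added time-to-first-token from prompt overhead, assuming no prompt caching.
// 0.24 ms per input token is the empirical GPT-4 Turbo figure cited above;
// treat it as a rule of thumb, not a spec.
const MS_PER_INPUT_TOKEN = 0.24;

const addedLatencySeconds = (overheadTokens: number) =>
  (overheadTokens * MS_PER_INPUT_TOKEN) / 1000;

console.log(addedLatencySeconds(10_000)); // ~2.4 s
console.log(addedLatencySeconds(32_600)); // ~7.8 s for the MCP tool definitions above
```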
Cache Invalidation: Modern AI systems cache frequently reused context, typically the system prompt, tool definitions, and instructions at the start of every request. Any change to that prefix (adding an MCP server, editing your instructions) invalidates the cache and forces full reprocessing.
Quality Degradation: When context gets tight, models may:
- Skip intermediate reasoning steps
- Miss edge cases
- Spread attention too thinly across information
- Fill gaps with plausible but incorrect information
- Truncate earlier conversation, losing track of prior requirements
I’ve noticed this particularly in long coding sessions. After discussing architecture early in a conversation, the agent later suggests solutions that contradict those earlier decisions—because that context has been truncated away.
Practical Optimization: Real-World Example
Let me share a before/after from my own setup:
Before Optimization:
- 10+ MCP servers enabled all the time
- MCP tools consuming 32.6k tokens (16.3%)
- Only 99k tokens free (49.3%)
- Frequent need to summarize/restart sessions
After Optimization:
- 3-4 MCPs enabled by default
- MCP tools reduced to ~12k tokens (~6%)
- Memory files trimmed to essentials (~3k tokens)
- Over 140k tokens free (70%+)
Results: More working space, better reasoning quality, fewer context limit issues, and faster responses.
Optimization Checklist
Before adding another MCP or expanding instructions:
- Have I measured my current context overhead?
- Is my custom instruction file under 5,000 tokens?
- Do I actively use all enabled MCPs?
- Have I removed redundant or outdated instructions?
- Could I accomplish this goal without consuming more context?
In Claude Code: Use the /context command to see your current context usage breakdown.
Specific Optimization Strategies
1. Audit Your MCPs Regularly
Ask yourself:
- Do I use this MCP daily? Weekly? Monthly?
- Could I accomplish this task without the MCP?
Action: Disable MCPs you don’t use regularly. Enable them only when needed.

Impact of Selective MCP Usage

By selectively disabling MCPs you don’t frequently use, you can immediately recover significant context space. This screenshot shows the difference in available context when strategically choosing which MCPs to keep active versus loading everything.
In Claude Code, you can toggle MCPs through the settings panel. This simple action can recover 10-16% of your context window.
2. Ruthlessly Edit Custom Instructions
Your CLAUDE.md memory files, .cursorrules, or copilot-instructions.md should be:
- Concise (under 5,000 tokens)
- Focused on patterns, not examples
- Project-specific, not general AI guidance
Bad Example:
When writing code, always follow best practices. Use meaningful
variable names. Write comments. Test your code. Follow SOLID
principles. Consider performance. Think about maintainability...
(Continues for 200 lines)
Good Example:
Code Style:
- TypeScript strict mode
- Functional patterns preferred
- Max function length: 50 lines
- All public APIs must have JSDoc
Testing:
- Vitest for unit tests
- Each function needs test coverage
- Mock external dependencies
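One way to hold yourself to that budget is to estimate the size of your memory files directly. Here is a minimal sketch using the rough 4-characters-per-token heuristic from earlier; the file list is just an example, and exact counts would require the provider's tokenizer or token-counting API.

```typescript
// estimate-memory-tokens.ts
// Rough token estimate for memory/instruction files, using the ~4 characters
// per token heuristic. This is only a budget check, not an exact count.
import { existsSync, readFileSync } from "node:fs";

const CHARS_PER_TOKEN = 4;
const BUDGET = 5_000;

// Adjust this list to match your own project layout.
const candidates = ["CLAUDE.md", ".cursorrules", ".github/copilot-instructions.md"];

for (const file of candidates) {
  if (!existsSync(file)) continue;
  const text = readFileSync(file, "utf8");
  const estimate = Math.round(text.length / CHARS_PER_TOKEN);
  const flag = estimate > BUDGET ? "  <-- over budget, trim it" : "";
  console.log(`${file}: ~${estimate} tokens${flag}`);
}
```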
3. Start Fresh When Appropriate
Long conversations accumulate context. Sometimes the best optimization is:
- Summarizing what’s been decided
- Starting a new session with that summary
- Dropping irrelevant historical context
4. Understand Autocompact Buffer
Claude Code includes an autocompact buffer that helps manage context automatically. When you run /context, you’ll see something like:
Autocompact buffer: 45.0k tokens (22.5%)
This buffer reserves space to prevent hitting hard token limits by automatically compacting or summarizing older messages during long conversations. It maintains continuity without abrupt truncation—but it also means that 22.5% of your window is already taken.
You can also see and toggle this behavior in Claude Code’s /config settings:

In this screenshot, Auto-compact is enabled, which keeps a dedicated buffer for summarizing older messages so long conversations stay coherent without suddenly hitting hard context limits.
Claude Code Specific Limitations: The Granularity Problem
Claude Code currently has a platform-level limitation that makes fine-grained control challenging, documented in GitHub Issue #7328: “MCP Tool Filtering”.
The Core Issue: Claude Code loads ALL tools from configured MCP servers. You can only enable or disable entire servers, not individual tools within a server.
The Impact: Large MCP servers with 20+ tools can easily consume 50,000+ tokens just on definitions. If a server has 25 tools but you only need 3, you must either:
- Load all 25 tools and accept the context cost
- Disable the entire server and lose access to the 3 tools you need
- Build a custom minimal MCP server (significant development effort)
This makes tool-level filtering essential for context optimization, not just a convenience. The feature is under active development with community support. In the meantime:
- Use MCP servers sparingly
- Prefer smaller, focused servers over large multi-tool servers (see the sketch after this list)
- Regularly audit which servers you actually need enabled
- Provide feedback on the GitHub issues to help prioritize this feature
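If you do end up building a small, focused server rather than loading a 25-tool one, the scaffolding is modest. The sketch below assumes the official @modelcontextprotocol/sdk TypeScript package (plus zod for the input schema); the tool itself is a made-up example, and the SDK's API surface can change between versions, so treat this as a starting point rather than a reference implementation.

```typescript
// minimal-server.ts: a single-tool MCP server sketch.
// Assumes @modelcontextprotocol/sdk and zod are installed; the API names
// follow the SDK's published examples and may differ in newer versions.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "focused-tools", version: "0.1.0" });

// One small tool definition costs a few hundred context tokens instead of the
// tens of thousands a large multi-tool server can add.
server.tool(
  "word_count", // hypothetical example tool
  { text: z.string() },
  async ({ text }) => {
    const trimmed = text.trim();
    const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
    return { content: [{ type: "text", text: String(words) }] };
  }
);

await server.connect(new StdioServerTransport());
```

Registered with Claude Code, something like this adds a single tool definition to your context rather than an entire catalog.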
Key Takeaways
You’re burning a huge portion of your context window before you even paste in your first file. MCP tools alone can consume 16%+ of your window. System tools add another 7%. The autocompact buffer reserves 22%. It adds up fast.
Optimization is ongoing. Regular audits of MCPs and memory files keep your agent running smoothly. Aim to keep baseline overhead under 30% of total context (excluding the autocompact buffer).
Measurement matters. Use /context in Claude Code to monitor your overhead. You can’t optimize what you don’t measure.
Performance degrades subtly. Latency increases roughly 2.4 seconds per 10,000 tokens based on empirical testing. Reasoning quality drops as context fills up.
Start minimal, add intentionally. The best developers using AI agents:
- Start minimal
- Add capabilities intentionally
- Monitor performance impact
- Optimize regularly
- Remove what isn’t providing value
The goal isn’t to minimize context usage at all costs. The goal is intentional, efficient context usage that maximizes response quality, processing speed, and available working space.
Think of your context window like RAM in a computer. More programs running means less memory for each program. Eventually, everything slows down.
It’s not about having every tool available. It’s about having the right tools, configured optimally, for the work at hand.
Resources
Official Documentation
- Claude Code MCP Documentation
- Model Context Protocol (MCP) Overview
- Claude Code Best Practices
- Claude Code Cost Management
- Claude Context Windows
Research & Performance
- Glean’s research on input token impact
Community Resources
- Model Context Protocol Documentation
- GitHub Copilot Custom Instructions
- Johnny Winter’s LinkedIn Post on Terminal AI
- You’ve Been Using AI the Hard Way (Use This Instead) - YouTube Video
Have you optimized your AI agent setup? What context window challenges have you encountered? I’d love to hear your experiences and optimization strategies.