Feature Request: Lazy Loading for MCP Servers and Tools (95% context reduction possible)
Feature Request: Lazy Loading for MCP Servers and Tools
Problem Statement
Currently, Claude Code loads all configured MCP servers, tools, and agents at session startup, consuming significant context before any conversation begins. In my environment:
- MCP tools: 39.8k tokens (19.9%)
- Custom agents: 9.7k tokens (4.9%)
- System tools: 22.6k tokens (11.3%)
- Memory files: 36.0k tokens (18.0%)
- Total: ~108k tokens (54% of 200k limit)
This leaves only 92k tokens for actual conversation and work, severely limiting complex tasks.
Proposed Solution
Implement lazy loading for MCP servers and tools, loading them only when needed based on conversation context.
Core Features
-
Lightweight Registry System
- Load only a small index (~5k tokens) at startup
- Registry contains tool names, descriptions, and trigger keywords
- Tools load on-demand when keywords are detected
-
Intelligent Loading
- Analyze user input for relevant keywords
- Load only required tools for the task
- Cache loaded tools for session duration
- Preload related tools that commonly work together
-
Configuration Options
{ "optimization": { "lazyLoading": true, "maxInitialTokens": 5000, "autoLoadThreshold": 0.8, "cacheMinutes": 30 }, "mcpServers": { "example-server": { "command": "...", "lazyLoad": true, "triggers": ["keyword1", "keyword2"], "preloadWith": ["related-server"] } } }
Benefits
- 95% Token Reduction: From 108k to ~5k initial tokens
- Longer Conversations: 195k tokens available vs 92k currently
- Better Performance: Faster startup, lower memory usage
- Scalability: Can add more tools without context penalty
Implementation Suggestion
Phase 1: Basic Lazy Loading
- Add
lazyLoadflag to MCP server configs - Load registry instead of full tool definitions
- Implement on-demand loading when tool is called
Phase 2: Intelligent Preloading
- Keyword-based auto-loading
- Pattern recognition for common workflows
- Tool relationship mapping
Phase 3: Advanced Optimization
- Session-based learning of tool usage patterns
- Predictive preloading based on project context
- Dynamic unloading of unused tools
User Experience
Before (Current)
Starting session...
Loading 73 MCP tools... [39.8k tokens]
Loading 56 agents... [9.7k tokens]
Loading system tools... [22.6k tokens]
Ready with 92k tokens remaining.
After (With Lazy Loading)
Starting session...
Loading tool registry... [5k tokens]
Ready with 195k tokens available.
User: "I need to build a React component"
> Auto-loading: context7, magic [+3.5k tokens]
> 191.5k tokens remaining
Test Case
I've already built a proof-of-concept in ~/.claude/optimization/ that demonstrates:
- Registry generation from existing MCP configs
- Keyword extraction and mapping
- Lazy loader with pattern matching
- 95% token reduction achieved
Files available for reference:
-
tool-registry.json- Example registry structure -
lazy-loader.py- Proof of concept implementation -
generate-index.py- Registry generation logic
Priority
High - This is a critical limitation for power users with many tools and complex workflows. The current 54% context consumption at startup makes many advanced use cases impossible.
Similar Products
- VSCode: Lazy loads extensions based on file types and activation events
- JetBrains IDEs: Load plugins on-demand based on project type
- Vim/Neovim: Lazy loading plugins (lazy.nvim, vim-plug with lazy loading)
Contact
Happy to provide more details, test beta implementations, or share the proof-of-concept code.
Submitted by: dtaylor Date: July 9, 2025 Claude Code Version: [Current Version] Impact: Affects all users with multiple MCP servers configured
Found 3 possible duplicate issues:
- https://github.com/anthropics/claude-code/issues/6638
- https://github.com/anthropics/claude-code/issues/7172
- https://github.com/anthropics/claude-code/issues/3036
This issue will be automatically closed as a duplicate in 3 days.
- If your issue is a duplicate, please close it and π the existing issue instead
- To prevent auto-closure, add a comment or π this comment
π€ Generated with Claude Code
I've created a proof-of-concept implementation demonstrating this lazy loading system:
π Repository: https://github.com/machjesusmoto/claude-lazy-loading
The implementation shows:
- β 95% token reduction (from 108k to 5k)
- β Working registry generation from MCP configs
- β Intelligent keyword-based loading
- β Preload profiles for common workflows
- β Real-time token tracking
This could be integrated into Claude Code to dramatically improve the experience for power users with multiple MCP servers configured.
@machjesusmoto How does it work? Does it change what mcps / tools are visible when the /mcp command is invoked?
I think this is definitely a must to properly support MCP servers. I was thinking just giving defined Agents the ability to keep their own MCP server references (and being able to select only a sub-set of tools from any given MCP server). Then, for example, if I wanted to implement one of my Linear tasks it would call an Agent ("Implement Linear task KEY-123") with the MCP tools in its context for that specific task and not bloat the main thread with context.
Thank you for the feedback! Let me address the questions:
@lukemmtt - How it works
The proof-of-concept creates a lightweight registry (~500 tokens) that replaces loading all MCP tools upfront:
- At startup: Only the registry loads (5k tokens vs 108k)
- During conversation: The system analyzes input for keywords
- On-demand loading: Only required tools are loaded when detected
- Caching: Loaded tools stay in memory for the session
Currently, it's a simulation showing what's possible. The /mcp command would still show all configured servers, but only the registry would consume tokens until tools are actually needed.
@wizardlyluke - Agent-specific MCP servers
That's a brilliant approach! Agent-scoped MCP servers would be even more efficient. My implementation could extend to support:
{
"agents": {
"linear-implementer": {
"mcp_servers": ["linear", "github"],
"auto_load": false
}
}
}
Regarding Duplicates
While related issues exist (#6638, #7172, #3036), this implementation offers:
- Working code: Not just a request, but functioning proof-of-concept
- 95% reduction achieved: Concrete metrics showing feasibility
- Keyword-based loading: Intelligent detection beyond manual control
- Registry approach: Minimal overhead solution that scales
The key differentiator is having working code that demonstrates the solution. This could help accelerate implementation by providing a reference.
Test It Yourself
git clone https://github.com/machjesusmoto/claude-lazy-loading.git
cd claude-lazy-loading
python3 optimization/lazy-loader.py stats
Happy to collaborate with others experiencing this issue to refine the approach!
Regarding Duplicate Status
After reviewing the related issues, I believe this should remain open as it provides unique value:
Issue Comparison
#3036 - Reports the problem ("MCP servers eat context")
#6638 - Requests dynamic loading/unloading
#7172 - Proposes token management improvements
#7336 (This) - Provides working implementation with:
- π¦ Functional code: https://github.com/machjesusmoto/claude-lazy-loading
- π Proven metrics: 95% reduction achieved
- π§ Registry approach: Novel solution using lightweight index
- π― Keyword detection: Intelligent loading based on context
Why Keep Open
- Solution, not just problem: Others identify the issue; this provides a solution
- Reference implementation: Gives Claude team concrete code to work from
- Community testable: Others can validate the approach with their configs
- Different approach: Registry-based vs manual loading/unloading
I suggest we could:
- Link these issues together as "related"
- Keep this open as the "implementation reference"
- Use others for requirements gathering
What do you think @lukemmtt @wizardlyluke? Should we consolidate discussion or keep the implementation separate?
Seems like a neat implementation, I'll give it a shot this week.
I've personally been getting by with less refined approach: a custom-built script that I made that adds/removes mcps from the actual claude config files (using a separate json file to persistently store the mcp configs)βthe obvious caveat being that I still need to restart the claude code session after each change for it to take effect; your solution sounds much more elegant.
As for consolidating the discussions or not, that's reasonable, but more importantly would be to just mention your solution in the other discussions to tie everything together, and maybe share it in the anthropic / claude / claude code subreddits for visibility, rather than wait for moderator intervention, which seems sparse in this noisy forum.
@lukemmtt Thanks for trying it out! Your config-swapping approach is clever - it's actually complementary to what I built.
Current Limitations
You're absolutely right that this can't truly lazy-load without Claude Code native support. What it does:
- Shows what's possible: Demonstrates 95% reduction is achievable
- Simulates the behavior: Shows which tools would load for given inputs
- Provides the blueprint: Registry structure and loading logic ready for integration
Combining Approaches
Your script + this registry could work together:
# Use lazy-loader to analyze what's needed
python3 lazy-loader.py analyze "building React components today"
# Output: Would load context7, magic
# Use your script to update config with just those MCPs
your-script enable context7 magic
# Restart Claude with minimal MCPs loaded
Next Steps
Great idea about cross-posting! I'll:
- β Comment on related issues (#3036, #6638, #7172) - Done!
- β Post to r/ClaudeLLM and r/AnthropicClaude
- β Share in Claude Discord if there's a channel for it
The real win would be getting this integrated natively. Until then, your config-swapping + my analysis tool might be the best workaround.
Would you be interested in combining efforts? We could:
- Add config manipulation to the lazy-loader
- Auto-generate minimal configs based on task analysis
- Create a wrapper that restarts Claude with optimized MCPs
lol... @machjesusmoto , you got me. Freeze all motor functions
It is worth noting that a runtime MCP Toggle feature has been introduced in Claude Code 2.0.10:
2.0.10
- Rewrote terminal renderer for buttery smooth UI
- Enable/disable MCP servers by @mentioning, or in /mcp
- Added tab completion for shell commands in bash mode
- PreToolUse hooks can now modify tool inputs
- Press Ctrl-G to edit your prompt in your system's configured text editor
- Fixes for bash permission checks with environment variables in the command
Dynamically-loading MCPs is still an interesting concept, but the new ability to enable by "@mentioning" is almost equally valuable for my needs.
It is worth noting that a runtime MCP Toggle feature has been introduced in Claude Code 2.0.10:
2.0.10
- Rewrote terminal renderer for buttery smooth UI
- Enable/disable MCP servers by @mentioning, or in /mcp
- Added tab completion for shell commands in bash mode
- PreToolUse hooks can now modify tool inputs
- Press Ctrl-G to edit your prompt in your system's configured text editor
- Fixes for bash permission checks with environment variables in the command
Dynamically-loading MCPs is still an interesting concept, but the new ability to enable by "@mentioning" is almost equally valuable for my needs.
Wouldn't this feature be the equivalent (though faster and not requiring Claude Code restart) of just removing/adding them manually? The true power of using many MCPs will be when you can get them added per defined agent, or allow the main agent to enable/disable them itself by knowing which MCPs it could have available.
Yes exactly, I would like my main agent to not have any mcp servers and then have specialized research agents etc.. who have mcp severs and use them for example for more detailed research (github mcp, linear, context7) and then report back a condensed summary to the main agent. So the main session always stays clean
2.0.10 made my work on machjesusmoto / mcp-toggle redundant, but then they delivered plugin extensibility in 2.0.12. A plugin system makes makes the project's overarching goal possible when it was previously only a theoretical PoC (machjesusmoto / claude-lazy-loading So, with a lot of caffeine and a little luck, I'll have functional "lazy loading & unloading with a registry being all that's held in context at launch" code to test soon.
I'd like to second @EugenEistrach 's idea of being able to enable select MCPs for specific subagents, while leaving it disabled for the main agent. Currently if you disable it in the main agent, then you cannot enable it for any subagent. I understand that there is a security aspect here so subprocesses can't have great privileges than parent processes. I suppose a true lazy-load would work, but something deterministic would be ideal.
https://www.anthropic.com/engineering/advanced-tool-use Hopefully support for this gets added to Claude Code soon π€
Update: Anthropic Released an Official API Solution
On November 24, 2025, Anthropic released the Tool Search Tool beta feature that directly addresses this problem.
How It Works
| Approach | Token Usage |
|---|---|
| Traditional (all tools upfront) | ~77K tokens |
| With Tool Search Tool | ~8.7K tokens |
| Reduction | 85% |
Accuracy Improvements
- Opus 4: 49% β 74%
- Opus 4.5: 79.5% β 88.1%
API Usage
client.beta.messages.create(
betas=["advanced-tool-use-2025-11-20"],
tools=[
{"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
{
"name": "mcp__server__tool",
"description": "...",
"input_schema": {...},
"defer_loading": True # Not loaded until searched
}
]
)
Proposed Claude Code Integration
Since this is now a first-party Anthropic feature, it would be great to see Claude Code CLI support via:
Option 1 - Global config:
{
"betaFeatures": ["advanced-tool-use-2025-11-20"],
"toolSearch": {
"enabled": true,
"defaultDeferLoading": true
}
}
Option 2 - Per-server config:
{
"mcpServers": {
"my-server": {
"command": "...",
"deferLoading": true,
"alwaysLoadTools": ["critical_tool"]
}
}
}
Option 3 - Auto-enable when MCP tool count exceeds a threshold (e.g., 20+ tools)
References
+1 to see how and when claude code will implement this
@ everyone try this: https://github.com/anthropics/claude-code/issues/12836#issuecomment-3629052941
Basically:
- Add the following to your shell file (~/.zshrc, ~/.bashrc, etc)
# Claude Code - Enable experimental MCP-CLI for reduced token consumption
export ENABLE_EXPERIMENTAL_MCP_CLI=true
- Make sure you have claude code version >
2.0.62. - Enjoy what you see in /context π
Running into the same issue and gald to see this work and active development, I would like to chip in an idea on adding a lightweight router model for tool selection for your consideration.
95% token reduction already is awesome, but strict adherence to predefined keywords might hinder some users. Thus -->
Proposed Enhancement: Phase 4 - Intelligent AI Router
The Problem with Static Keyword Matching
While keyword triggers work well for explicit mentions (e.g., "docker" β load docker-mcp), they have limitations:
- Implicit requests: "Show me what containers are running" (no "docker" keyword)
- Multi-domain tasks: "Email the team about the database migration" (needs both gmail + database tools)
- Ambiguous keywords: "push" could mean git, docker, notifications, etc.
- Manual configuration: Requires users to define all triggers upfront
Solution: Two-Phase Architecture with Lightweight Model Router
Phase 1: Lightweight Router (Haiku/Fast Model)
User prompt β Haiku analyzes intent
β
Tool catalog scan (name + description only)
β
Intelligent tool selection
β
Returns minimal tool set
Phase 2: Main Model (Sonnet/Opus)
User prompt + selected tools only β Execute task
How It Works
1. At Session Start
// Haiku loads ultra-lightweight tool catalog (~2-3k tokens)
{
"catalog": [
{
"server": "docker-mcp",
"tool": "list-containers",
"description": "List all Docker containers",
"tags": ["docker", "containers", "processes"]
},
{
"server": "google-workspace",
"tool": "send_gmail_message",
"description": "Send an email via Gmail",
"tags": ["email", "gmail", "communication"]
}
// ... all tools as lightweight entries
]
}
2. On User Input
User: "Show me what containers are running and email the status to [email protected]"
Haiku Router:
1. Analyzes prompt semantics (not just keywords)
2. Identifies intents: [container_listing, email_sending]
3. Searches catalog for relevant tools
4. Returns: [docker-mcp::list-containers, google-workspace::send_gmail_message]
Main Model:
- Receives only 2 tool definitions (~1k tokens)
- Executes task with 98%+ context efficiency
3. Fallback Mechanism
If Router misses tools:
β Main model requests additional tools mid-conversation
β Router re-analyzes with conversation context
β Tools load on-demand
β No session restart needed
Integration with Your Registry System
Your POC already has the foundation! The enhancement would be:
{
"optimization": {
"lazyLoading": true,
"loadingStrategy": "ai-router", // NEW: "keywords" | "ai-router" | "hybrid"
"routerModel": "haiku", // NEW: Fast, cheap model for routing
"fallbackToKeywords": true, // NEW: Use keywords if router unavailable
"maxInitialTokens": 3000,
"cacheMinutes": 30
},
"mcpServers": {
"docker-mcp": {
"lazyLoad": true,
"triggers": ["docker", "container"], // Fallback keywords
"semanticTags": ["containerization", "processes", "services"] // NEW: For AI router
}
}
}
Implementation Phases
Phase 4.1: Basic AI Router
- Haiku analyzes user prompt
- Selects tools from catalog based on semantic understanding
- Main model receives selected tools only
Phase 4.2: Context-Aware Routing
- Router considers conversation history
- Predictive preloading based on workflow patterns
- Learning from tool usage (which tools actually get used)
Phase 4.3: Hybrid Intelligence
- Combines AI routing + your keyword triggers
- Falls back to keywords if AI router is uncertain
- User can override with explicit
@servermentions
Benefits Beyond Keyword Matching
| Feature | Keyword Matching | AI Router |
|---|---|---|
| Explicit mentions | β Excellent | β Excellent |
| Implicit requests | β Misses | β Handles |
| Multi-domain tasks | β οΈ Partial | β Optimizes |
| Context awareness | β No | β Yes |
| Ambiguity resolution | β No | β Yes |
| Setup effort | β οΈ Manual triggers | β Zero config |
| Token overhead | ~5k registry | ~3k catalog |
Real-World Example
Scenario: "Check if the deployment is healthy and notify the team"
Keyword Approach:
Triggers: deployment? notify? team?
β Might load all of: kubernetes, docker, slack, email, github
β 5-10k tokens for tools user might not need
Discussion Questions
- Would you consider such an AI router enhancement to your POC?
- If so, would you think of it as a plugin to your registry system or rather a integrated core feature?
- Any concerns about the additional Haiku API call overhead vs. token savings?
Technical Notes
Edge Cases to Handle:
- Router uncertainty β load broader tool set + use conversation context for refinement
- Consider user overrides to limit tool set to consider by router model by limiting to specific mcp servers by name β Explicit
@servermentions limit to these tools considered by router model which then subsets to the tools used for main model call - Offline/API unavailable/model uncertainty what the best choice would be or lack of adequate tools among the user-subset mcp server(s) β Fall back to keyword matching
Looking forward to hear if this might be of interest and add useful functionality to the keyword registry approach!