[FEATURE] Proactive Context Compression
Problem Statement
The current strands-agents SDK only triggers context compression reactively, after a `ContextWindowOverflowException` is raised. This reactive approach has several critical limitations:
- Output Token Starvation: LLM context limits combine input and output tokens. When input tokens approach the limit, the model may have insufficient capacity to generate meaningful responses, leading to truncated or poor-quality outputs.
- Performance Degradation: Waiting until context overflow occurs means the agent operates at maximum context capacity, potentially degrading response quality and increasing latency.
- Inefficient Resource Usage: Operating at context limits wastes computational resources and may trigger unnecessary retries.
The current `SlidingWindowConversationManager` and `SummarizingConversationManager` only act when `reduce_context()` is called after an exception, missing opportunities for proactive optimization.
Proposed Solution
Implement Proactive Context Compression that triggers automatically when context usage reaches a configurable threshold (default: 70% of context window limit), preventing context overflow before it occurs.
Core Components
1. Proactive Compression Interface
Extend the `ConversationManager` abstract base class:

```python
@abstractmethod
def should_compress(self, agent: "Agent", **kwargs: Any) -> bool:
    """Determine if proactive compression should be triggered.

    Args:
        agent: The agent whose context will be evaluated.
        **kwargs: Additional parameters for the compression decision.

    Returns:
        True if compression should be triggered, False otherwise.
    """
    pass

@abstractmethod
def get_compression_threshold(self) -> float:
    """Get the current compression threshold as a fraction of the context window (0.0-1.0)."""
    pass
```
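As a rough illustration of how a concrete manager could satisfy this interface, the sketch below uses a crude character-based token estimate and a hard-coded `context_limit`; neither is provided by the SDK today (see the discussion at the end of this issue), so both are placeholder assumptions rather than proposed API:

```python
from typing import Any


class ProactiveCompressionMixin:
    """Illustrative sketch only; not an existing SDK class."""

    compression_threshold: float = 0.7
    context_limit: int = 128_000  # placeholder; the SDK does not expose this today

    def get_compression_threshold(self) -> float:
        return self.compression_threshold

    def should_compress(self, agent: Any, **kwargs: Any) -> bool:
        # Crude heuristic: roughly 4 characters per token across the serialized messages.
        estimated_tokens = sum(len(str(message)) // 4 for message in agent.messages)
        return estimated_tokens >= self.context_limit * self.compression_threshold
```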
2. Enhanced Apply Management
Modify `apply_management()` to support proactive compression:

```python
def apply_management(self, agent: "Agent", **kwargs: Any) -> None:
    """Apply management strategy including proactive compression."""
    if self.should_compress(agent, **kwargs):
        self.reduce_context(agent, **kwargs)
```
3. Configuration Options
Add configuration parameters to conversation managers:
```python
class ProactiveCompressionConfig:
    compression_threshold: float = 0.7        # 70% of the context window
    enable_proactive_compression: bool = True
    min_messages_before_compression: int = 2
    compression_cooldown_messages: int = 1    # Prevent rapid re-compression
```
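To support the validation work listed in Phase 3 below, this config could be made self-checking; the following is only a sketch using a dataclass with range checks, not an existing SDK type:

```python
from dataclasses import dataclass


@dataclass
class ProactiveCompressionConfig:
    compression_threshold: float = 0.7
    enable_proactive_compression: bool = True
    min_messages_before_compression: int = 2
    compression_cooldown_messages: int = 1

    def __post_init__(self) -> None:
        # Fail fast on nonsensical settings instead of silently never compressing.
        if not 0.0 < self.compression_threshold <= 1.0:
            raise ValueError("compression_threshold must be in (0.0, 1.0]")
        if self.min_messages_before_compression < 0:
            raise ValueError("min_messages_before_compression must be non-negative")
        if self.compression_cooldown_messages < 0:
            raise ValueError("compression_cooldown_messages must be non-negative")
```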
Implementation Strategy
Phase 1: Infrastructure
- Extend the `ConversationManager` base class with proactive compression methods
- Add compression threshold configuration to existing managers
- Implement token estimation utilities (a rough heuristic is sketched below)
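A minimal token-estimation utility could start from the common ~4-characters-per-token heuristic; the function below is only a sketch and would ideally be replaced by a model-specific estimate:

```python
import json
from typing import Any, Dict, List

CHARS_PER_TOKEN = 4  # rough heuristic for English text; real tokenizers vary by model


def estimate_message_tokens(messages: List[Dict[str, Any]]) -> int:
    """Roughly estimate the token footprint of a message list, including tool results."""
    serialized = json.dumps(messages, default=str)
    return len(serialized) // CHARS_PER_TOKEN
```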
Phase 2: Manager Updates
- Update `SummarizingConversationManager` with proactive compression
- Update `SlidingWindowConversationManager` with threshold-based trimming
- Add compression cooldown logic to prevent thrashing (see the sketch below)
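For the cooldown, one simple approach is to count messages since the last compression and refuse to compress again until enough new messages have arrived; the class below is illustrative only, not a proposed SDK API:

```python
from typing import Optional


class CompressionCooldown:
    """Illustrative guard that suppresses back-to-back compressions."""

    def __init__(self, cooldown_messages: int = 1):
        self.cooldown_messages = cooldown_messages
        self._messages_since_compression: Optional[int] = None  # None = never compressed

    def record_message(self) -> None:
        if self._messages_since_compression is not None:
            self._messages_since_compression += 1

    def record_compression(self) -> None:
        self._messages_since_compression = 0

    def allows_compression(self) -> bool:
        # Allow if we have never compressed, or enough messages have arrived since the last one.
        return (
            self._messages_since_compression is None
            or self._messages_since_compression >= self.cooldown_messages
        )
```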
Phase 3: Integration
- Integrate with `Agent._run_loop()` for automatic triggering
- Add telemetry and metrics for compression events (a logging sketch follows this list)
- Implement configuration validation and error handling
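For the telemetry item, plain logging is enough to show what a compression event should capture; a real integration would presumably go through the SDK's existing telemetry layer rather than this standalone helper:

```python
import logging

logger = logging.getLogger(__name__)


def log_compression_event(estimated_tokens: int, context_limit: int, threshold: float) -> None:
    """Emit a log record when proactive compression fires (illustrative only)."""
    logger.info(
        "proactive_compression_triggered",
        extra={
            "estimated_tokens": estimated_tokens,
            "context_limit": context_limit,
            "usage_ratio": round(estimated_tokens / context_limit, 3),
            "threshold": threshold,
        },
    )
```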
Use Cases
1. Long-Running Conversations
```python
agent = Agent(
    model=model,
    conversation_manager=SummarizingConversationManager(
        enable_proactive_compression=True,
        compression_threshold=0.7,  # Compress at 70% capacity
        preserve_recent_messages=10,
    ),
)

# Agent automatically compresses context before hitting limits
for i in range(100):
    result = agent(f"Tell me about topic {i}")
    # Compression happens transparently when needed
```
2. Tool-Heavy Workflows
```python
agent = Agent(
    model=model,
    tools=[data_analysis_tool, api_client, file_processor],
    conversation_manager=SummarizingConversationManager(
        compression_threshold=0.6,  # More aggressive for tool-heavy workflows
        preserve_recent_messages=5,
    ),
)

# Large tool results are compressed before context overflow
result = agent("Process this large dataset and generate insights")
```
3. Multi-Agent Scenarios
```python
# Coordinator agent with proactive compression
coordinator = Agent(
    conversation_manager=SlidingWindowConversationManager(
        window_size=50,
        compression_threshold=0.8,  # Conservative threshold
        enable_proactive_compression=True,
    ),
)
```
Alternative Solutions
1. Reactive Compression with Prediction
- Keep current reactive model but add predictive logic
- Pros: Minimal changes to existing architecture
- Cons: Still risks output token starvation and doesn't solve the core problem
2. Dynamic Context Window Adjustment
- Automatically adjust context window size based on usage patterns
- Pros: More flexible resource usage
- Cons: Complex implementation, model-dependent behavior
Additional Context
No response
It seems like a very interesting feature. Is its implementation planned?
I ran into this same problem and realized the SDK is missing two foundational pieces needed to implement threshold-based compression:
- No way to estimate current token usage - #1294 proposes adding `estimate_tokens()` to the Model interface
- No way to know the context limit - #1295 proposes adding a `context_limit` property to Model
Without these, you cannot calculate what "70% of context window" actually means. Once both are available, proactive compression becomes straightforward.
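Assuming #1294 and #1295 land with roughly the shapes proposed there, the threshold check reduces to a simple ratio; the sketch below relies on those proposed (not yet existing) APIs:

```python
from typing import Any


class ThresholdCheck:
    """Sketch of the ratio check once token estimation and a context limit are available."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold

    def should_compress(self, agent: Any) -> bool:
        # estimate_tokens() and context_limit are only proposals (#1294 / #1295);
        # they do not exist on the Model interface today.
        estimated_tokens = agent.model.estimate_tokens(agent.messages)
        context_limit = agent.model.context_limit
        return estimated_tokens >= context_limit * self.threshold
```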