claude-coder
streaming slowdown when context rises
The streaming slowdown when context rises can be attributed to several factors in the current implementation:
- Context Management Mechanism:
  - Uses a static window approach instead of a dynamic sliding window, in order to preserve prompt caching
  - When the context grows too large, it triggers truncateHalfConversation, which can cause noticeable delays
  - The system waits until the context is already too large before compressing, rather than managing it preemptively (see the first sketch after this list)
- Streaming Implementation Bottlenecks:
  - The current debouncer uses a fixed 25 ms delay for processing chunks (see the second sketch after this list)
  - All chunks are processed sequentially, which can cause backpressure when the context is large
  - The system retries up to 3 times when the context is too long, and each retry adds latency
- Memory Management:
  - Large contexts are kept in memory until they hit the maximum token limit
  - The smart truncation system keeps the 8 most recent messages intact, which can be excessive for very large contexts
  - Context compression only happens reactively when limits are hit, rather than proactively
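A minimal sketch of the reactive pattern described above. The message shape, the rough token estimate, and the constants (MAX_CONTEXT_TOKENS, the exact truncateHalfConversation logic) are assumptions for illustration, not the actual claude-coder code:

```typescript
// Sketch only: names and thresholds are illustrative.
interface Message {
  role: "user" | "assistant";
  content: string;
}

const MAX_CONTEXT_TOKENS = 128_000;    // assumed hard limit for illustration
const RECENT_MESSAGES_TO_PRESERVE = 8; // mirrors the "8 recent messages" above

// Rough token estimate; the real code would use a proper tokenizer.
const estimateTokens = (msgs: Message[]): number =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Drops roughly the older half of the conversation while keeping the
// first message and the most recent ones intact.
function truncateHalfConversation(msgs: Message[]): Message[] {
  if (msgs.length <= RECENT_MESSAGES_TO_PRESERVE + 1) return msgs;
  const head = msgs.slice(0, 1);
  const tail = msgs.slice(-RECENT_MESSAGES_TO_PRESERVE);
  const middle = msgs.slice(1, -RECENT_MESSAGES_TO_PRESERVE);
  return [...head, ...middle.slice(Math.floor(middle.length / 2)), ...tail];
}

// Reactive behavior: nothing happens until the limit is already exceeded,
// so the expensive truncation lands in the middle of an active request.
function prepareContext(msgs: Message[]): Message[] {
  if (estimateTokens(msgs) > MAX_CONTEXT_TOKENS) {
    return truncateHalfConversation(msgs);
  }
  return msgs;
}
```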
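The chunk debouncer can be pictured like this. Only the fixed 25 ms delay and the batch-then-flush behavior come from the description above; the class name and buffering details are illustrative:

```typescript
// Sketch only: a fixed-delay debouncer that batches incoming stream chunks
// and flushes them to the UI after a constant 25 ms, regardless of how much
// text is queued or how large the surrounding context is.
const CHUNK_DEBOUNCE_MS = 25;

class ChunkDebouncer {
  private buffer: string[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(private readonly flushFn: (text: string) => void) {}

  push(chunk: string): void {
    this.buffer.push(chunk);
    if (this.timer === null) {
      // The flush cadence never adapts, so with a large context the queue
      // keeps growing between ticks and the rendered output falls behind.
      this.timer = setTimeout(() => this.flush(), CHUNK_DEBOUNCE_MS);
    }
  }

  private flush(): void {
    this.flushFn(this.buffer.join(""));
    this.buffer = [];
    this.timer = null;
  }
}
```

Calling push on every streamed chunk coalesces updates into one render per 25 ms tick; the proposals below replace that constant with a delay derived from context size.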
The slowdown is primarily caused by:
- The reactive nature of context compression (it only runs once limits are hit)
- Sequential processing of chunks with a fixed delay
- Keeping too many recent messages intact during truncation
- Multiple retry attempts when the context is too long
To improve performance, consider:
- Implementing proactive context compression before limits are hit (see the sketch after this list)
- Adjusting the RECENT_MESSAGES_TO_PRESERVE count based on context size
- Using a dynamic debouncer delay that scales with context size
- Processing chunks in parallel for large contexts
- Compressing the context progressively instead of waiting for a full truncation pass
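A hedged sketch of the proactive-compression and dynamic-delay ideas, reusing the illustrative Message, estimateTokens, truncateHalfConversation, MAX_CONTEXT_TOKENS, and RECENT_MESSAGES_TO_PRESERVE definitions from the first sketch. The 80% threshold and the 100 ms ceiling are assumptions, not values from the codebase:

```typescript
// Sketch only: proactive, progressive compression plus a flush delay that
// scales with how full the context window is.
const COMPRESSION_THRESHOLD = 0.8; // assumed: start compressing at 80% of the limit
const BASE_DEBOUNCE_MS = 25;
const MAX_DEBOUNCE_MS = 100;       // assumed ceiling for the dynamic delay

// Compress before the hard limit is reached, so the truncation cost is paid
// between requests instead of in the middle of a streaming response.
function prepareContextProactively(msgs: Message[]): Message[] {
  let current = msgs;
  while (
    estimateTokens(current) > MAX_CONTEXT_TOKENS * COMPRESSION_THRESHOLD &&
    current.length > RECENT_MESSAGES_TO_PRESERVE + 1
  ) {
    // Progressive: shave one slice of older messages per pass rather than
    // waiting for a single large truncation.
    const next = truncateHalfConversation(current);
    if (next.length === current.length) break; // no further progress possible
    current = next;
  }
  return current;
}

// Small contexts keep the snappy 25 ms cadence; large contexts batch more
// text per flush so the UI does fewer, bigger renders.
function dynamicDebounceMs(contextTokens: number): number {
  const fill = Math.min(contextTokens / MAX_CONTEXT_TOKENS, 1);
  return Math.round(BASE_DEBOUNCE_MS + (MAX_DEBOUNCE_MS - BASE_DEBOUNCE_MS) * fill);
}
```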
These changes would help maintain consistent streaming performance even as context size increases.