
streaming slowdown when context rises

Open · PierrunoYT opened this issue 1 year ago · 0 comments

The streaming slowdown when context rises can be attributed to several factors in the current implementation:

  1. Context Management Mechanism:
  • Uses a static window approach instead of a dynamic sliding window, in order to preserve prompt caching
  • When the context gets too large, it triggers truncateHalfConversation, which can cause noticeable delays (see the sketch after this list)
  • The system waits until the context is already too large before compressing, rather than managing it preemptively
  2. Streaming Implementation Bottlenecks:
  • The current debouncer has a fixed 25 ms delay for processing chunks
  • All chunks are processed in sequence, which can cause backpressure when the context is large
  • The system retries up to 3 times when the context is too long, and each retry adds latency
  3. Memory Management:
  • Large contexts are kept in memory until they hit the maximum token limit
  • The smart truncation system keeps 8 recent messages intact, which can be excessive for very large contexts
  • Context compression only happens reactively when hitting limits, rather than proactively
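
For concreteness, here is a minimal sketch of the reactive pattern points 1 and 3 describe. Only truncateHalfConversation, the 8-message preservation count, and the "compress only after the limit is hit" trigger come from the points above; the Message type, countTokens, and MAX_CONTEXT_TOKENS are placeholders, not the extension's real API.

```ts
// A minimal sketch of the reactive truncation pattern described above, not the
// actual claude-coder implementation. Message, countTokens and
// MAX_CONTEXT_TOKENS are placeholders; truncateHalfConversation and the
// 8-message preservation count mirror the behavior noted in the list.
interface Message {
  role: "user" | "assistant";
  content: string;
}

const MAX_CONTEXT_TOKENS = 180_000;    // assumed hard limit, not the real value
const RECENT_MESSAGES_TO_PRESERVE = 8; // recent messages kept intact

function countTokens(messages: Message[]): number {
  // Placeholder estimate; a real implementation would use the model tokenizer.
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Drops roughly the older half of the middle of the conversation, keeping the
// first message (task setup) and the most recent messages intact.
function truncateHalfConversation(messages: Message[]): Message[] {
  if (messages.length <= RECENT_MESSAGES_TO_PRESERVE + 1) return messages;
  const first = messages[0];
  const recent = messages.slice(-RECENT_MESSAGES_TO_PRESERVE);
  const middle = messages.slice(1, -RECENT_MESSAGES_TO_PRESERVE);
  const keptMiddle = middle.slice(Math.floor(middle.length / 2));
  return [first, ...keptMiddle, ...recent];
}

// Reactive: compression only runs once the limit has already been hit, so the
// request that crosses the threshold pays the full truncation cost.
function prepareContext(messages: Message[]): Message[] {
  return countTokens(messages) > MAX_CONTEXT_TOKENS
    ? truncateHalfConversation(messages)
    : messages;
}
```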

The slowdown is primarily caused by:

  • The reactive nature of context compression (only happens when hitting limits)
  • Sequential processing of chunks with fixed delays (sketched just after this list)
  • Keeping too many recent messages intact during truncation
  • Multiple retry attempts when context is too long
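
As a rough illustration of the fixed-delay, sequential chunk handling: only the 25 ms figure comes from the description above; the names and queue shape here are hypothetical, not the extension's actual streaming code.

```ts
// Hypothetical sketch of a fixed-delay chunk debouncer. Only the 25 ms delay
// comes from the issue text; onStreamChunk and pendingChunks are illustrative
// names, not the extension's real API.
const CHUNK_DEBOUNCE_MS = 25;

let pendingChunks: string[] = [];
let flushTimer: ReturnType<typeof setTimeout> | null = null;

function onStreamChunk(chunk: string, render: (text: string) => void): void {
  pendingChunks.push(chunk);
  if (flushTimer !== null) return; // a flush is already scheduled
  flushTimer = setTimeout(() => {
    // Chunks are rendered strictly in arrival order. With a large context each
    // flush carries more work, but the delay never adapts, so the queue (and
    // perceived latency) grows as the conversation does.
    render(pendingChunks.join(""));
    pendingChunks = [];
    flushTimer = null;
  }, CHUNK_DEBOUNCE_MS);
}
```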

To improve performance, consider:

  1. Implementing proactive context compression before hitting limits
  2. Adjusting the RECENT_MESSAGES_TO_PRESERVE count based on context size
  3. Using a dynamic debouncer delay based on context size
  4. Implementing parallel chunk processing for large contexts
  5. Adding progressive context compression instead of waiting for full truncation

These changes would help maintain consistent streaming performance even as context size increases.
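
A rough sketch of suggestions 1 and 3, reusing the placeholder helpers from the first sketch above; the 80% threshold and the 10–50 ms delay range are arbitrary illustration values, not recommended constants.

```ts
// Sketch of suggestions 1 and 3, reusing Message, countTokens,
// MAX_CONTEXT_TOKENS and truncateHalfConversation from the first sketch.
// The 0.8 threshold and 10-50 ms bounds are arbitrary illustration values.
const PROACTIVE_COMPRESSION_RATIO = 0.8;

// Suggestion 1: start compressing before the hard limit is hit, so no single
// request pays the full truncation-plus-retry cost.
function maybeCompressProactively(messages: Message[]): Message[] {
  return countTokens(messages) > MAX_CONTEXT_TOKENS * PROACTIVE_COMPRESSION_RATIO
    ? truncateHalfConversation(messages)
    : messages;
}

// Suggestion 3: scale the debounce delay with context size, so small contexts
// stay responsive while large ones batch more output per flush.
function dynamicDebounceMs(contextTokens: number): number {
  const ratio = Math.min(contextTokens / MAX_CONTEXT_TOKENS, 1);
  return Math.round(10 + ratio * 40);
}
```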

PierrunoYT · Nov 17 '24