
GH-2123: Add chunkOverlap support to TokenTextSplitter

Open ralla0405 opened this issue 3 months ago • 1 comment

Summary

This PR adds chunk overlap support to the TokenTextSplitter class. Consecutive chunks can now share a configurable number of tokens, so context that straddles a chunk boundary is preserved during document processing.

Closes #2123

Changes

  • TokenTextSplitter class enhancements:

    • Added chunkOverlap field with default value of 50 tokens
    • Updated constructor and builder to support chunk overlap configuration
    • Added validation to ensure chunkOverlap < chunkSize
    • Refactored doSplit method to implement overlap logic
    • Added optimizeChunkBoundary method for sentence-aware splitting
  • Test improvements:

    • Added testChunkOverlapFunctionality to verify overlap behavior
    • Added testChunkOverlapValidation for input validation
    • Added testBoundaryOptimizationWithOverlap for sentence boundary testing
    • Added testKeepSeparatorVariations for separator handling
    • Updated existing tests to handle dynamic chunk counts with overlap
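
The overlap logic and the chunkOverlap < chunkSize validation listed above can be illustrated with a minimal sliding-window sketch. This is not the code from the PR: the class OverlapChunker and its split method are hypothetical names, and it works on a plain list of token ids rather than on the splitter's real encoder. It only shows the general idea that each chunk starts chunkSize - chunkOverlap tokens after the previous one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spring AI.
final class OverlapChunker {

    private OverlapChunker() {
    }

    // Splits token ids into windows of at most chunkSize tokens, where each
    // window repeats the last chunkOverlap tokens of the previous window.
    static List<List<Integer>> split(List<Integer> tokens, int chunkSize, int chunkOverlap) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // An overlap equal to or larger than the chunk size would never advance.
        if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("chunkOverlap must be >= 0 and < chunkSize");
        }

        List<List<Integer>> chunks = new ArrayList<>();
        int step = chunkSize - chunkOverlap; // always >= 1 after validation
        int start = 0;
        while (start < tokens.size()) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window already reached the end of the input
            }
            start += step;
        }
        return chunks;
    }
}
```

With ten tokens numbered 0 to 9, a chunkSize of 4, and a chunkOverlap of 1, this sketch produces three chunks, and each chunk begins with the final token of the one before it. It also suggests why existing tests may need dynamic chunk counts: for the same input, a non-zero overlap generally yields more chunks.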

Key Features

  • Configurable overlap: Allows overlapping tokens between consecutive chunks
  • Boundary optimization: Attempts to split at sentence boundaries when possible
  • Input validation: Prevents invalid overlap configurations
  • Backward compatibility: Maintains existing API with sensible defaults
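
As a rough illustration of the boundary optimization feature, the sketch below trims a chunk back to its last sentence terminator when that terminator is far enough into the text. The names SentenceBoundary, trimToSentence, and minChars are assumptions made for this example; the PR's optimizeChunkBoundary method may use different rules and a different signature.

```java
// Hypothetical helper for illustration only; not the PR's optimizeChunkBoundary.
final class SentenceBoundary {

    // Characters treated as sentence terminators in this sketch.
    private static final char[] TERMINATORS = {'.', '?', '!', '\n'};

    private SentenceBoundary() {
    }

    // Returns the chunk cut just after its last terminator, provided that
    // terminator sits at index minChars or later; otherwise returns the chunk as-is.
    static String trimToSentence(String chunk, int minChars) {
        int cut = -1;
        for (char terminator : TERMINATORS) {
            cut = Math.max(cut, chunk.lastIndexOf(terminator));
        }
        if (cut >= minChars) {
            return chunk.substring(0, cut + 1);
        }
        return chunk; // no usable boundary, keep the raw token-based cut
    }
}
```

The minChars guard reflects the "when possible" qualifier: if the only terminator appears very early, cutting there would leave a tiny chunk, so the sketch falls back to the unmodified text.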

Test Coverage

All new functionality is covered by unit tests, which verify that:

  • Overlap functionality works correctly
  • Input validation prevents invalid configurations
  • Boundary optimization improves chunk quality
  • Metadata handling remains consistent across chunks

ralla0405 avatar Aug 07 '25 09:08 ralla0405