spring-ai
GH-2123: Add chunkOverlap support to TokenTextSplitter
Summary
This PR adds chunk overlap functionality to the TokenTextSplitter class to
improve text chunking for better context preservation in document processing.
Closes #2123
Changes
- TokenTextSplitter class enhancements:
  - Added chunkOverlap field with default value of 50 tokens
  - Updated constructor and builder to support chunk overlap configuration
  - Added validation to ensure chunkOverlap < chunkSize
  - Refactored doSplit method to implement overlap logic
  - Added optimizeChunkBoundary method for sentence-aware splitting
- Test improvements:
  - Added testChunkOverlapFunctionality to verify overlap behavior
  - Added testChunkOverlapValidation for input validation
  - Added testBoundaryOptimizationWithOverlap for sentence boundary testing
  - Added testKeepSeparatorVariations for separator handling
  - Updated existing tests to handle dynamic chunk counts with overlap
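For reviewers who want a quick mental model of the overlap and validation rules listed above, here is a minimal, self-contained sketch. It is not the code in this PR: the class OverlapSketch and the helper splitWithOverlap are hypothetical names, and the sketch works on a plain list of tokens rather than on encoded text. It assumes each new chunk starts chunkSize - chunkOverlap tokens after the previous one.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSketch {

    // Hypothetical helper (not the PR's doSplit): cuts a token list into windows
    // of at most chunkSize tokens, where consecutive windows share chunkOverlap tokens.
    static <T> List<List<T>> splitWithOverlap(List<T> tokens, int chunkSize, int chunkOverlap) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Mirrors the validation described above: the overlap must be smaller than the chunk.
        if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("chunkOverlap must be >= 0 and < chunkSize");
        }
        List<List<T>> chunks = new ArrayList<>();
        int step = chunkSize - chunkOverlap; // always >= 1 thanks to the check above
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window already reaches the end of the input
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> tokens = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // prints [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]
        System.out.println(splitWithOverlap(tokens, 4, 1));
    }
}
```

The sketch also shows why chunkOverlap < chunkSize has to be enforced: with an overlap equal to or larger than the chunk size, the step would be zero or negative and the loop would never advance.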
Key Features
- Configurable overlap: Allows overlapping tokens between consecutive chunks
- Boundary optimization: Attempts to split at sentence boundaries when possible
- Input validation: Prevents invalid overlap configurations
- Backward compatibility: Maintains existing API with sensible defaults
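The boundary optimization feature can be illustrated with a small sketch as well. This is an assumption-laden illustration, not the optimizeChunkBoundary method from this PR: BoundarySketch and sentenceBoundary are hypothetical names, the sketch works on characters instead of tokens, and the rule of only searching the second half of the window is an arbitrary choice made here to keep chunks from shrinking too much.

```java
public class BoundarySketch {

    // Hypothetical stand-in for sentence-aware splitting.
    // Returns the index just past the last '.', '!' or '?' found in the second half
    // of text[0, limit), or the (clamped) limit itself when no terminator is found there.
    static int sentenceBoundary(String text, int limit) {
        int end = Math.min(limit, text.length());
        int floor = end / 2; // assumption: never cut away more than half of the window
        for (int i = end - 1; i >= floor; i--) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                return i + 1;
            }
        }
        return end;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence is longer.";
        int cut = sentenceBoundary(text, 20);
        // prints First sentence.
        System.out.println(text.substring(0, cut));
    }
}
```

When no sentence terminator falls inside the searched range, the sketch falls back to the hard limit, which matches the "when possible" wording in the feature list.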
Test Coverage
All new functionality is covered by comprehensive unit tests that verify:
- Overlap functionality works correctly
- Input validation prevents invalid configurations
- Boundary optimization improves chunk quality
- Metadata handling remains consistent across chunks