langchain
RecursiveCharacterTextSplitter strange behavior after v0.0.226
System Info
After v0.0.226, RecursiveCharacterTextSplitter no longer seems to split at sentence boundaries. Many chunks now end mid-sentence, often mid-word.
Who can help?
No response
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
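from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,       # maximum chunk length, as measured by length_function
    chunk_overlap=20,     # overlap between consecutive chunks
    length_function=len,
    # separators=["\n\n", "\n", ".", " ", ""],  # tried with and without this
)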
Expected behavior
The splitter should break chunks at newlines or periods, not in the middle of a word.
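To make "cuts mid-word" checkable, a small helper along these lines could be run over the splitter output. This is only a sketch: `find_mid_word_cuts` is a hypothetical name, not part of langchain, and it assumes each chunk is still a verbatim substring of the input text (it uses the first match, so repeated passages may be misattributed).

```python
from typing import List


def find_mid_word_cuts(text: str, chunks: List[str]) -> List[str]:
    """Return the chunks whose end falls inside a word of `text`."""
    bad = []
    for chunk in chunks:
        start = text.find(chunk)
        if not chunk or start == -1:
            # Empty or altered chunk: it cannot be located in the source text.
            continue
        end = start + len(chunk)
        # The cut is mid-word when the characters on both sides of the
        # chunk boundary are alphanumeric.
        if end < len(text) and chunk[-1].isalnum() and text[end].isalnum():
            bad.append(chunk)
    return bad


# Hand-made chunks standing in for splitter output.
sample = "Hello world. This is a test."
mid_word = find_mid_word_cuts(sample, ["Hello wor", "ld. This is a test."])
clean = find_mid_word_cuts(sample, ["Hello world.", "This is a test."])
```

With the splitter from the reproduction, the call would be `find_mid_word_cuts(text, splitter.split_text(text))`. A non-empty result on v0.0.226+ and an empty one on earlier versions would confirm the regression.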