langchain

RecursiveCharacterTextSplitter strange behavior after v0.0.226

Open austinmw opened this issue 1 year ago • 0 comments

System Info

After v0.0.226, the RecursiveCharacterTextSplitter seems to no longer separate properly at the end of sentences and now cuts many sentences mid-word.

Who can help?

No response

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [ ] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=20,
    length_function=len,
    # separators=["\n\n", "\n", ".", " ", ""],  # tried with and without this
)
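
To make the report easier to verify, something like the following could be used to count chunks that begin or end in the middle of a word. This is only a sketch: `mid_word_cuts` is a hypothetical helper, not part of LangChain, and it assumes each chunk appears verbatim in the source text (it only looks at the first occurrence of a chunk, so repeated passages may be misjudged).

```python
def mid_word_cuts(text, chunks):
    """Return the chunks whose start or end falls inside a word of `text`.

    Hypothetical diagnostic helper; chunks not found verbatim in `text`
    are skipped rather than guessed at.
    """
    flagged = []
    for chunk in chunks:
        if not chunk:
            continue
        start = text.find(chunk)  # first occurrence only
        if start == -1:
            continue
        end = start + len(chunk)
        # A boundary is "mid-word" when the characters on both sides are alphanumeric.
        cut_at_start = start > 0 and text[start - 1].isalnum() and text[start].isalnum()
        cut_at_end = end < len(text) and text[end - 1].isalnum() and text[end].isalnum()
        if cut_at_start or cut_at_end:
            flagged.append(chunk)
    return flagged


# Tiny self-contained demonstration with hand-made chunks.
demo_text = "hello world foo"
bad = mid_word_cuts(demo_text, ["hello wor", "ld foo"])
good = mid_word_cuts(demo_text, ["hello world", "foo"])
```

The same function could be pointed at the output of `splitter.split_text(text)` under v0.0.226 and a later version to compare how many chunks are flagged.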

Expected behavior

The splitter should break chunks at newlines or periods, not in the middle of a word.
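
For reference, the behavior I expect looks roughly like the sketch below. It is a simplified, dependency-free illustration and not LangChain's actual implementation: `split_recursive` is a made-up name, `chunk_overlap` is ignored, and the separator list (including `". "` with a trailing space) is my own assumption. The idea is to try the coarsest separator first and only fall back to a finer one when a piece is still too long.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into chunks of at most `chunk_size` characters,
    preferring paragraph, line, sentence, then word boundaries.

    Simplified illustration only; overlap between chunks is not handled.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= chunk_size:
                current = candidate  # keep merging pieces while they fit
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= chunk_size:
                current = piece
            else:
                # Piece is still too long: retry with the finer separators.
                chunks.extend(split_recursive(piece, chunk_size, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks

    # No separator applies: a hard cut is the only remaining option.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


sample = (
    "The quick brown fox jumps. Over the lazy dog.\n"
    "A second line here. And one more sentence."
)
chunks = split_recursive(sample, 30)
for c in chunks:
    print(repr(c))
```

With this approach a word is only cut when it is by itself longer than `chunk_size`, which is what I would expect from the real splitter as well.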

austinmw avatar Jul 10 '23 16:07 austinmw