fix markdown text splitter horizontal lines

Open IlyaMichlin opened this issue 1 year ago • 1 comments

Fixes #5614

Issue

The *** combination raises an exception when used as a separator in re.split, because * is a regex quantifier with nothing to repeat. Instead, \*\*\* should be used in the regular expressions.
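A minimal sketch of the failure and the fix, using only the standard-library `re` module. The sample `text` and the variable names are illustrative and are not taken from the splitter's code.

```python
import re

# Illustrative input: a Markdown horizontal rule between two lines.
text = "intro\n***\noutro"

# An unescaped "***" is not a valid pattern: "*" is a quantifier
# and there is nothing before it to repeat, so re raises re.error.
try:
    re.split("***", text)
    raised = False
except re.error:
    raised = True

# Escaping each asterisk makes the pattern match three literal asterisks.
parts = re.split(r"\*\*\*", text)

print(raised)
print(parts)
```

The raw string keeps the backslashes intact, so the regex engine receives them as escapes.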

Who can review?

@eyurtsev

IlyaMichlin, Jun 02 '23 18:06

@devstein, I appreciate your suggestion, but using re.escape would also escape the '\' character in '\n', which isn't the intended behavior. Note that these strings are regular expressions, so they should be defined explicitly, both for clarity and to make their behavior easier to understand.
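One way to read this concern, sketched under the assumption that a separator is already written as a regular expression: passing it through `re.escape` escapes its backslashes too, so the result no longer matches the intended text. The name `regex_separator` and the sample text are hypothetical.

```python
import re

# Hypothetical separator, already written as a regular expression.
regex_separator = r"\n\*\*\*\n"
text = "a\n***\nb"

# Used directly, the pattern matches newline, three asterisks, newline.
direct = re.split(regex_separator, text)

# re.escape treats the pattern as literal text, so it now looks for the
# characters backslash, "n", backslash, "*", ... which the text lacks.
escaped = re.split(re.escape(regex_separator), text)

print(direct)
print(escaped)
```

Under that assumption, writing the escapes by hand keeps each separator readable as the regular expression it is.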

IlyaMichlin, Jun 03 '23 04:06

@hwchase17 I added a test for Markdown. There were a few issues with the regular expressions for RST and Markdown, which I fixed, and I added a test covering them.

IlyaMichlin, Jun 05 '23 05:06