semantic-router

Rolling Window splitter: alternatives to regex for pre-splitting

Open klein-t opened this issue 1 year ago • 2 comments

Regex works well for pre-splitting, but I've noticed it sometimes behaves oddly when a pre-split sentence needs more context. I propose adding an option to use spaCy as the sentence pre-splitter.

klein-t, Mar 12 '24 15:03

Hi @klein-t, the spirit of the idea is good. Do you have some examples of weird regex behavior that could be fixed by using spaCy sentencizers?

bruvduroiu, Mar 15 '24 04:03

Hey @bruvduroiu,

Sometimes I get a short sentence that contains a colon, and I'd like to keep it as one sentence. The current regex seems to split it in two at the colon; spaCy does not.
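A minimal sketch of the colon case, for illustration only. The pattern below is a hypothetical stand-in, not the regex the library actually uses, and the helper names `regex_presplit` and `spacy_presplit` are made up for this example. The spaCy side uses a blank English pipeline with the rule-based `sentencizer` component, so no model download should be needed; whether that is the right spaCy setup for the splitter is an assumption.

```python
import re
from typing import List

# Hypothetical pre-split pattern (NOT the library's actual regex):
# break on whitespace that follows '.', '?', '!' or ':'.
NAIVE_PATTERN = re.compile(r"(?<=[.?!:])\s+")


def regex_presplit(text: str) -> List[str]:
    """Split text into candidate sentences with the naive pattern."""
    return [part for part in NAIVE_PATTERN.split(text) if part]


def spacy_presplit(text: str) -> List[str]:
    """Split text with spaCy's rule-based sentencizer."""
    import spacy  # imported lazily so the regex path works without spaCy

    nlp = spacy.blank("en")      # tokenizer only, no trained model
    nlp.add_pipe("sentencizer")  # rule-based sentence boundaries
    return [sent.text.strip() for sent in nlp(text).sents]


text = "The fix is simple: restart the service. Then check the logs."

# The naive pattern breaks at the colon, giving three pieces.
print(regex_presplit(text))

try:
    print(spacy_presplit(text))
except ImportError:
    print("spaCy is not installed; skipping the comparison")
```

With this pattern the regex path yields three pieces, the first ending at the colon. As far as I know the sentencizer's default punctuation set does not include the ASCII colon, so it should keep "The fix is simple: restart the service." together.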

Don't get me wrong, I like regex, it's fast, but I feel that having spaCy in the loop might help with more nuanced scenarios.

Btw, I opened PR #204.

I just realized that adding spaCy might interfere with the lower bound on tokens?

lmk

klein-t, Mar 15 '24 13:03