semantic-router
Rolling Window splitter: alternatives to regex for pre-splitting
Regex works well for pre-splitting, but I've noticed it sometimes behaves oddly when a pre-split sentence needs more context. I propose adding an option to use spaCy as a sentence pre-splitter.
Hi @klein-t, the spirit of the idea is good. Do you have some examples of weird regex behavior that could be fixed by using spaCy sentencizers?
Hey @bruvduroiu,
sometimes a short sentence containing a colon gets split in two at the colon, and I'd like to keep it as one sentence. The current regex seems to split it in two; spaCy does not.
Don't get me wrong, I like regex, it's fast, but I feel having spaCy in the loop might help deal with more nuanced scenarios.
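To make the colon case concrete, here is a minimal sketch using only Python's `re` module. Both patterns and the `pre_split` helper are hypothetical illustrations, not the regex or API that semantic-router actually uses. The second pattern just drops `:` from the boundary set, which roughly mirrors a sentence splitter that only breaks on `.`, `!` and `?`.

```python
import re

# Hypothetical pattern that treats ':' as a sentence boundary.
# NOT the regex from semantic-router, only an illustration.
SPLIT_WITH_COLON = re.compile(r"(?<=[.!?:])\s+")

# Same idea, but ':' is no longer treated as a boundary.
SPLIT_WITHOUT_COLON = re.compile(r"(?<=[.!?])\s+")


def pre_split(text: str, pattern: re.Pattern) -> list[str]:
    """Split text on whitespace that follows a boundary character."""
    return [part for part in pattern.split(text.strip()) if part]


text = "Note: the router needs more context. It works otherwise."

with_colon = pre_split(text, SPLIT_WITH_COLON)
without_colon = pre_split(text, SPLIT_WITHOUT_COLON)

print(with_colon)     # the short "Note:" fragment ends up on its own
print(without_colon)  # the colon stays inside the first sentence
```

With the first pattern the short sentence is cut at the colon; with the second it stays whole, which is the behavior I'd like to have as an option.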
Btw, I opened PR #204.
I just realized that adding spaCy might mess with the lower bound for tokens?
lmk