Tokenize Regex as Parameter

Open jongaull-nimbly opened this issue 6 years ago • 4 comments

This PR adds support for a tokenizer parameter which gives the user more control over what constitutes a "token". If the tokenizer parameter is not set then the default regex is used.

jongaull-nimbly avatar Sep 17 '19 17:09 jongaull-nimbly

What's the use case here?

kpdecker avatar Aug 16 '20 19:08 kpdecker

The default regex is /(\s+|[()[\]{}'"]|\b)/ and the one I am using is /(\s+|[()[\]{}'"_]|\b)/.

It's been a while since I worked on this, but it looks like I wanted to add _ as a tokenizing character. I could also see a use case for adding a comma (,) when diffing CSV data, or a period (.) when diffing file names.
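
To illustrate the difference between the two patterns, here is a minimal sketch that uses plain String.prototype.split rather than jsdiff itself. The tokenize helper is hypothetical and is not the code in this PR; it only shows how adding _ to the character class changes where a string is split.

```javascript
// The default split pattern quoted above, and the variant that also splits on "_".
const defaultPattern = /(\s+|[()[\]{}'"]|\b)/;
const underscorePattern = /(\s+|[()[\]{}'"_]|\b)/;

// Hypothetical helper (not part of jsdiff): split on the pattern and drop
// the empty strings that split() can emit at zero-width \b matches.
function tokenize(text, pattern) {
  return text.split(pattern).filter((token) => token !== '' && token !== undefined);
}

const defaultTokens = tokenize('foo_bar baz', defaultPattern);
// → ['foo_bar', ' ', 'baz']

const underscoreTokens = tokenize('foo_bar baz', underscorePattern);
// → ['foo', '_', 'bar', ' ', 'baz']
```

With the default pattern, foo_bar stays a single token because _ counts as a word character for \b. With the second pattern it becomes three tokens, so a change from foo_bar to foo_qux would be reported as a change to only the last part.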

jongaull-nimbly avatar Aug 18 '20 17:08 jongaull-nimbly

I second this functionality. I have a use case for diffing two HTML strings, and this would let me adjust the tokenizer to meet my needs.

SkySor44 avatar Dec 01 '21 21:12 SkySor44

Worth thinking about before I merge this: for Chinese and Japanese support, we might need tokenization logic too complicated to be encompassed in a regex, either built into jsdiff or as something you can plug in yourself: https://github.com/kpdecker/jsdiff/pull/328#issuecomment-1860452680. I want to carefully think through what I ultimately want the API to look like before merging this PR and make sure it's not gonna commit us to an API that's fundamentally incompatible with supporting Chinese and Japanese.
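
One possible shape for such an API, sketched purely as an assumption and not as anything jsdiff has committed to, is a tokenizer option that accepts either a split regex or a function. The names toTokens and segmenterTokenizer below are hypothetical; Intl.Segmenter is the standard ECMAScript internationalization API, and the word boundaries it produces depend on the runtime's ICU data.

```javascript
// Hypothetical helper, not jsdiff's API: the tokenizer may be a split
// pattern (RegExp) or a function that returns an iterable of tokens.
function toTokens(text, tokenizer) {
  const pieces = typeof tokenizer === 'function' ? tokenizer(text) : text.split(tokenizer);
  return Array.from(pieces).filter((piece) => piece !== '' && piece !== undefined);
}

// Function-style tokenizer built on the standard Intl.Segmenter with word
// granularity, for scripts that do not put spaces between words.
function segmenterTokenizer(locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return (text) => Array.from(segmenter.segment(text), (part) => part.segment);
}

// Regex form, as proposed in this PR.
const regexTokens = toTokens('foo_bar baz', /(\s+|[()[\]{}'"_]|\b)/);

// Function form; the exact segmentation depends on the runtime.
const chineseTokens = toTokens('我喜欢编程', segmenterTokenizer('zh'));
```

Accepting a function as well as a regex would keep the simple case from this PR while leaving room for segmentation that a single pattern cannot express.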

ExplodingCabbage avatar Dec 18 '23 13:12 ExplodingCabbage