jsdiff
Tokenize Regex as Parameter
This PR adds a tokenizer parameter, which gives the user more control over what constitutes a "token". If the parameter is not set, the default regex is used.
What's the use case here?
The default regex is /(\s+|[()[\]{}'"]|\b)/ and the one I am using is /(\s+|[()[\]{}'"_]|\b)/.
It's been a while since I worked on this, but it looks like I wanted to add _ as a tokenizing character. I could also see a use case for splitting on , when diffing CSV data, or on . when diffing file names.
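To make the difference between the two regexes concrete, here is a minimal sketch. It assumes the tokenizer regex is applied with String.prototype.split and that empty tokens are dropped; the `tokenize` helper and the constant names are hypothetical and are not part of jsdiff's API or of this PR.

```javascript
// The two regexes quoted in this thread.
const DEFAULT_TOKENIZER = /(\s+|[()[\]{}'"]|\b)/;
const UNDERSCORE_TOKENIZER = /(\s+|[()[\]{}'"_]|\b)/;

// Hypothetical helper: split on the regex (the capture group keeps the
// separators as tokens) and drop the empty strings that split() can emit.
function tokenize(text, tokenizer = DEFAULT_TOKENIZER) {
  return text.split(tokenizer).filter((token) => token !== undefined && token !== "");
}

const sample = "foo_bar baz";

// "_" is a word character, so the default regex keeps "foo_bar" whole.
const defaultTokens = tokenize(sample);
// With "_" in the character class, "foo_bar" becomes three tokens.
const underscoreTokens = tokenize(sample, UNDERSCORE_TOKENIZER);

console.log(defaultTokens);    // [ 'foo_bar', ' ', 'baz' ]
console.log(underscoreTokens); // [ 'foo', '_', 'bar', ' ', 'baz' ]
```

With the second regex, a change from foo_bar to foo_baz would show up as a change to one token rather than to the whole identifier.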
I second this functionality. I have a use case for diffing two html strings and this would enable me to adjust the tokenizer to meet my needs.
Worth thinking about before I merge this: for Chinese and Japanese support, we might need tokenization logic too complicated to be encompassed in a regex, either built into jsdiff or as something you can plug in yourself: https://github.com/kpdecker/jsdiff/pull/328#issuecomment-1860452680. I want to carefully think through what I ultimately want the API to look like before merging this PR and make sure it's not gonna commit us to an API that's fundamentally incompatible with supporting Chinese and Japanese.
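For illustration only, one shape a pluggable tokenizer could take is a plain function from a string to an array of tokens. The sketch below builds such a function on the standard Intl.Segmenter API. The name `segmenterTokenizer` is hypothetical, nothing here is an existing or proposed jsdiff API, and the exact word boundaries depend on the ICU data shipped with the runtime.

```javascript
// Hypothetical factory: returns a tokenizer function for the given locale.
// Intl.Segmenter is a standard ECMAScript Internationalization API.
function segmenterTokenizer(locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each segment object carries the matched text in its `segment` property.
  return (text) => Array.from(segmenter.segment(text), (part) => part.segment);
}

const input = "吾輩は猫である。";
const tokenizeJapanese = segmenterTokenizer("ja");
const tokens = tokenizeJapanese(input);

// The segments cover the input exactly, so joining them restores it.
console.log(tokens);
console.log(tokens.join("") === input);
```

A function-shaped parameter like this could also wrap a regex, which is one reason to settle the API shape before committing to a regex-only option.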