Ignore tokens matching character ranges

Open voidless opened this issue 3 years ago • 4 comments

Hi! Is it possible to add an option to ignore character ranges for tokens? If a whole token matches one ignored character set, it would be skipped. This would still catch mixed languages within a word, but would ignore languages that use a different character set.

We (unfortunately) write some comments and strings in Russian, and it triggers a Spellr warning almost every time. Simple dictionary checking doesn't work well with languages that have many cases (e.g. Russian, Hindi), because you have to add every case form of each word to validate properly, and I was unable to find such dictionaries.

voidless avatar May 14 '21 14:05 voidless

I've found a Russian dictionary with cases (35MB); it will work for our case.

voidless avatar May 14 '21 14:05 voidless

hi, did the dictionary you found solve your problem? is it a public dictionary that i could link for others in the documentation? how is the performance of spellr with a 35MB wordlist?

robotdana avatar Jun 12 '21 08:06 robotdana

ignoring character ranges is interesting though, i'll look into that, because it's already a problem for chinese and other scripts that don't really use word breaks. it should be doable in the regex with [[:alpha:]](?<!\p{Cyrillic}) or similar; i'll have a think about how to get that from the config to the regexes.

robotdana avatar Jun 12 '21 08:06 robotdana

I've used the dictionary from this repo: https://github.com/danakt/russian-words

The 35MB file is in Unicode; the original file was half that size in cp1251 encoding.

Spellr completes in around 4 seconds for 650k lines of code on my 6-core MacBook.

We are very happy with the results: we now spend less time on trivial errors during code review. We even found a few errors in our localization files.

voidless avatar Jun 15 '21 11:06 voidless