spellr
Ignore tokens matching character ranges
Hi! Is it possible to add an option to ignore character ranges for tokens? If the whole token matches one ignored character set then it will be skipped. This will still prevent mixed languages in a word but will ignore languages with different character sets.
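A minimal sketch of the requested behaviour, assuming the ignored ranges are expressed as Unicode script properties. The names IGNORED_SCRIPTS and skip_token? are hypothetical and are not part of Spellr:

```ruby
# Hypothetical helper, not Spellr's API: a token is skipped only when
# every character in it belongs to one ignored script.
IGNORED_SCRIPTS = [/\A\p{Cyrillic}+\z/].freeze

def skip_token?(token, ignored = IGNORED_SCRIPTS)
  ignored.any? { |pattern| pattern.match?(token) }
end

skip_token?("привет")      # entirely Cyrillic, so it would be skipped
skip_token?("hello")       # Latin, so it would still be spell-checked
skip_token?("h\u0435llo")  # Latin with a Cyrillic "е": mixed, still checked
```

Anchoring each pattern with \A and \z is what keeps mixed-script words from slipping through: a single character outside the ignored script makes the whole token checkable again.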
We (unfortunately) write some comments and strings in Russian, and it triggers a Spellr warning almost every time. Simple dictionary checking doesn't work well with languages that have many grammatical cases (e.g. Russian, Hindi), because you have to add every case form of each word to validate properly, and I was unable to find such dictionaries.
I've found a Russian dictionary with cases (35MB); it will work for our case.
hi, did the dictionary you found solve your problem? is it a public dictionary that i could link for others in the documentation? how is the performance of spellr with a 35MB wordlist?
ignoring character ranges is an interesting idea though, i'll look into that, because it's already a problem for chinese and other scripts that don't really use word breaks. it should be doable in the regex with [[:alpha:]](?<!\p{Cyrillic}) or similar, i'll have a think about how to get that from the config to the regexes.
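To illustrate how the lookbehind in that regex behaves, here is a small Ruby sketch. The constant NON_CYRILLIC_ALPHA and the method non_cyrillic_words are hypothetical names for illustration, not Spellr's actual tokenizer:

```ruby
# Matches one alphabetic character that is not Cyrillic: the negative
# lookbehind rejects the character that [[:alpha:]] has just consumed.
NON_CYRILLIC_ALPHA = /[[:alpha:]](?<!\p{Cyrillic})/

# Hypothetical tokenizer: collect runs of non-Cyrillic alphabetic characters.
def non_cyrillic_words(text)
  text.scan(/(?:#{NON_CYRILLIC_ALPHA.source})+/)
end

non_cyrillic_words("# привет world, это test")  # => ["world", "test"]
```

With this per-character form, a mixed-script word is split at the Cyrillic character rather than treated as one token, so it behaves differently from skipping only tokens that are entirely in an ignored script.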
I've used the dictionary from this repo: https://github.com/danakt/russian-words
The 35MB file is in Unicode; the original file was half that size in cp1251 encoding.
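The size difference follows from the encodings: cp1251 stores each Cyrillic letter in one byte, while UTF-8 (assumed here to be what "unicode" means) uses two. A rough Ruby sketch of the conversion, where cp1251_to_utf8 is a hypothetical helper name:

```ruby
# Hypothetical helper: reinterpret raw bytes as Windows-1251 (cp1251)
# and transcode them to UTF-8.
def cp1251_to_utf8(raw)
  raw.dup.force_encoding(Encoding::Windows_1251).encode(Encoding::UTF_8)
end

word   = "слово"                       # 5 Cyrillic letters
cp1251 = word.encode("Windows-1251")   # what the original file would contain

cp1251.bytesize                   # => 5  (one byte per letter)
cp1251_to_utf8(cp1251).bytesize   # => 10 (two bytes per letter)
```

Newlines stay one byte in both encodings, so a converted word list ends up somewhat under twice the original size.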
Spellr completes in around 4 seconds for 650k lines of code on my 6-core MacBook.
We are very happy with the results: we now spend less time on trivial errors during code review. We even found a few errors in our localization files.