dictionaries
Specify FLAG UTF-8 when converting to UTF-8, if there was no explicit FLAG option
Hunspell reads the affix file byte by byte and decodes UTF-8 on demand. If it's not instructed to do so for flags, it doesn't. So non-ASCII characters like "ý" are treated as several characters, and due to another bug Hunspell silently takes just the first of them and ignores the rest. As a result, words can end up with unexpected flags.
Example: pt contains FORBIDDENWORD ý, and the perfectly valid word trabalhar/akYMjLÀÚ is treated as having this flag and thus considered misspelled.
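To make the collision concrete, here is a minimal Python sketch. It only mimics byte-wise flag parsing as described above and is not Hunspell's actual code; the variable names are made up for illustration. It shows that ý, À and Ú all start with the same UTF-8 byte, so a flag truncated to its first byte matches the word's flags.

```python
# Illustration only: imitates byte-by-byte flag parsing, not Hunspell's real code.

# FORBIDDENWORD ý: "ý" is two bytes in UTF-8, and only the first one is kept.
forbidden_flag = "ý".encode("utf-8")[:1]

# Flags of trabalhar/akYMjLÀÚ, split byte by byte instead of character by character.
word_flags_bytewise = {bytes([b]) for b in "akYMjLÀÚ".encode("utf-8")}

# Flags split character by character, as FLAG UTF-8 would make Hunspell do.
word_flags_decoded = set("akYMjLÀÚ")

print(forbidden_flag in word_flags_bytewise)  # byte-wise: the word looks forbidden
print("ý" in word_flags_decoded)              # character-wise: it does not
```

With byte-wise parsing the first check is true because "À" and "Ú" begin with the same byte as "ý"; with character-wise parsing the word does not carry the forbidden flag.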
Yeah, good idea. I do remember thinking about this, but it never came up. Perhaps a sed expression in crawl.sh could do the trick. PR welcome!
Yes, some combination of bash and Unix text processing utilities should help. Neither of them is my strong suit, so I wouldn't hold your breath for a PR from me in the very near future :)
Shouldn’t this issue be about setting SET UTF-8 instead of using FLAG UTF-8? 🤔
No, it's not enough. At the moment of submission pt already had SET UTF-8, but Hunspell parses flags byte by byte, and needs to know that they're in UTF-8, too.
That sounds more complex than I thought...
But then this is a bug in Portuguese though? It should either use ASCII flags, or SET UTF-8?
Well, it was so. Now pt already has FLAG UTF-8, but there might be other dictionaries with this issue.
Hmm, that still seems like an issue for them though? Shouldn't that be fixed upstream, rather than patched here?
The issue should be addressed where the dictionaries are converted into UTF-8. My understanding was that it was here, at least partly. If I'm mistaken, then this is the wrong repo indeed :)
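For reference, the change proposed in this issue could look roughly like the sketch below. It is written in Python rather than as the sed expression mentioned earlier, and the function name `ensure_flag_utf8` is hypothetical, not something that exists in this repo. It assumes the affix file has already been converted to UTF-8 and is available as a string; it adds FLAG UTF-8 only when no explicit FLAG option (such as FLAG long or FLAG num) is present.

```python
import re

# Any explicit FLAG option (UTF-8, long, num) at the start of a line.
FLAG_RE = re.compile(r"^FLAG[ \t]+\S+", re.MULTILINE)
# The SET line, used as the anchor for the inserted option.
SET_RE = re.compile(r"^SET[ \t]+\S+.*$", re.MULTILINE)


def ensure_flag_utf8(aff_text: str) -> str:
    """Return the affix text with FLAG UTF-8 added if it has no FLAG option.

    Hypothetical helper; assumes aff_text is already converted to UTF-8.
    """
    if FLAG_RE.search(aff_text):
        # An explicit FLAG option exists: leave the file untouched.
        return aff_text
    set_match = SET_RE.search(aff_text)
    if set_match:
        # Insert right after the SET line so both options stay together.
        end = set_match.end()
        return aff_text[:end] + "\nFLAG UTF-8" + aff_text[end:]
    # No SET line found: put the option at the top of the file.
    return "FLAG UTF-8\n" + aff_text


print(ensure_flag_utf8("SET UTF-8\nFORBIDDENWORD ý\n"))
```

Running the function twice gives the same result as running it once, so it should be safe to apply on every crawl. Files whose flags are pure ASCII are unaffected in meaning, since ASCII characters are single bytes in UTF-8.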