
Specify FLAG UTF-8 when converting to UTF-8, if there was no explicit FLAG option

Open donnerpeter opened this issue 4 years ago • 8 comments

Hunspell reads the affix file byte by byte and decodes UTF-8 only on demand. If it's not instructed to do so for flags, it doesn't. So non-ASCII characters like "ý" are treated as several characters (one per byte), and due to another bug Hunspell silently takes just the first one and ignores the rest. So words can end up with unexpected flags.

Example: pt contains FORBIDDENWORD ý, and the perfectly valid word trabalhar/akYMjLÀÚ is treated as having this flag and thus considered misspelled.

donnerpeter avatar Feb 04 '21 18:02 donnerpeter
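To make the collision in the example above concrete, here is a minimal sketch that models the behaviour described in the report. It is not Hunspell's actual code: the helper names `flags_bytewise` and `flags_utf8` are hypothetical, and the "keep only the first byte" step is an assumption taken from the description above.

```python
def flags_bytewise(flag_field: str) -> set[int]:
    """Model single-byte flag parsing: every byte of the field is its own flag."""
    return set(flag_field.encode("utf-8"))


def flags_utf8(flag_field: str) -> set[str]:
    """Model FLAG UTF-8 parsing: every Unicode character is one flag."""
    return set(flag_field)


# "ý" is two bytes in UTF-8 (0xC3 0xBD). Per the report, only the first
# byte survives when the flag is read without FLAG UTF-8.
forbidden_bytes = "ý".encode("utf-8")
forbidden_bytewise = forbidden_bytes[0]  # 0xC3

word_flags = "akYMjLÀÚ"  # flag field of trabalhar/akYMjLÀÚ

# "À" (0xC3 0x80) and "Ú" (0xC3 0x9A) share the lead byte 0xC3 with "ý",
# so the byte-wise model sees the forbidden flag on the word.
collides_bytewise = forbidden_bytewise in flags_bytewise(word_flags)

# With character-wise flags, "ý" is simply not among the word's flags.
collides_utf8 = "ý" in flags_utf8(word_flags)

print(collides_bytewise, collides_utf8)
```

In this model the byte-wise reading reports a collision while the character-wise reading does not, which matches the symptom of a valid word being rejected.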

Yeah, good idea. I do remember thinking about this, but it never came up. Perhaps a sed expression in crawl.sh could do the trick. PR welcome!

wooorm avatar Feb 04 '21 18:02 wooorm

Yes, some combination of bash and Unix text-processing utilities should help. Neither of them is my strong suit, so I wouldn't hold my breath for a PR from me in the very near future :)

donnerpeter avatar Feb 04 '21 18:02 donnerpeter

Shouldn’t this issue be about setting SET UTF-8 instead of using FLAG UTF-8? 🤔

wooorm avatar Jun 23 '21 16:06 wooorm

No, it's not enough. At the moment of submission pt already had SET UTF-8, but Hunspell parses flags byte by byte, and needs to know that they're in UTF-8, too.

donnerpeter avatar Jun 23 '21 17:06 donnerpeter

That sounds more complex than I thought...

But, then this is a bug in Portuguese though? It should either use ASCII flags, or SET UTF-8?

wooorm avatar Jun 23 '21 17:06 wooorm

Well, it was so. Now pt already has FLAG UTF-8, but there might be other dictionaries with this issue.

donnerpeter avatar Jun 23 '21 18:06 donnerpeter

Hmm, that still seems like an issue for them though? That should be fixed in the upstream, rather than patched here?

wooorm avatar Jun 23 '21 18:06 wooorm

The issue should be addressed where the dictionaries are converted into UTF-8. My understanding was that it was here, at least partly. If I'm mistaken, then this is the wrong repo indeed :)

donnerpeter avatar Jun 23 '21 19:06 donnerpeter
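As an illustration of the fix proposed in the issue title, the following is a rough sketch of a post-conversion step that adds `FLAG UTF-8` only when the affix file has no explicit `FLAG` option. The function name `ensure_flag_utf8` is hypothetical and not part of this repository. It assumes the affix text has already been converted to UTF-8 and that options start at the beginning of a line.

```python
import re


def ensure_flag_utf8(aff_text: str) -> str:
    """Return aff_text with `FLAG UTF-8` added if no FLAG option is present."""
    lines = aff_text.splitlines(keepends=True)

    # An explicit FLAG option (UTF-8, long, num) is left untouched.
    if any(re.match(r"FLAG\s", line) for line in lines):
        return aff_text

    # Prefer placing the new option right after the SET line.
    for i, line in enumerate(lines):
        if re.match(r"SET\s", line):
            if not line.endswith("\n"):
                lines[i] = line + "\n"
            lines.insert(i + 1, "FLAG UTF-8\n")
            return "".join(lines)

    # No SET line found: fall back to putting the option at the top.
    return "FLAG UTF-8\n" + aff_text


print(ensure_flag_utf8("SET UTF-8\nFORBIDDENWORD ý\n"))
```

Leaving files with an explicit `FLAG` option alone matters because `FLAG long` and `FLAG num` dictionaries use a different flag syntax, and overriding it would misparse their flags.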