marian-dev
vocab files are malformed Yaml
I find that vocab files in Yaml format created with Marian are malformed. This makes it hard to convert them to other formats, e.g. using Python:
>>> x = yaml.load(open('vocab.src.yml','r', encoding='utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 2364: invalid start byte
What does Python say if you process your corpus and decode it from UTF-8? If it croaks there too, your corpus contains malformed UTF-8. There is also a chance that Python has broken UTF-8 support, though. I always find it really difficult to work with text files and UTF-8 in Python compared to, for instance, Perl.
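One way to run that check is sketched below. It reads the corpus as raw bytes and tries to decode each line strictly, reporting where decoding fails. The function names `find_invalid_utf8` and `scan_file` are made up for this illustration and are not part of Marian; only the Python standard library is used.

```python
from typing import Iterable, Iterator, Tuple


def find_invalid_utf8(lines: Iterable[bytes]) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, byte_offset_in_line, offending_byte) for every
    line that cannot be decoded as strict UTF-8. Only the first bad byte
    of each line is reported."""
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding raises on malformed input
        except UnicodeDecodeError as err:
            yield lineno, err.start, raw[err.start:err.start + 1]


def scan_file(path: str) -> None:
    """Print every malformed line of the file at `path` (hypothetical helper)."""
    with open(path, "rb") as f:  # binary mode: no implicit decoding
        for lineno, offset, byte in find_invalid_utf8(f):
            print(f"{path}:{lineno}: invalid byte {byte!r} at offset {offset}")
```

Calling `scan_file` on the training corpus would show whether a byte such as the 0xc1 from the traceback is already present in the input, which would point to the corpus rather than to Python's UTF-8 handling.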
@frankseide do we care about this one?
Yes. We do have corpora that have malformed UTF-8 characters. That is fine if we generate plain-text vocabs, but we should not create malformed files as a principle.
I don't know how to mitigate it. Maybe the correct solution is to reject malformed corpora in the first place if one wants to use Yaml vocabs.
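Until something like that exists, a pre-filter outside Marian could do the rejecting. The sketch below is only an assumption about how one might do it, not a Marian feature: it walks a parallel corpus and keeps a sentence pair only if both sides decode as strict UTF-8, so source and target stay aligned. The name `filter_parallel` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def filter_parallel(
    src_lines: Iterable[bytes], tgt_lines: Iterable[bytes]
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield only those (source, target) pairs where both lines are valid
    UTF-8. Dropping whole pairs keeps the two sides aligned."""
    for src, tgt in zip(src_lines, tgt_lines):
        if is_valid_utf8(src) and is_valid_utf8(tgt):
            yield src, tgt
```

Both files would be opened in binary mode (`"rb"`) and the surviving pairs written back out as bytes. Whether to drop such pairs or to repair them instead (for example by decoding with `errors="replace"`) depends on the corpus.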
Howdy!
Marian yaml is still malformed, which wasn't a biggie till I tried to parse it outside Marian. It didn't go well... Also, it is inconvenient that it is not documented which yaml standard Marian vocabs follow. Lastly, the branch supporting factors comes with some fsv parser that doesn't understand Marian yaml's double-escaped quote entries, for example:
"\"": 288
Hi, these look like different issues. We don't validate the UTF-8 correctness of your corpus, so the yaml files may be invalid; here I would say having clean corpora is on you. As for the fsv, that's not a yaml file, no?
I just realized that I might have been wrong about the malformed yaml. Cheers tho!