marian-dev vocab files are malformed Yaml

I find that vocab files in Yaml format created with Marian are malformed. This makes it hard to convert them to other formats, e.g. using Python:

>>> x = yaml.load(open('vocab.src.yml','r', encoding='utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...

  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 2364: invalid start byte

May 17 '18 02:05 frankseide

What does Python say if you process your corpus and decode it from UTF-8? If it croaks there too, your corpus contains malformed UTF-8. There is also a chance that Python has broken UTF-8 support, though. I always find it really difficult to work with text files and UTF-8 in Python compared to, for instance, Perl.
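For anyone who wants to run that check, here is a minimal sketch using only the Python standard library. It reads raw bytes and reports every offset where strict UTF-8 decoding fails. The function name `find_invalid_utf8` and the file name in the usage comment are hypothetical, not part of Marian.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, byte_value) for each position where strict UTF-8 decoding fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid sequence in the slice.
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            offset = pos + err.start  # err.start is relative to the slice
            bad.append((offset, data[offset]))
            pos += err.end  # resume scanning after the invalid sequence
    return bad


# Usage (hypothetical file name):
#   with open("corpus.src.txt", "rb") as f:
#       for offset, value in find_invalid_utf8(f.read()):
#           print(f"invalid byte 0x{value:02x} at offset {offset}")
```

Re-decoding the remaining slice after each error is quadratic in the worst case, so this is meant for spot checks rather than very large corpora. If it reports offsets in the corpus itself, the vocab file is only inheriting the problem.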

May 17 '18 02:05 emjotde

@frankseide do we care about this one?

Nov 07 '18 17:11 emjotde

Yes. We do have corpora that have malformed UTF-8 characters. That is fine if we generate plain-text vocabs, but we should not create malformed files as a principle.

I don't know how to mitigate it. Maybe the correct solution is to reject malformed corpora in the first place if one wants to use Yaml vocabs.

Nov 07 '18 18:11 frankseide

Howdy! Marian yaml is still malformed, which wasn't a biggie till I tried to parse it outside Marian. It didn't go well... Also, it is inconvenient that it is not documented what sort of yaml standard the marian vocab produces. Lastly, the branch supporting factors comes with some fsv parser that doesn't understand marian-yaml escaped double-quote entries, for example: "\"": 288
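To illustrate what an entry like that should mean, here is a rough sketch of reading a single `key: id` line with the standard library only. It assumes one entry per line, an integer id, and that any escapes inside a double-quoted key are JSON-compatible (true for `\"` and `\\`, not for every YAML escape). The name `parse_entry` is made up for this example and is not Marian's parser.

```python
import json
import re

# A double-quoted key (with backslash escapes) or a plain key, then ": <integer>".
ENTRY = re.compile(
    r'^(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<plain>.+?))\s*:\s*(?P<id>\d+)\s*$'
)


def parse_entry(line: str):
    """Parse one vocab line into (token, id); raise ValueError if it does not match."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"unrecognized vocab entry: {line!r}")
    if match.group("quoted") is not None:
        # Assumption: escapes are JSON-compatible, so the JSON string parser
        # can unescape them. YAML-only escapes would raise json.JSONDecodeError.
        token = json.loads('"' + match.group("quoted") + '"')
    else:
        token = match.group("plain")
    return token, int(match.group("id"))
```

With this, `parse_entry(r'"\"": 288')` yields the single character `"` mapped to 288. For anything beyond a quick conversion, a real YAML library is the safer choice.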

May 05 '21 11:05 tomsbergmanis

Hi, these look like different issues. We don't validate the UTF-8 correctness of your corpus, so the yaml files may be invalid; here I would say having clean corpora is on you. As for the fsv, that's not a yaml file, no?

May 05 '21 15:05 emjotde

I just realized that I might have been wrong about the malformed yaml. Cheers tho!

May 06 '21 11:05 tomsbergmanis
