marian-dev

vocab files are malformed Yaml

Open frankseide opened this issue 6 years ago • 6 comments

I find that vocab files in Yaml format created with Marian are malformed. This makes it hard to convert them to other formats, e.g. using Python:

>>> x = yaml.load(open('vocab.src.yml','r', encoding='utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...

  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 2364: invalid start byte

frankseide avatar May 17 '18 02:05 frankseide

What does Python say if you process your corpus and decode it from UTF-8? If it croaks there too, your corpus contains malformed UTF-8. There is also a chance that Python's UTF-8 support is broken. I always find it really difficult to work with UTF-8 text files in Python compared to, for instance, Perl.
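A quick way to run that check is sketched below. It reads the corpus in binary mode and tries to decode each line as UTF-8, collecting the lines that fail. The function name `find_invalid_utf8_lines` and the corpus path are made up for illustration; nothing here is part of Marian.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, byte_offset_in_line) for each line that is not valid UTF-8.

    Hypothetical helper for illustration; not part of Marian.
    """
    bad = []
    # Binary mode, so Python does not try to decode the file for us.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on malformed bytes
            except UnicodeDecodeError as err:
                # err.start is the offset of the first offending byte in this line
                bad.append((lineno, err.start))
    return bad
```

Calling `find_invalid_utf8_lines("corpus.src")` (path is a placeholder) on a clean corpus should return an empty list. If it returns entries, the same bad bytes can be expected to show up in a vocab built from that corpus.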

emjotde avatar May 17 '18 02:05 emjotde

@frankseide do we care about this one?

emjotde avatar Nov 07 '18 17:11 emjotde

Yes. We do have corpora that have malformed UTF-8 characters. That is fine if we generate plain-text vocabs, but we should not create malformed files as a principle.

I don't know how to mitigate it. Maybe the correct solution is to reject malformed corpora in the first place if one wants to use Yaml vocabs.
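If rejecting the corpus outright is too strict, one possible workaround outside Marian is to sanitize it before building the vocab. The sketch below copies a file and replaces every undecodable byte with U+FFFD, the Unicode replacement character. The function name `sanitize_utf8` and both paths are assumptions for illustration, and this is not something Marian does itself.

```python
def sanitize_utf8(src_path, dst_path):
    """Copy src_path to dst_path, replacing malformed UTF-8 bytes with U+FFFD.

    Returns the number of lines that had to be changed.
    Hypothetical helper for illustration; not part of Marian.
    """
    changed = 0
    # newline="" keeps the original line endings untouched on write.
    with open(src_path, "rb") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            text = raw.decode("utf-8", errors="replace")
            # A valid line re-encodes to the same bytes; a repaired one does not.
            if text.encode("utf-8") != raw:
                changed += 1
            dst.write(text)
    return changed
```

A return value of 0 means the corpus was already valid UTF-8. Note that replacing bytes changes the affected tokens, so whether this is acceptable depends on how much of the corpus is damaged.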

frankseide avatar Nov 07 '18 18:11 frankseide

Howdy! Marian yaml is still malformed, which wasn't a biggie till I tried to parse it outside Marian. It didn't go well... Also, it is inconvenient that it is not documented which yaml standard the Marian vocab follows. Lastly, the branch supporting factors comes with some fsv parser that doesn't understand marian-yaml double-escaped quote entries, for example: "\"": 288

tomsbergmanis avatar May 05 '21 11:05 tomsbergmanis

Hi, these look like different issues. We don't validate the utf-8 correctness of your corpus, so the yaml files may be invalid. Here I would say having clean corpora is on you. As for the fsv, that's not a yaml file, no?

emjotde avatar May 05 '21 15:05 emjotde

I just realized that I might have been wrong about the malformed yaml. Cheers tho!

tomsbergmanis avatar May 06 '21 11:05 tomsbergmanis