Imbalanced quotation mark in Mozilla Common Voice Japanese Dataset

Open calcoloergosum opened this issue 1 year ago • 1 comments

Summary

The Mozilla Common Voice 11.0 Japanese dataset contains an unbalanced quotation mark that makes bin/import_cv2.py crash.

Reproduction

$ python bin/import_cv2.py cv-corpus-11.0-2022-09-21/ja/ --validate_label_locale $SOMETHING
...
Loading TSV file:  /mnt/ntfs/dev/voice/ja_simple/validated.tsv
Traceback (most recent call last):
  File "bin/import_cv2.py", line 255, in <module>
    main()
  File "bin/import_cv2.py", line 250, in main
    _preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
  File "bin/import_cv2.py", line 196, in _preprocess_data
    set_samples = _maybe_convert_set(
  File "bin/import_cv2.py", line 130, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
...

Why does it happen?

cv-corpus-11.0-2022-09-21/ja/validated.tsv contains 4 lines whose sentence field begins with a double quotation mark, which can break the csv module's quote handling.

$ cat ../common-voice-filter/cv-corpus-11.0-2022-09-21/ja/validated.tsv | grep '     "'
3447120ac93b7c7788687c259b7f55058804e4982c36174a9a0af762495a6c2310915d2b10562a1f75255d5b0a18eefb304ef7b042006d96d83158f22d238de8        common_voice_ja_26130815.mp3    "では、危険だということですか?"と彼は武者震いをしながら言った。 2       0       twenties        male            ja
3447120ac93b7c7788687c259b7f55058804e4982c36174a9a0af762495a6c2310915d2b10562a1f75255d5b0a18eefb304ef7b042006d96d83158f22d238de8        common_voice_ja_26134634.mp3    "ローデシアから来たのを覚えているだろう」「なんてことだ、殺人犯め!」と彼は声を詰まらせた。       2       0       twenties        male            ja
02a8841a00d762472a4797b56ee01643e8d9ece5a225f2e91c007ab1f94c49c99e50d19986ff3fefb18190257323f34238828114aa607f84fbe9764ecf5aaeaa        common_voice_ja_26015806.mp3    "パン・アム・クリッパーコネクション" バナーのもと、定期通勤サービスを運営していた。      2       0       fourties        female          ja
02a8841a00d762472a4797b56ee01643e8d9ece5a225f2e91c007ab1f94c49c99e50d19986ff3fefb18190257323f34238828114aa607f84fbe9764ecf5aaeaa        common_voice_ja_26127330.mp3    "もちろん違います。"ドロシーは答えました。 "私は何をすべきか?"  2       0       fourties        female          ja

Note that in the second match the quotation marks are unbalanced: the sentence opens with " but the quote is closed with 」. I assume this is related to Japanese input methods. Japanese usually uses 「」 rather than "", so the quotes have to be converted by hand, and for some reason this one was not converted properly.

Meanwhile, Python's csv module uses the double quotation mark as its default quote character. After the unbalanced opening quote, the parser therefore treats everything up to the next quotation mark as a single field. The next occurrence is on line 31236 (3712 lines later), so the field grows past the default size limit, hence the error message: _csv.Error: field larger than field limit (131072)
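The effect can be reproduced on a small scale. The following is a minimal sketch, not taken from import_cv2.py: SAMPLE_TSV and read_rows are hypothetical names, and the sample rows are made up to mimic the unbalanced sentence. With the default dialect, the rows after the stray quote are absorbed into one field instead of being parsed as separate rows.

```python
import csv
import io

# Hypothetical miniature TSV: the second data row opens a field with '"'
# but never closes it, like the unbalanced sentence in validated.tsv.
SAMPLE_TSV = (
    "path\tsentence\n"
    "a.mp3\tfirst sentence\n"
    "b.mp3\t\"unbalanced quote」here\n"
    "c.mp3\tthird sentence\n"
    "d.mp3\tfourth sentence\n"
)


def read_rows(text, **reader_kwargs):
    """Parse tab-separated text and return the rows as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **reader_kwargs))


# Default dialect: '"' is the quote character, so the open quote keeps
# consuming newlines and tabs until another '"' (or end of input) appears.
default_rows = read_rows(SAMPLE_TSV)
print(len(default_rows))   # fewer rows than the file has lines
print(default_rows[-1])    # rows c.mp3 and d.mp3 ended up inside one field
```

In the real file the swallowed span is thousands of lines long, which is why the field exceeds the csv module's size limit before a closing quote is found.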

Fix

Do not use the default quote character. In fact, quote handling can be disabled entirely when parsing the TSV, for example by passing quoting=csv.QUOTE_NONE to the reader. That is what the Common Voice ToolBox package does too.
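A minimal sketch of this fix, assuming the importer reads the TSV with csv.DictReader (as the traceback suggests). This is not the actual patch: read_tsv_without_quoting and SAMPLE_TSV are hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical miniature TSV with one unbalanced '"' in the sentence column.
SAMPLE_TSV = (
    "path\tsentence\n"
    "a.mp3\tfirst sentence\n"
    "b.mp3\t\"unbalanced quote」here\n"
    "c.mp3\tthird sentence\n"
    "d.mp3\tfourth sentence\n"
)


def read_tsv_without_quoting(handle):
    """Read a tab-separated file, treating '"' as an ordinary character."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_tsv_without_quoting(io.StringIO(SAMPLE_TSV))
print(len(rows))             # one dict per data line
print(rows[1]["sentence"])   # the stray quote is kept as literal text
```

With quoting disabled, each physical line maps to exactly one row, and any quotation marks stay in the sentence text for later label validation to handle.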

calcoloergosum, Dec 11 '22 21:12