STT
Imbalanced quotation mark in Mozilla Common Voice Japanese Dataset
Summary
The Mozilla Common Voice 11.0 Japanese dataset contains an unbalanced quotation mark that makes bin/import_cv2.py crash with a _csv.Error.
Reproduction
$ python bin/import_cv2.py cv-corpus-11.0-2022-09-21/ja/ --validate_label_locale $SOMETHING
...
Loading TSV file: /mnt/ntfs/dev/voice/ja_simple/validated.tsv
Traceback (most recent call last):
File "bin/import_cv2.py", line 255, in <module>
main()
File "bin/import_cv2.py", line 250, in main
_preprocess_data(PARAMS.tsv_dir, audio_dir, PARAMS.space_after_every_character)
File "bin/import_cv2.py", line 196, in _preprocess_data
set_samples = _maybe_convert_set(
File "bin/import_cv2.py", line 130, in _maybe_convert_set
for row in reader:
File "/usr/lib/python3.8/csv.py", line 111, in __next__
row = next(self.reader)
_csv.Error: field larger than field limit (131072)
...
Why does it happen?
cv-corpus-11.0-2022-09-21/ja/validated.tsv has 4 lines that can confuse the quote handling of Python's csv package.
$ cat ../common-voice-filter/cv-corpus-11.0-2022-09-21/ja/validated.tsv | grep ' "'
3447120ac93b7c7788687c259b7f55058804e4982c36174a9a0af762495a6c2310915d2b10562a1f75255d5b0a18eefb304ef7b042006d96d83158f22d238de8 common_voice_ja_26130815.mp3 "では、危険だということですか?"と彼は武者震いをしながら言った。 2 0 twenties male ja
3447120ac93b7c7788687c259b7f55058804e4982c36174a9a0af762495a6c2310915d2b10562a1f75255d5b0a18eefb304ef7b042006d96d83158f22d238de8 common_voice_ja_26134634.mp3 "ローデシアから来たのを覚えているだろう」「なんてことだ、殺人犯め!」と彼は声を詰まらせた。 2 0 twenties male ja
02a8841a00d762472a4797b56ee01643e8d9ece5a225f2e91c007ab1f94c49c99e50d19986ff3fefb18190257323f34238828114aa607f84fbe9764ecf5aaeaa common_voice_ja_26015806.mp3 "パン・アム・クリッパーコネクション" バナーのもと、定期通勤サービスを運営していた。 2 0 fourties female ja
02a8841a00d762472a4797b56ee01643e8d9ece5a225f2e91c007ab1f94c49c99e50d19986ff3fefb18190257323f34238828114aa607f84fbe9764ecf5aaeaa common_voice_ja_26127330.mp3 "もちろん違います。"ドロシーは答えました。 "私は何をすべきか?" 2 0 fourties female ja
Note that in the second line, the quotation marks are unbalanced: the sentence opens with " but the quoted passage closes with 」. I assume this has something to do with how Japanese text is typed. Japanese usually uses 「」 instead of "", so the marks need manual conversion, and for some reason this one was not converted properly.
Meanwhile, Python's csv module uses the double quotation mark as its default quote character. When a field opens with ", the parser keeps reading across line breaks until the next quotation mark appears. The next occurrence is on line 31236 (3712 lines later), so the field grows past the size limit, hence the error message: _csv.Error: field larger than field limit (131072)
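The mechanism can be reproduced on a small scale. The following is a minimal sketch using synthetic data (the file names and sentences are made up, not taken from Common Voice) and an artificially low field size limit, so the numbers differ from the real traceback:

```python
import csv
import io

# Synthetic TSV (not real Common Voice data): the sentence on the second line
# opens with " and never closes it.
tsv = (
    "path\tsentence\n"
    'a.mp3\t"unbalanced sentence\n'
    "b.mp3\tnormal sentence\n"
)

# With default settings the reader treats " as the quote character, so the
# quoted field swallows the line break and the whole following line.
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
print(len(rows))  # 2 rows (header + one merged row) instead of 3
print(rows[1])

# Lowering the field size limit makes the same input raise the same kind of
# error as in the traceback, just with a smaller limit.
old_limit = csv.field_size_limit(16)
try:
    list(csv.reader(io.StringIO(tsv), delimiter="\t"))
    error_message = None
except csv.Error as exc:
    error_message = str(exc)
finally:
    csv.field_size_limit(old_limit)  # restore the global limit
print(error_message)
```

In the real file the merged field spans thousands of lines, which is why it exceeds the default limit of 131072 characters.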
Fix
Do not use the default quote character. In fact, disable quote handling entirely when parsing the TSV file. The Common Voice ToolBox package does the same.
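One way to do this with the standard csv module is to pass quoting=csv.QUOTE_NONE, which makes the reader treat " as an ordinary character. The sketch below assumes the importer reads the TSV through csv.DictReader; the helper name read_tsv_rows and the sample data are hypothetical and not part of bin/import_cv2.py:

```python
import csv
import io


def read_tsv_rows(handle):
    """Yield dict rows from a tab-separated file with quote handling disabled.

    Hypothetical helper for illustration; not the actual importer code.
    """
    # QUOTE_NONE tells the reader to perform no special processing of quotes.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


# Synthetic sample with an unbalanced " at the start of a sentence.
sample = (
    "path\tsentence\n"
    'a.mp3\t"unbalanced sentence\n'
    "b.mp3\tnormal sentence\n"
)

fixed_rows = list(read_tsv_rows(io.StringIO(sample)))
for row in fixed_rows:
    print(row["path"], row["sentence"])
```

With quoting disabled, each line stays its own row and the stray quotation mark is kept as part of the sentence text, so any cleanup of the transcript has to happen in a later step.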