Bug: Common Voice Import script ignoring capitalisation

Open xXaRoXx opened this issue 2 years ago • 0 comments

Describe the bug The 'import_cv2.py' script ignores the provided alphabet.txt file, which contains both uppercase and lowercase characters.

To Reproduce Steps to reproduce the behavior:

  1. Download the German Common Voice dataset.

  2. Run the following command: import_cv2.py --filter_alphabet GermanUppercaseLowercase.txt CommonVoice/cv-corpus-8.0-2022-01-19/de/ GermanUppercaseLowercase.txt

  3. The resulting csv files in CommonVoice/cv-corpus-8.0-2022-01-19/de/clips do not contain any uppercase characters at all.

     From train.tsv:
     de0d6b9264f03245f5d6101f5eb1f979d7160466ea12d8569546ce84e1751a68beeb533f32d26ad93a70cd3d1bae29fb8b346ea64e672ff7e556b4f221e42611 common_voice_de_27627045.mp3 Im Alter von vier Jahren wurde sie für die "Sesamstraße" entdeckt. 2 0 de

     From train.csv:
     common_voice_de_27627045.wav,237356,im alter von vier jahren wurde sie für die sesamstraße entdeckt
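To confirm this across a whole file rather than by eye, a check along these lines could be used. It is only a sketch: the function name `rows_with_uppercase` is made up for illustration, and the header names (`wav_filename`, `wav_filesize`, `transcript`) are assumed from the three-column row quoted above.

```python
import csv
import io


def rows_with_uppercase(csv_file):
    """Return every transcript that contains at least one uppercase character."""
    reader = csv.DictReader(csv_file)
    return [
        row["transcript"]
        for row in reader
        if any(ch.isupper() for ch in row["transcript"])
    ]


# In-memory sample mirroring the train.csv row quoted above.
# The header line is an assumption about the generated file.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "common_voice_de_27627045.wav,237356,"
    "im alter von vier jahren wurde sie für die sesamstraße entdeckt\n"
)
print(rows_with_uppercase(sample))  # → []
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample. An empty list means no transcript kept any capital letter.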

Expected behavior The resulting csv files should either keep the uppercase characters or at least exclude that specific line. According to the documentation on the wiki, "This alphabet is used to exclude all audio files whose transcripts contain characters not in the specified alphabet." By that rule, the entry above should have been excluded, because '"' and '.' are not in the alphabet.txt. Instead, the line appears in the generated files, including the specified 'ß', but in lowercase and without the punctuation. So it seems the script just converts all text to lowercase?
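To make the expectation concrete, here is a minimal sketch of the case-sensitive filtering I had in mind. This is not how 'import_cv2.py' is implemented; the function `keep_if_in_alphabet` and the `allowed` set are hypothetical and only illustrate the rule quoted from the wiki.

```python
def keep_if_in_alphabet(transcript, allowed):
    """Return the transcript unchanged if every character is allowed, else None.

    No lowercasing or punctuation stripping is done, so 'I' and 'i' are
    treated as different characters.
    """
    if all(ch in allowed for ch in transcript):
        return transcript
    return None


# Hypothetical alphabet: upper- and lowercase letters, umlauts, 'ß' and space.
# The characters '"' and '.' are deliberately missing, as in my alphabet file.
allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜß ")

print(keep_if_in_alphabet("Im Alter von vier Jahren", allowed))
# → Im Alter von vier Jahren
print(keep_if_in_alphabet('wurde sie für die "Sesamstraße" entdeckt.', allowed))
# → None
```

With this behaviour the quoted sentence would be dropped because of '"' and '.', and any sentence that passes would keep its capitalisation.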

Environment (please complete the following information): official docker image

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): official docker image run on Manjaro

  • TensorFlow installed from (our builds, or upstream TensorFlow): docker image

  • Python version: 3.6.9

  • GPU model and memory: Nvidia 1050 Ti 4 GB

  • Exact command to reproduce:

root@1361010df370:/code# bin/import_cv2.py --filter_alphabet /AI/alphabetGermanUppercaseL.txt /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/
2022-03-11 16:48:51.918219: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Loading TSV file: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/test.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
[same warning repeated; duplicates omitted]
Progress |############################################################################# | 99% completed
Imported 15375 samples.
Skipped 381 samples that failed on transcript validation.
Skipped 251 samples that were longer than 10 seconds.
Final amount of imported audio: 25:28:03 from 26:54:46.
Saving new Coqui STT-formatted CSV file to: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/test.csv
Writing CSV file for train.py as: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/test.csv
Progress |##############################################################################| 100% completed
Loading TSV file: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/dev.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
[same warning repeated; duplicates omitted]
Progress |############################################################################# | 99% completed
Imported 15392 samples.
Skipped 407 samples that failed on transcript validation.
Skipped 208 samples that were longer than 10 seconds.
Final amount of imported audio: 25:22:35 from 26:45:15.
Saving new Coqui STT-formatted CSV file to: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/dev.csv
Writing CSV file for train.py as: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/dev.csv
Progress |##############################################################################| 100% completed
Loading TSV file: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/train.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
[same warning repeated; duplicates omitted]
Progress |############################################################################# | 99% completed
Imported 405170 samples.
Skipped 11129 samples that failed on transcript validation.
Skipped 3864 samples that were longer than 10 seconds.
Final amount of imported audio: 631:48:31 from 663:21:53.
Saving new Coqui STT-formatted CSV file to: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/train.csv
Writing CSV file for train.py as: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/train.csv
Progress |##############################################################################| 100% completed
Loading TSV file: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/validated.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
[same warning repeated; duplicates omitted]
Progress |##############################################################################| 100% completed
Imported 728713 samples.
Skipped 13256 samples that failed on transcript validation.
Skipped 4513 samples that were longer than 10 seconds.
Final amount of imported audio: 1013:45:49 from 1050:27:28.
Saving new Coqui STT-formatted CSV file to: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/validated.csv
Writing CSV file for train.py as: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/validated.csv
Progress |##############################################################################| 100% completed
Saving new Coqui STT-formatted CSV file to: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/train-all.csv
Writing CSV file for train.py as: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/train-all.csv
Progress |##############################################################################| 100% completed
Loading TSV file: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/other.tsv
Importing mp3 files...
WARNING: No --validate_label_locale specified, you might end with inconsistent dataset.
[same warning repeated; duplicates omitted]
Progress |##############################################################################| 100% completed
Imported 4922 samples.
Skipped 176 samples that failed on transcript validation.
Skipped 1 samples that were too short to match the transcript.
Skipped 16 samples that were longer than 10 seconds.
Final amount of imported audio: 6:56:48 from 7:17:20.
Saving new Coqui STT-formatted CSV file to: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/other.csv
Writing CSV file for train.py as: /AI/CommonVoice/cv-corpus-8.0-2022-01-19/de/clips/other.csv
Progress |##############################################################################| 100% completed

Additional context I want to train the model so that it can capitalise German speech on its own, but the import seems to remove all capitalisation before the model can be trained on it. Is this the right approach, or did I miss something on the wiki?

train.tsv.zip train.csv.zip

xXaRoXx · Mar 11 '22 19:03