
[Bug] Error while Fine-Tuning TTS for Japanese Language

Open mahimairaja opened this issue 4 months ago • 4 comments

Describe the bug

It seems that there is a hidden issue in the dataset preparation for fine-tuning TTS on the Japanese language.

To Reproduce

  1. Clone the repo and install the packages.
> git clone --branch xtts_demo -q https://github.com/coqui-ai/TTS.git

> pip install --use-deprecated=legacy-resolver -q -e TTS

> pip install --use-deprecated=legacy-resolver -q -r TTS/TTS/demos/xtts_ft_demo/requirements.txt

> pip install -q typing_extensions==4.8 numpy==1.26.2
  2. Launch the fine-tuning GUI.
> python TTS/TTS/demos/xtts_ft_demo/xtts_demo.py
  3. Add a few Japanese speech audio samples in the dataset processing tab and click Create Dataset.

  4. Move to the fine-tuning tab and run the training.

Then the following error message pops up:

The training was interrupted due an error !! Please check the console to check the full error message!
Error summary:
Traceback (most recent call last):
  File "/content/TTS/TTS/demos/xtts_ft_demo/xtts_demo.py", line 284, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
  File "/content/TTS/TTS/demos/xtts_ft_demo/utils/gpt_train.py", line 138, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "/content/TTS/TTS/tts/datasets/__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError: [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv

Expected behavior

The fine-tuning process should run without interruption.

Logs

>> DVAE weights restored from: /tmp/xtts_ft/run/training/XTTS_v2.0_original_model_files/dvae.pth
Traceback (most recent call last):
  File "/content/TTS/TTS/demos/xtts_ft_demo/xtts_demo.py", line 284, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
  File "/content/TTS/TTS/demos/xtts_ft_demo/utils/gpt_train.py", line 138, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "/content/TTS/TTS/tts/datasets/__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError:  [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv
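
A note on the doubled path in that message: it appears to be an artifact of the assert's f-string rather than a real filesystem path. A minimal sketch reproducing just the formatting, with values copied from the traceback above:

# Values taken from the traceback. If meta_file_train already holds an
# absolute path, the assert's f-string prepends root_path verbatim, which
# is what produces the doubled /tmp/xtts_ft/dataset in the message.
root_path = "/tmp/xtts_ft/dataset"
meta_file_train = "/tmp/xtts_ft/dataset/metadata_train.csv"
print(f" [!] No training samples found in {root_path}/{meta_file_train}")
# -> [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv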

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.20.6",
        "numpy": "1.26.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023"
    }
}

Additional context

No response

mahimairaja avatar Feb 17 '24 07:02 mahimairaja

The same error occurs with Chinese; the data preprocessing function doesn't seem to work with CJK characters.

jianchang512 avatar Feb 17 '24 11:02 jianchang512

Alright, is anyone already working on this issue?

mahimairaja avatar Feb 17 '24 12:02 mahimairaja

This website is also owned by Microsoft; you can give it a try:

https://tts.byylook.com/ai/text-to-speech

rose07 avatar Feb 19 '24 02:02 rose07

This error message AssertionError: [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv happens because the dataset processing step didn't generate any dataset, and the fine-tuning step (the next tab) relies on it.
After dataset processing finishes, your dataset directory should have the following structure:

[screenshot: dataset directory containing the wavs folder, metadata_eval.csv, and metadata_train.csv]

The wavs directory contains the dataset divided into clips, while metadata_eval.csv and metadata_train.csv map each clip to its corresponding transcription, as in the screenshot below (taken from a dataset of Arabic voices).

[screenshot: metadata_train.csv rows mapping clips in wavs/ to their Arabic transcriptions]
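
For reference, each row in these files pairs one clip with its text. Assuming the demo's default pipe-separated coqui-style layout (the file names and sentences below are illustrative, not from a real run), metadata_train.csv would look roughly like:

audio_file|text|speaker_name
wavs/sample_00000.wav|こんにちは、今日はいい天気ですね。|speaker_1
wavs/sample_00001.wav|明日の会議は十時に始まります。|speaker_1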

  • Check the quality of the input data; high-quality audio files help the data processing.
  • Provide more input samples.
  • If you're using a Whisper model for the ASR step, try a larger version of it.
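
As a quick sanity check before moving to the fine-tuning tab, a short script like this (a minimal sketch; the /tmp/xtts_ft path is the demo's default from the traceback above, and the pipe delimiter assumes the coqui-style metadata format) can confirm the dataset step actually wrote non-empty metadata files:

import csv
import os

dataset_dir = "/tmp/xtts_ft/dataset"  # the demo's default output path
for name in ("metadata_train.csv", "metadata_eval.csv"):
    path = os.path.join(dataset_dir, name)
    if not os.path.isfile(path):
        print(f"missing: {path}")
        continue
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="|"))  # assumed pipe-separated
    # first row is the header; the rest are clip/text pairs
    print(f"{path}: {max(len(rows) - 1, 0)} samples")

If either file is missing or reports 0 samples, the problem is in the dataset processing step, not the training step.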

zaher-m avatar Apr 19 '24 06:04 zaher-m