
[Bug] Error while Fine-Tuning TTS for Japanese Language

Open mahimairaja opened this issue 4 months ago • 4 comments

Describe the bug

It seems that there is a hidden issue in the dataset preparation for fine-tuning TTS on the Japanese language.

To Reproduce

  1. Clone the repo and install the packages.
> git clone --branch xtts_demo -q https://github.com/coqui-ai/TTS.git

> pip install --use-deprecated=legacy-resolver -q -e TTS

> pip install --use-deprecated=legacy-resolver -q -r TTS/TTS/demos/xtts_ft_demo/requirements.txt

> pip install -q typing_extensions==4.8 numpy==1.26.2
  2. Launch the fine-tuning GUI.
> python TTS/TTS/demos/xtts_ft_demo/xtts_demo.py
  3. Add a few Japanese speech audio samples in the dataset processing tab and click Create Dataset.

  4. Move to the fine-tuning tab and run the training.

Then the following error message pops up:

The training was interrupted due an error !! Please check the console to check the full error message!
Error summary:
Traceback (most recent call last):
  File "/content/TTS/TTS/demos/xtts_ft_demo/xtts_demo.py", line 284, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
  File "/content/TTS/TTS/demos/xtts_ft_demo/utils/gpt_train.py", line 138, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "/content/TTS/TTS/tts/datasets/__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError: [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv

Expected behavior

The fine-tuning process should run without interruption.

Logs

>> DVAE weights restored from: /tmp/xtts_ft/run/training/XTTS_v2.0_original_model_files/dvae.pth
Traceback (most recent call last):
  File "/content/TTS/TTS/demos/xtts_ft_demo/xtts_demo.py", line 284, in train_model
    config_path, original_xtts_checkpoint, vocab_file, exp_path, speaker_wav = train_gpt(language, num_epochs, batch_size, grad_acumm, train_csv, eval_csv, output_path=output_path, max_audio_length=max_audio_length)
  File "/content/TTS/TTS/demos/xtts_ft_demo/utils/gpt_train.py", line 138, in train_gpt
    train_samples, eval_samples = load_tts_samples(
  File "/content/TTS/TTS/tts/datasets/__init__.py", line 121, in load_tts_samples
    assert len(meta_data_train) > 0, f" [!] No training samples found in {root_path}/{meta_file_train}"
AssertionError:  [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv
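
A note on the doubled path in that message: it appears to be an artifact of the assert's f-string rather than a real filesystem path. A minimal sketch reproducing just the formatting, with values copied from the traceback above:

# Values taken from the traceback. If meta_file_train already holds an
# absolute path, the assert's f-string prepends root_path verbatim, which
# is what produces the doubled /tmp/xtts_ft/dataset in the message.
root_path = "/tmp/xtts_ft/dataset"
meta_file_train = "/tmp/xtts_ft/dataset/metadata_train.csv"
print(f" [!] No training samples found in {root_path}/{meta_file_train}")
# -> [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv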

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.20.6",
        "numpy": "1.26.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#1 SMP PREEMPT_DYNAMIC Sat Nov 18 15:31:17 UTC 2023"
    }
}

Additional context

No response

mahimairaja avatar Feb 17 '24 07:02 mahimairaja

The same error occurs with Chinese; the data preprocessing function doesn't seem to work with CJK characters.

jianchang512 avatar Feb 17 '24 11:02 jianchang512

Alright, is anyone already working on this issue?

mahimairaja avatar Feb 17 '24 12:02 mahimairaja

This website is also owned by Microsoft; you can give it a try:

https://tts.byylook.com/ai/text-to-speech

rose07 avatar Feb 19 '24 02:02 rose07

This error message AssertionError: [!] No training samples found in /tmp/xtts_ft/dataset//tmp/xtts_ft/dataset/metadata_train.csv happens because the dataset processing step didn't generate any dataset, and the fine-tuning step (the next tab) relies on it.
After dataset processing finishes, your dataset directory should have the following structure:

[screenshot: dataset directory containing the wavs folder, metadata_eval.csv, and metadata_train.csv]

The wavs directory contains the dataset divided into clips, while metadata_eval.csv and metadata_train.csv map each clip to its corresponding transcription, as in the screenshot below (taken from a dataset of Arabic voices).

[screenshot: metadata_train.csv rows mapping clips in wavs/ to their Arabic transcriptions]
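
For reference, each row in these files pairs one clip with its text. Assuming the demo's default pipe-separated coqui-style layout (the file names and sentences below are illustrative, not from a real run), metadata_train.csv would look roughly like:

audio_file|text|speaker_name
wavs/sample_00000.wav|こんにちは、今日はいい天気ですね。|speaker_1
wavs/sample_00001.wav|明日の会議は十時に始まります。|speaker_1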

  • Check the quality of the input data; high-quality audio files help the data processing.
  • Provide more input samples.
  • If you're using a Whisper model for the ASR step, try a larger version of it.
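
As a quick sanity check before moving to the fine-tuning tab, a short script like this (a minimal sketch; the /tmp/xtts_ft path is the demo's default from the traceback above, and the pipe delimiter assumes the coqui-style metadata format) can confirm the dataset step actually wrote non-empty metadata files:

import csv
import os

dataset_dir = "/tmp/xtts_ft/dataset"  # the demo's default output path
for name in ("metadata_train.csv", "metadata_eval.csv"):
    path = os.path.join(dataset_dir, name)
    if not os.path.isfile(path):
        print(f"missing: {path}")
        continue
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="|"))  # assumed pipe-separated
    # first row is the header; the rest are clip/text pairs
    print(f"{path}: {max(len(rows) - 1, 0)} samples")

If either file is missing or reports 0 samples, the problem is in the dataset processing step, not the training step.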

zaher-m avatar Apr 19 '24 06:04 zaher-m