
TRAINING.MD update for correct setup given dependency requirements for older python version (python3.10-venv & pip==24.0 inside venv)

robit-man opened this issue 8 months ago · 2 comments

Training Guide

Check out a video training guide by Thorsten Müller

For Windows, see ssamjh's guide using WSL


Training a voice for Piper involves 3 main steps:

  1. Preparing the dataset
  2. Training the voice model
  3. Exporting the voice model

Choices must be made at each step, including the model quality, the sample rate, and whether the dataset has a single speaker or multiple speakers.

Getting Started

Start by installing system dependencies:

sudo apt install python3.10-dev python3.10-venv

If those packages are not available, add the deadsnakes PPA (which provides older Python versions), update the package lists, and try again:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update

Then create a Python virtual environment:

cd piper/src/python
python3.10 -m venv .venv
source .venv/bin/activate
pip install pip==24.0
pip install --upgrade wheel setuptools
pip install -e .

Run the build_monotonic_align.sh script in the src/python directory to build the extension.

Ensure you have espeak-ng installed (sudo apt-get install espeak-ng).
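Before moving on, it can help to confirm the environment matches what this guide assumes. The sketch below is not part of Piper; the function name and the Python 3.10 default are ours, taken from the pins in this guide:

```python
import shutil
import sys

def check_environment(python_version=(3, 10), required_tools=("espeak-ng",)):
    """Return a list of problems found; an empty list means the setup looks OK."""
    problems = []
    if sys.version_info[:2] != python_version:
        problems.append(
            f"expected Python {python_version[0]}.{python_version[1]}, "
            f"got {sys.version_info.major}.{sys.version_info.minor}"
        )
    for tool in required_tools:
        # shutil.which checks whether the executable is on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems

if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside the activated virtual environment so the interpreter being checked is the one the venv provides.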

Preparing a Dataset

The Piper training scripts expect two files that can be generated by python3 -m piper_train.preprocess:

  • A config.json file with the voice settings
    • audio (required)
      • sample_rate - audio rate in hertz
    • espeak (required)
      • language - espeak-ng voice or alphabet
    • num_symbols (required)
      • Number of phonemes in the model (typically 256)
    • num_speakers (required)
      • Number of speakers in the dataset
    • phoneme_id_map (required)
      • Map from a phoneme (UTF-8 codepoint) to a list of ids
      • Id 0 ("_") is padding (pad)
      • Id 1 ("^") is the beginning of an utterance (bos)
      • Id 2 ("$") is the end of an utterance (eos)
      • Id 3 (" ") is a word separator (whitespace)
    • phoneme_type
      • "espeak" or "text"
      • "espeak" phonemes use espeak-ng
      • "text" phonemes use a pre-defined alphabet
    • speaker_id_map
      • Map from a speaker name to id
    • phoneme_map
      • Map from a phoneme (UTF-8 codepoint) to a list of phonemes
    • inference
      • noise_scale - noise added to the generator (default: 0.667)
      • length_scale - speaking speed (default: 1.0)
      • noise_w - phoneme width variation (default: 0.8)
  • A dataset.jsonl file with one line per utterance (JSON objects)
    • phoneme_ids (required)
      • List of ids for each utterance phoneme (0 <= id < num_symbols)
    • audio_norm_path (required)
    • audio_spec_path (required)
    • speaker_id (required for multi-speaker)
      • Id of the utterance's speaker (0 <= id < num_speakers)
    • audio_path
      • Absolute path to original audio file
    • text
      • Original text of utterance before phonemization
    • phonemes
      • Phonemes from utterance text before converting to ids
    • speaker
      • Name of utterance speaker (from speaker_id_map)
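Putting the fields above together, a single dataset.jsonl line might look like the following. All paths and values here are invented for illustration; the field names follow the list above:

```python
import json

# A hypothetical utterance record shaped like one dataset.jsonl line.
utterance = {
    "phoneme_ids": [1, 20, 15, 3, 29, 2],  # 0 <= id < num_symbols; 1 = bos, 2 = eos
    "audio_norm_path": "/path/to/training_dir/cache/1234.pt",
    "audio_spec_path": "/path/to/training_dir/cache/1234.spec.pt",
    "speaker_id": 0,  # required for multi-speaker datasets only
    "audio_path": "/path/to/dataset_dir/wav/1234.wav",
    "text": "This is a test.",
}

# Each line of dataset.jsonl is one JSON object like this.
line = json.dumps(utterance)
parsed = json.loads(line)
assert all(0 <= pid < 256 for pid in parsed["phoneme_ids"])
```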

Dataset Format

The pre-processing script expects data to be a directory with:

  • metadata.csv - CSV file with text, audio filenames, and speaker names
  • wav/ - directory with audio files

The metadata.csv file uses | as a delimiter and has 2 or 3 columns, depending on whether the dataset has a single speaker or multiple speakers. There is no header row.

For single speaker datasets:

id|text

where id is the name of the WAV file in the wav directory. For example, an id of 1234 means that wav/1234.wav should exist.

For multi-speaker datasets:

id|speaker|text

where speaker is the name of the utterance's speaker. Speaker ids will automatically be assigned based on the number of utterances per speaker (speaker id 0 has the most utterances).
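The assignment rule above (speaker id 0 goes to the speaker with the most utterances) can be sketched as follows. This mirrors the described behavior against the `id|speaker|text` layout; it is not Piper's actual implementation:

```python
import csv
import io
from collections import Counter

def assign_speaker_ids(metadata_text):
    """Assign speaker ids by descending utterance count (id 0 = most utterances)."""
    counts = Counter()
    reader = csv.reader(io.StringIO(metadata_text), delimiter="|")
    for _utt_id, speaker, _text in reader:
        counts[speaker] += 1
    # most_common() sorts by count descending, which matches the rule above
    return {speaker: idx for idx, (speaker, _n) in enumerate(counts.most_common())}

metadata = "1|alice|Hello there.\n2|alice|Another line.\n3|bob|Only one line.\n"
print(assign_speaker_ids(metadata))  # → {'alice': 0, 'bob': 1}
```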

Pre-processing

An example of pre-processing a single speaker dataset:

python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /path/to/dataset_dir/ \
  --output-dir /path/to/training_dir/ \
  --dataset-format ljspeech \
  --single-speaker \
  --sample-rate 22050

The --language argument refers to an espeak-ng voice by default, such as de for German.

To pre-process a multi-speaker dataset, remove the --single-speaker flag and ensure that your dataset has the 3 columns: id|speaker|text. Verify the number of speakers in the generated config.json file before proceeding.
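That verification can be done with a quick check of the generated config.json. The function below is a sketch (the path in the usage comment is a placeholder); the `num_speakers` field matches the config layout described earlier:

```python
import json

def check_speaker_count(config_path, expected_speakers):
    """Compare num_speakers in a generated config.json against what you expect."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    actual = config["num_speakers"]
    if actual != expected_speakers:
        raise ValueError(f"expected {expected_speakers} speakers, config has {actual}")
    return actual

# Example usage (path is hypothetical):
# check_speaker_count("/path/to/training_dir/config.json", expected_speakers=2)
```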

Training a Model

Once you have a config.json, dataset.jsonl, and audio files (.pt) from pre-processing, you can begin the training process with python3 -m piper_train

For most cases, you should fine-tune from an existing model. The model must have the same audio quality and sample rate, but does not necessarily need to be in the same language.

It is highly recommended to train with the following Dockerfile:

FROM nvcr.io/nvidia/pytorch:22.03-py3

RUN pip3 install \
    'pytorch-lightning'

ENV NUMBA_CACHE_DIR=.numba_cache

As an example, we will fine-tune the medium quality lessac voice. Download the .ckpt file and run the following command in your training environment:

python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.0 \
    --num-test-examples 0 \
    --max_epochs 10000 \
    --resume_from_checkpoint /path/to/lessac/epoch=2164-step=1355540.ckpt \
    --checkpoint-epochs 1 \
    --precision 32

Use --quality high to train a larger voice model (sounds better, but is much slower).

You can adjust the validation split (5% = 0.05) and number of test examples for your specific dataset. For fine-tuning, they are often set to 0 because the target dataset is very small.

Batch size can be tricky to get right. It depends on the size of your GPU's vRAM, the model's quality/size, and the length of the longest sentence in your dataset. The --max-phoneme-ids <N> argument to piper_train will drop sentences that have more than N phoneme ids. In practice, using --batch-size 32 and --max-phoneme-ids 400 will work for 24 GB of vRAM (RTX 3090/4090).
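The effect of `--max-phoneme-ids` can be illustrated with a small filter over dataset.jsonl lines. This mirrors the behavior described above (dropping over-long sentences), not Piper's exact code:

```python
import json

def filter_by_phoneme_count(jsonl_lines, max_phoneme_ids):
    """Keep utterances with at most max_phoneme_ids phoneme ids; drop the rest."""
    kept, dropped = [], 0
    for line in jsonl_lines:
        utt = json.loads(line)
        if len(utt["phoneme_ids"]) > max_phoneme_ids:
            dropped += 1
        else:
            kept.append(utt)
    return kept, dropped

lines = [
    json.dumps({"phoneme_ids": list(range(10))}),   # short sentence: kept
    json.dumps({"phoneme_ids": list(range(500))}),  # over the limit: dropped
]
kept, dropped = filter_by_phoneme_count(lines, max_phoneme_ids=400)
print(len(kept), dropped)  # → 1 1
```

Dropping long outliers this way is what lets a fixed batch size fit in vRAM: the memory cost of a batch scales with its longest sentence.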

Multi-Speaker Fine-Tuning

If you're training a multi-speaker model, use --resume_from_single_speaker_checkpoint instead of --resume_from_checkpoint. This will be much faster than training your multi-speaker model from scratch.

Testing

To test your voice during training, you can use these test sentences or generate your own with piper-phonemize. Run the following command to generate audio files:

cat test_en-us.jsonl | \
    python3 -m piper_train.infer \
        --sample-rate 22050 \
        --checkpoint /path/to/training_dir/lightning_logs/version_0/checkpoints/*.ckpt \
        --output-dir /path/to/training_dir/output

The input format to piper_train.infer is the same as dataset.jsonl: one line of JSON per utterance with phoneme_ids and speaker_id (multi-speaker only). Generate your own test file with piper-phonemize:

lib/piper_phonemize -l en-us --espeak-data lib/espeak-ng-data/ < my_test_sentences.txt > my_test_phonemes.jsonl

Tensorboard

Check on your model's progress with tensorboard:

tensorboard --logdir /path/to/training_dir/lightning_logs

Click on the scalars tab and look at both loss_disc_all and loss_gen_all. In general, the model is "done" when loss_disc_all levels off. We've found that 2000 epochs is usually good for models trained from scratch, and an additional 1000 epochs when fine-tuning.
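"Levels off" can be made slightly more concrete: compare the mean of the most recent loss values with the mean of the window before it. This heuristic is our own sketch (window size and threshold are arbitrary), not something Piper computes for you:

```python
def has_leveled_off(losses, window=100, threshold=0.01):
    """Heuristic: loss has leveled off if the last two window means differ by < threshold."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return abs(previous - recent) < threshold

# A flat tail levels off; a steadily falling loss does not.
flat = [1.0] * 50 + [0.5] * 200
falling = [1.0 - 0.004 * i for i in range(250)]
print(has_leveled_off(flat), has_leveled_off(falling))  # → True False
```

In practice you would export the loss_disc_all scalar from tensorboard and feed its values in as the list.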

Exporting a Model

When your model is finished training, export it to ONNX with:

python3 -m piper_train.export_onnx \
    /path/to/model.ckpt \
    /path/to/model.onnx
    
cp /path/to/training_dir/config.json \
   /path/to/model.onnx.json

The export script does additional optimization of the model with onnx-simplifier.

If the export is successful, you can now use your voice with Piper:

echo 'This is a test.' | \
  piper -m /path/to/model.onnx --output_file test.wav

robit-man · Jul 02 '25 17:07

Hey robit,

During preprocessing I get this message several times

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/pipertts/piper/src/python/piper_train/preprocess.py", line 502, in <module>
    main()
  File "/data/pipertts/piper/src/python/piper_train/preprocess.py", line 219, in main
    proc.start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/pipertts/piper/src/python/piper_train/preprocess.py", line 315, in phonemize_batch_espeak
    utt.audio_norm_path, utt.audio_spec_path = cache_norm_audio(
  File "/data/pipertts/piper/src/python/piper_train/norm_audio/__init__.py", line 73, in cache_norm_audio
    audio_norm_tensor = torch.FloatTensor(audio_norm_array).unsqueeze(0)
/data/pipertts/piper/src/python/piper_train/norm_audio/__init__.py:73: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:77.)
  audio_norm_tensor = torch.FloatTensor(audio_norm_array).unsqueeze(0)

pip install 'numpy>=1.19.0,<2.0'

should help.

AndrewSteel · Jul 03 '25 15:07

Hey @robit-man,

there is still the error:

ImportError: cannot import name '_compare_version' from 'torchmetrics.utilities.imports' (/data/pipertts/pipervenv/lib/python3.10/site-packages/torchmetrics/utilities/imports.py)

pip install torchmetrics==0.11.4

should help.

AndrewSteel · Jul 05 '25 19:07