encodec-pytorch
This is an unofficial implementation of the paper High Fidelity Neural Audio Compression in PyTorch.
The LibriTTS960h 24kHz EnCodec checkpoint and discriminator checkpoint are released at https://huggingface.co/zkniu/encodec-pytorch/tree/main
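If you just want those checkpoints locally, the Hugging Face repository can be cloned like any git repository. This is only a sketch; it assumes git-lfs is installed so the large checkpoint files are actually fetched:

# the released checkpoints are stored with git LFS on the Hugging Face Hub
git lfs install
git clone https://huggingface.co/zkniu/encodec-pytorch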
I hope we can get together to do something meaningful and rebuild encodec in this repo.
Introduction
This repository is based on encodec and EnCodec_Trainer.
Based on the EnCodec_Trainer, I have made the following changes:
- support multi-gpu training.
- support AMP training (you need to reduce the learning rate and scale the VQ epsilon from 1e-5 to 1e-3; see issue 8 for the reason)
- support hydra configuration management.
- align the loss functions and hyperparameters.
- support warmup scheduler in training.
- add a test script for evaluating the model.
- support tensorboard to monitor the training process.
TODO:
- [ ] support the 48khz model.
Environments
The code is tested on the following environment:
- Python 3.9
- PyTorch 2.0.0 / PyTorch 1.13
- GeForce RTX 3090 x 4 / V100-16G x 8 / A40 x 3
To run the code, you can install the dependencies listed in requirements.txt.
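For example, a typical setup in a fresh virtual environment looks like this (the environment name is arbitrary):

# create and activate an isolated environment, then install the pinned dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt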
Usage
Training
1. Prepare dataset
I use LibriSpeech as the training dataset and use datasets/generate_train_file.py to generate the train CSV used during training. You can check datasets/generate_train_file.py and customAudioDataset.py to understand how to prepare your own dataset.
You can also use ln -s to link the dataset into the datasets folder; a minimal sketch follows below.
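As a rough sketch of the preparation step (the corpus path is a placeholder, and the exact arguments of generate_train_file.py may differ, so check the script itself):

# link an existing corpus into the datasets folder instead of copying it
ln -s /path/to/LibriSpeech datasets/librispeech
# generate the train CSV consumed by the training script
# (read datasets/generate_train_file.py for its actual arguments)
python datasets/generate_train_file.py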
[Optional] Docker image
I provide a dockerfile to build a docker image with all the necessary dependencies.
- Building the image
docker build -t encodec:v1 .
- Using the image
# CPU running
docker run encodec:v1 <command> # you can add some parameters, such as -tid
# GPU running
docker run --gpus=all encodec:v1 <command>
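In practice you will usually want to mount your dataset (and a folder for checkpoints) into the container. The target paths below are only an illustration and depend on where the image expects the code to live:

# GPU training with the local datasets folder mounted into the container
docker run --gpus=all -tid \
    -v $(pwd)/datasets:/workspace/encodec-pytorch/datasets \
    encodec:v1 bash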
2. Train
You can use the following command to train the model on multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 python train_multi_gpu.py \
distributed.torch_distributed_debug=False \
distributed.find_unused_parameters=True \
distributed.world_size=4 \
common.save_interval=2 \
common.test_interval=2 \
common.max_epoch=100 \
datasets.tensor_cut=100000 \
datasets.batch_size=8 \
datasets.train_csv_path=YOUR TRAIN DATA.csv \
lr_scheduler.warmup_epoch=20 \
optimization.lr=5e-5 \
optimization.disc_lr=5e-5
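Training writes TensorBoard logs (see the feature list above), so you can watch the losses while a run is going. The log directory below is an assumption; Hydra usually places run outputs under outputs/, so point --logdir at wherever your run actually stores its event files:

# point TensorBoard at the directory containing the tfevents files
tensorboard --logdir outputs --port 6006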
Note:
- If you set a small datasets.tensor_cut, you can set a large datasets.batch_size to speed up training.
- When training on your own dataset, I suggest choosing audio of moderate length: if you train encodec with a 1-second tensor_cut on a small dataset, the model does not perform well.
- If you encounter RuntimeError(f"Mismatch in number of params: ours is {len(params)}, at least one worker has a different one."), you can use a smaller datasets.tensor_cut to solve this problem.
- If your torch version is lower than 1.8, you need to check the default value of torch.stft(return_complex) in audio_to_mel.py.
- If you encounter a bug with multi-GPU training, you can try setting distributed.torch_distributed_debug=True to get more information about the problem.
- The single-GPU training method is similar to the multi-GPU one; you only need to add the distributed.data_parallel=False parameter to the command, like this:

python train_multi_gpu.py distributed.data_parallel=False \
    common.save_interval=5 \
    common.max_epoch=100 \
    datasets.tensor_cut=72000 \
    datasets.batch_size=4 \
    datasets.train_csv_path=YOUR TRAIN DATA.csv \
    lr_scheduler.warmup_epoch=10 \
    optimization.lr=5e-5 \
    optimization.disc_lr=5e-5

- The loss does not converge to zero, but the model can still be used to compress and decompress audio. You can use compression.sh to test your model every log_interval epochs.
- The original paper's dataset is larger than 17000h, but I only use LibriTTS960h to train the model, so the model is not good enough. If you want to train a better model, you can use a larger dataset.
- The code is not well tested, so there may be some bugs. If you encounter any problems, you can open an issue or contact me by email.
- When I added AMP training, I found the RVQ loss was always nan, so I use an L2 norm to normalize quantize and x, like the following code (actually, it is unstable):
  quantize = F.normalize(quantize)
  commit_loss = F.mse_loss(quantize.detach(), x)
- When you use AMP training, you need to reduce the learning rate and scale the VQ epsilon from 1e-5 to 1e-3; see issue 8 for the reason.
- I suggest focusing on the generator loss; the commit loss may not converge. You can check objective metrics such as PESQ and STOI.
Test
I have added a shell script to compress and decompress audio at different bandwidths; you can use compression.sh to test your model.
The script can be used as follows:
sh compression.sh INPUT_WAV_FILE [MODEL_NAME] [CHECKPOINT]
- INPUT_WAV_FILE is the wav file you want to test
- MODEL_NAME is the model name; the default is encodec_24khz, and encodec_48khz, my_encodec, and encodec_bw are also supported.
- CHECKPOINT is the checkpoint path; when your MODEL_NAME is my_encodec, you can point to your own checkpoint (see the example below).
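For example, to test a checkpoint you trained yourself (the file names here are just placeholders):

sh compression.sh test_24k.wav my_encodec checkpoints/epoch100.pt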
If you want to test the model at a specific bandwidth, you can use the following command:
python main.py -r -b [bandwidth] -f [INPUT_FILE] [OUTPUT_WAV_FILE] -m [MODEL_NAME] -c [CHECKPOINT]
main.py comes from the original encodec repository; you can use -h to check the help information.
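For example, to compress and reconstruct a file at 6 kbps with your own checkpoint (file names are placeholders):

python main.py -r -b 6 -f input_24k.wav output_24k.wav -m my_encodec -c checkpoints/epoch100.pt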
Acknowledgement
Thanks to the following repositories:
- encodec
- EnCodec_Trainer
- melgan-neurips: audio_to_mel.py
LICENSE
The code uses the same LICENSE as encodec.