fish-diffusion
An easy to understand TTS / SVS / SVC training framework.
Check our Wiki to get started!
Chinese documentation
Terms of Use for Fish Diffusion
- Obtaining Authorization and Intellectual Property Infringement: The user is solely accountable for acquiring the necessary authorization for any datasets utilized in their training process and assumes full responsibility for any infringement issues arising from the utilization of the input source. Fish Diffusion and its developers disclaim all responsibility for any complications that may emerge due to the utilization of unauthorized datasets.
- Proper Attribution: Any derivative works based on Fish Diffusion must explicitly acknowledge the project and its license. In the event of distributing Fish Diffusion's code or disseminating results generated by this project, the user is obliged to cite the original author and source code (Fish Diffusion).
- Audiovisual Content and AI-generated Disclosure: All derivative works created using Fish Diffusion, including audio or video materials, must explicitly acknowledge the utilization of the Fish Diffusion project and declare that the content is AI-generated. If incorporating videos or audio published by third parties, the original links must be furnished.
- Agreement to Terms: By persisting in the use of Fish Diffusion, the user unequivocally consents to the terms and conditions delineated in this document. Neither Fish Diffusion nor its developers shall be held liable for any subsequent difficulties that may transpire.
Summary
Fish Diffusion uses diffusion models to solve different voice generation tasks. Compared with the original diff-svc repository, this repository has the following advantages:
- Supports multiple speakers
- The code structure is simpler and easier to understand, and all modules are decoupled
- Supports the 44.1 kHz DiffSinger community vocoder
- Supports multi-machine, multi-device training and half-precision training, which speeds up training and saves memory
Preparing the environment
Run the following commands in a conda environment with Python 3.10.
# Install PyTorch related core dependencies, skip if installed
# Reference: https://pytorch.org/get-started/locally/
conda install "pytorch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" pytorch-cuda=11.8 -c pytorch -c nvidia
# Install PDM dependency management tool, skip if installed
# Reference: https://pdm.fming.dev/latest/
curl -sSL https://raw.githubusercontent.com/pdm-project/pdm/main/install-pdm.py | python3 -
# Install the project dependencies
pdm sync
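After installation, a quick check can confirm that the interpreter and the PyTorch packages named above are visible. The following is a minimal sketch, not part of the project: the helper names `python_matches`, `package_installed`, and `environment_report` are hypothetical, and the script only checks that packages are importable, not their versions or CUDA support.

```python
import importlib.util
import sys

# The README asks for a Python 3.10 conda environment.
REQUIRED_PYTHON = (3, 10)


def python_matches(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True when the interpreter's major.minor equals the required one."""
    return tuple(version_info[:2]) == tuple(required)


def package_installed(name):
    """Return True when the top-level package `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def environment_report(packages=("torch", "torchvision", "torchaudio")):
    """Collect a small dict describing the current environment."""
    report = {"python_ok": python_matches()}
    for name in packages:
        report[name] = package_installed(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Any entry reported as `False` suggests that the corresponding install step was skipped or ran in a different environment.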
Vocoder preparation
Fish Diffusion requires the FishAudio NSF-HiFiGAN vocoder to generate audio.
Automatic download
python tools/download_nsf_hifigan.py
If you download the model with the script, you can pass the --agree-license parameter to agree to the CC BY-NC-SA 4.0 license.
python tools/download_nsf_hifigan.py --agree-license
Manual download
Download and unzip nsf_hifigan-stable-v1.zip from Fish Diffusion Release
Copy the nsf_hifigan folder to the checkpoints directory (create it if it does not exist)
If you want to download ContentVec manually, you can download it from here and put it in the checkpoints directory.
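Whether the vocoder ended up in the expected place can be checked with a few lines of Python. This is a sketch under assumptions: the function name `vocoder_ready` is hypothetical, and because the file names inside the nsf_hifigan folder are not stated here, the check only verifies that the folder exists and contains at least one file.

```python
from pathlib import Path


def vocoder_ready(checkpoints_dir="checkpoints", folder="nsf_hifigan"):
    """Return True when checkpoints/nsf_hifigan exists and holds at least one file."""
    target = Path(checkpoints_dir) / folder
    if not target.is_dir():
        return False
    # The exact file names are not assumed; any regular file counts.
    return any(path.is_file() for path in target.rglob("*"))


if __name__ == "__main__":
    if vocoder_ready():
        print("Vocoder folder found.")
    else:
        print("Vocoder folder missing or empty; download it first.")
```

Run it from the repository root so that the relative `checkpoints` path resolves correctly.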
Dataset preparation
Put your dataset into the dataset directory using the following file structure:
dataset
├───train
│   ├───xxx1-xxx1.wav
│   ├───...
│   ├───Lxx-0xx8.wav
│   └───speaker0 (Subdirectory is also supported)
│       └───xxx1-xxx1.wav
└───valid
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
# Extract all data features, such as pitch, text features, mel features, etc.
python tools/preprocessing/extract_features.py --config configs/svc_hubert_soft.py --path dataset --clean
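Before running feature extraction, it can help to confirm that the layout described above is in place. The sketch below counts the .wav files in each split, including speaker subdirectories. The helper names `list_wavs` and `summarize_dataset` are hypothetical and not part of the project.

```python
from pathlib import Path


def list_wavs(split_dir):
    """Return sorted .wav paths under split_dir, including speaker subdirectories."""
    split_dir = Path(split_dir)
    if not split_dir.is_dir():
        return []
    return sorted(split_dir.rglob("*.wav"))


def summarize_dataset(root="dataset", splits=("train", "valid")):
    """Map each split name to its number of .wav files (0 if the split is missing)."""
    root = Path(root)
    return {split: len(list_wavs(root / split)) for split in splits}


if __name__ == "__main__":
    for split, count in summarize_dataset().items():
        status = "ok" if count else "no .wav files found"
        print(f"{split}: {count} file(s) ({status})")
```

A count of zero for either split usually means the files were placed one directory level too high or too low.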
Baseline training
The project is under active development; please back up your config file.
# Single machine single card / multi-card training
python tools/diffusion/train.py --config configs/svc_hubert_soft.py
# Multi-node training
python tools/diffusion/train.py --config configs/svc_content_vec_multi_node.py
# Environment variables need to be defined on each node; see https://pytorch-lightning.readthedocs.io/en/1.6.5/clouds/cluster.html for more information.
# Resume training
python tools/diffusion/train.py --config configs/svc_hubert_soft.py --resume [checkpoint file]
# Fine-tune the pre-trained model
# Note: You should adjust the learning rate scheduler in the config file to warmup_cosine_finetune
python tools/diffusion/train.py --config configs/svc_cn_hubert_soft_finetune.py --pretrained [checkpoint file]
Inference
# Inference from the shell; use --help to view more parameters
python tools/diffusion/inference.py --config [config] \
--checkpoint [checkpoint file] \
--input [input audio] \
--output [output audio]
# Gradio web inference; other parameters are used as the Gradio default values
python tools/diffusion/inference.py --config [config] \
--checkpoint [checkpoint file] \
--gradio
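To convert a whole folder of audio files, the shell command above can be driven from Python. This is a sketch, not a project feature: it uses only the --config, --checkpoint, --input, and --output flags shown above, and the helper names `build_inference_command`, `plan_batch`, and `run_batch` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(config, checkpoint, input_audio, output_audio):
    """Build the argument list for one tools/diffusion/inference.py call."""
    return [
        sys.executable,
        "tools/diffusion/inference.py",
        "--config", str(config),
        "--checkpoint", str(checkpoint),
        "--input", str(input_audio),
        "--output", str(output_audio),
    ]


def plan_batch(config, checkpoint, input_dir, output_dir):
    """Return one command per .wav file in input_dir, mirroring the file names."""
    output_dir = Path(output_dir)
    return [
        build_inference_command(config, checkpoint, wav, output_dir / wav.name)
        for wav in sorted(Path(input_dir).glob("*.wav"))
    ]


def run_batch(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_batch(plan_batch(config, checkpoint, input_dir, output_dir))` from the repository root would then process each file in turn. Each call reloads the model, so this is simple rather than fast.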
Convert a DiffSVC model to Fish Diffusion
python tools/diffusion/diff_svc_converter.py --config configs/svc_hubert_soft_diff_svc.py \
--input-path [DiffSVC ckpt] \
--output-path [Fish Diffusion ckpt]
Contributing
If you have any questions, please submit an issue or pull request.
Run pdm run lint before submitting a pull request.
Real-time documentation can be generated with:
pdm run docs
Credits
- diff-svc original
- diff-svc optimized
- DiffSinger Paper
- so-vits-svc
- iSTFTNet Paper
- CookieTTS
- HiFi-GAN Paper
- Retrieval-based-Voice-Conversion