Cross-Speaker-Emotion-Transfer
The generated wav is not good
Hi, thank you for open-sourcing this wonderful work!
I followed your instructions: 1) install lightconv_cuda,
2) download the checkpoint, and 3) download the speaker embedding npy.
However, the generated result is not good.
Below is the command I ran:
python3 synthesize.py \
--text "Hello world" \
--speaker_id Actor_22 \
--emotion_id sad \
--restore_step 450000 \
--mode single \
--dataset RAVDESS
# sh run.sh
2022-11-30 13:45:22.626404: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Device of XSpkEmoTrans: cuda
Removing weight norm...
Raw Text Sequence: Hello world
Phoneme Sequence: {HH AH0 L OW1 W ER1 L D}
ENV
python 3.6.8
fairseq 0.10.2
torch 1.7.0+cu110
CUDA 11.0
Hi @pangtouyuqqq , thanks for your attention. This is because of the dataset: it contains only two different texts (you will get more natural output if you try one of them). If you need to generate unseen text, it may help to train on another dataset that has more generic text-speech pairs. It would also be helpful to replace the lightweight convolution with a transformer when you do that.
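For the last suggestion, a minimal sketch of swapping a lightweight-convolution block for a standard Transformer encoder layer might look like the following. The module name and dimensions are illustrative assumptions, not taken from the XSpkEmoTrans code; the (T, B, C) transposes keep it compatible with the torch 1.7 reported in ENV, which predates the `batch_first` option:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Hypothetical drop-in replacement for a LightConv block.

    Maps (batch, time, channels) -> (batch, time, channels), so it can
    slot into an encoder that previously used lightweight convolution.
    """
    def __init__(self, d_model=256, n_heads=2, d_ff=1024, dropout=0.1):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_ff, dropout=dropout,
        )

    def forward(self, x, padding_mask=None):
        # nn.TransformerEncoderLayer expects (T, B, C) in torch 1.7,
        # so transpose in and out; padding_mask is (B, T) bool,
        # True at padded positions.
        x = x.transpose(0, 1)
        x = self.layer(x, src_key_padding_mask=padding_mask)
        return x.transpose(0, 1)

x = torch.randn(2, 50, 256)   # (batch, time, channels)
y = TransformerBlock()(x)
print(y.shape)                # torch.Size([2, 50, 256])
```

Since self-attention has a global receptive field while lightweight convolution is local, this swap tends to generalize better to unseen text at the cost of more compute.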