Cross-Speaker-Emotion-Transfer

The generated wav is not good

Open · pangtouyuqqq opened this issue on Nov 30, 2022 · 1 comment

Hi, thank you for open-sourcing this wonderful work! I followed your instructions: 1) install lightconv_cuda, 2) download the checkpoint, 3) download the speaker embedding npy files. However, the generated result is not good.
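For reference, a sketch of the three setup steps. The lightconv_cuda build commands follow fairseq's lightconv_layer directory; the checkpoint and embedding paths are placeholders, not the repo's actual layout:

```shell
# 1) Build the lightconv_cuda extension from a fairseq checkout
cd fairseq/fairseq/modules/lightconv_layer
python cuda_function_gen.py
python setup.py install
cd -

# 2) Place the downloaded checkpoint where synthesize.py expects it
#    (path below is illustrative, check the repo README for the real one)
# mv 450000.pth.tar output/ckpt/RAVDESS/

# 3) Place the downloaded speaker embedding .npy files similarly
```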

Below is my running command

python3 synthesize.py \
  --text "Hello world" \
  --speaker_id Actor_22 \
  --emotion_id sad \
  --restore_step 450000 \
  --mode single \
  --dataset RAVDESS
# sh run.sh 
2022-11-30 13:45:22.626404: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Device of XSpkEmoTrans: cuda
Removing weight norm...
Raw Text Sequence: Hello world
Phoneme Sequence: {HH AH0 L OW1 W ER1 L D}

ENV

python 3.6.8
fairseq                 0.10.2
torch                   1.7.0+cu110
CUDA 11.0

Hello world_Actor_22_sad

Hello world_Actor_22_sad.wav.zip

pangtouyuqqq avatar Nov 30 '22 13:11 pangtouyuqqq

Hi @pangtouyuqqq , thanks for your attention. This happens because the dataset contains only two distinct texts (you will get more natural output if you synthesize one of those two). If you need to generate unseen text, it may help to train on another dataset with more generic text-speech pairs. It would also be helpful to replace the lightweight convolution with a Transformer when you do that.
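A minimal sketch of that last suggestion: swapping the fairseq LightweightConv blocks for a standard PyTorch Transformer encoder layer. The `d_model` and `n_heads` values here are illustrative, not the repo's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not the repo's actual config
d_model, n_heads = 256, 4

# Standard Transformer encoder layer as a drop-in replacement
# for a lightweight-convolution block of the same channel width
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=1024)

x = torch.randn(2, 50, d_model)  # (batch, frames, channels)
# TransformerEncoderLayer expects (time, batch, channels) by default
y = layer(x.transpose(0, 1)).transpose(0, 1)
print(tuple(y.shape))  # -> (2, 50, 256), same shape as the input
```

Since the output shape matches the input, the surrounding encoder/decoder stack would not need other changes beyond retraining.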

keonlee9420 avatar Dec 01 '22 12:12 keonlee9420