
Process for training the three described techniques with any source speaker

PAAYAS opened this issue 1 year ago • 13 comments

Hello, @unilight. As part of my work on the LSC model, I want to convert the accent of another speaker, say ASI from the L2-ARCTIC dataset, to BDL (the target). The pretrained models you have provided appear to have been trained on the TXHC speaker. Could you give us a thorough training procedure so we can train the non-parallel frame-based VC model or the vocoders on any source speaker?

I look forward to hearing from you soon. Thank you.

PAAYAS avatar Jun 25 '24 18:06 PAAYAS

If you could assist me with training the LSC or cascade model for a different source speaker, that would be greatly appreciated.

PAAYAS avatar Jun 25 '24 18:06 PAAYAS

Hi @PAAYAS, can you try to follow the instructions in the readme here: https://github.com/unilight/seq2seq-vc/tree/main/egs/l2-arctic/lsc, and then see if you run into any problems? If you only want to convert "from" a new speaker, it's actually quite simple -- you only need to train the seq2seq model. (If you want to convert "to" a new speaker it's much more troublesome.)
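
For illustration, here is a minimal sketch of staging a new source speaker's recordings before running the recipe. The `downloads/<speaker>/wav` layout and the paths are assumptions based on CMU-ARCTIC/L2-ARCTIC conventions, not the recipe's verified structure; check the recipe's data preparation stage for the exact layout it expects.

```python
# Minimal sketch (assumed layout, not verified against the recipe):
# stage a new source speaker's ARCTIC-style recordings so the parallel
# seq2seq training stage can pair them with the target speaker (BDL).
import shutil
from pathlib import Path

src_corpus = Path("/path/to/L2-ARCTIC/ASI/wav")  # new source speaker
dest = Path("downloads/ASI/wav")                 # assumed recipe layout
dest.mkdir(parents=True, exist_ok=True)

for wav in sorted(src_corpus.glob("*.wav")):
    # ARCTIC corpora share utterance IDs (arctic_a0001, ...), which is
    # what makes parallel training between two speakers possible.
    shutil.copy(wav, dest / wav.name)

print(f"staged {len(list(dest.glob('*.wav')))} utterances for ASI")
```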

unilight avatar Jun 26 '24 01:06 unilight

Hello @unilight, I trained the LSC model with ASI as the source speaker and BDL as the target. The decoded wav files contained the voice of the TXHC speaker, along with artifacts produced during decoding, rather than the ASI speaker's voice. Could you let me know whether I should train any other models with the ASI speaker? Thank you.

PAAYAS avatar Jun 26 '24 03:06 PAAYAS

ASI_BDL_LSC.zip Here I am providing some of the results I obtained while decoding.

It seems that during the conversion stage of the LSC model, the decoder we are using is ppg_sxliu_decoder_TXHC and the vocoder is pwg_TXHC (the pretrained models). That seems to be the issue when converting from a new speaker.
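
One quick sanity check is to look at which checkpoints the conversion configuration actually points to. The config path and key names below are hypothetical placeholders, not seq2seq-vc's real schema; the point is simply that if both paths name TXHC-trained models, the output will carry TXHC's voice no matter which source speaker is fed in.

```python
# Hypothetical sanity check: the config file name and keys below are
# placeholders, not seq2seq-vc's actual schema. The idea is to confirm
# which speaker the pretrained decoder/vocoder were trained on.
import yaml  # pip install pyyaml

with open("conf/decode.yaml") as f:        # placeholder path
    cfg = yaml.safe_load(f)

for key in ("decoder_checkpoint", "vocoder_checkpoint"):  # placeholder keys
    print(key, "->", cfg.get(key, "<not set>"))
    # Paths containing "TXHC" mean those components will impose TXHC's
    # voice on the output regardless of the input speaker.
```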

PAAYAS avatar Jun 26 '24 04:06 PAAYAS

@PAAYAS The current methods (all three) cannot convert from a specific new speaker without re-training (or fine-tuning) using the data from that new speaker.
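
For context, the usual pattern is to warm-start from a pretrained checkpoint and continue training on the new speaker's parallel data. The sketch below is a generic PyTorch fine-tuning loop on toy tensors, not the repo's actual training code; the tiny GRU merely stands in for the seq2seq model, and the commented-out lines show where a real checkpoint would be loaded.

```python
# Generic PyTorch fine-tuning pattern (an illustration, not the repo's
# training loop). A small GRU stands in for the seq2seq model; in a real
# run you would load the pretrained checkpoint and train on parallel
# ASI->BDL feature pairs instead of the random tensors used here.
import torch
import torch.nn as nn

model = nn.GRU(input_size=80, hidden_size=80, batch_first=True)  # stand-in
# state = torch.load("pretrained.pth", map_location="cpu")  # warm start
# model.load_state_dict(state)                              # (real run)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR for fine-tuning
criterion = nn.L1Loss()

for step in range(3):  # a few toy steps on random "mel" features
    src = torch.randn(4, 100, 80)   # batch of source-speaker features
    tgt = torch.randn(4, 100, 80)   # parallel target-speaker features
    pred, _ = model(src)
    loss = criterion(pred, tgt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.3f}")
```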

unilight avatar Jun 26 '24 05:06 unilight

@unilight I see now. Could you assist me with how to fine-tune the models for a new speaker, or how to retrain them?

PAAYAS avatar Jun 26 '24 06:06 PAAYAS

@PAAYAS Please try to follow the instructions in the readme here: https://github.com/unilight/seq2seq-vc/tree/main/egs/l2-arctic/lsc.

unilight avatar Jun 26 '24 06:06 unilight

@unilight Thank you, I will look it over once again.

PAAYAS avatar Jun 26 '24 07:06 PAAYAS

Greetings, @unilight. As you indicated in https://github.com/unilight/seq2seq-vc/tree/main/egs/l2-arctic, you are employing the S3PRL-VC toolkit to train the non-parallel frame-based VC model. Could you please help me with training it on my own dataset?

PAAYAS avatar Jun 26 '24 07:06 PAAYAS

You can try to follow the instructions at https://github.com/unilight/s3prl-vc/tree/main/egs/TEMPLATE/a2o_vc.
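
For reference, S3PRL-VC builds on s3prl upstream representations. The snippet below follows s3prl's documented hub interface to extract features from a waveform (`pip install s3prl`); treat it as an independent illustration of what the toolkit consumes, since the a2o_vc recipe automates feature extraction for you.

```python
# Illustration of the s3prl upstream interface that S3PRL-VC builds on;
# the a2o_vc recipe automates this, so this is for understanding only.
import torch
import s3prl.hub as hub  # pip install s3prl

upstream = getattr(hub, "wav2vec2")()  # any supported upstream name
upstream.eval()

wav = torch.randn(16000)  # 1 s of 16 kHz audio, stand-in for a real file
with torch.no_grad():
    hidden_states = upstream([wav])["hidden_states"]  # one tensor per layer

print(len(hidden_states), hidden_states[-1].shape)  # last layer: (1, frames, dim)
```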

unilight avatar Jun 26 '24 07:06 unilight

Hello @unilight, thank you for your time. I was able to train the non-parallel frame-based VC model on my dataset, but the waveform produced during decoding does not seem to capture the speaker identity. Could you help me with how to train the vocoder model on any source speaker?
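
A quick, repo-independent way to quantify "not capturing speaker identity" is to compare speaker embeddings of the converted audio against a reference recording of the intended speaker. The sketch below uses resemblyzer (`pip install resemblyzer`), which is my assumption for a diagnostic and not something seq2seq-vc uses; file paths are examples.

```python
# Repo-independent diagnostic (resemblyzer is an assumption, not part of
# seq2seq-vc; paths are examples): compare the converted audio's speaker
# embedding against the intended speaker's reference recording.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
conv = encoder.embed_utterance(preprocess_wav("converted/arctic_a0001.wav"))
ref = encoder.embed_utterance(preprocess_wav("reference/arctic_a0001.wav"))

# Embeddings are L2-normalized, so a dot product is cosine similarity;
# a score far below typical same-speaker scores signals an identity mismatch.
print("cosine similarity:", float(np.dot(conv, ref)))
```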

PAAYAS avatar Jun 26 '24 18:06 PAAYAS

Greetings, @unilight. Could you please provide me with instructions on how to convert the accents of multiple source speakers to one target speaker?

PAAYAS avatar Jul 08 '24 04:07 PAAYAS

It's not quite possible with the functions provided in this repo.

unilight avatar Jul 08 '24 04:07 unilight