Fixing neonsecret fork with RUS support (newbie questions)

Open vorob1 opened this issue 3 years ago • 25 comments

**see my fork https://github.com/neonsecret/Real-Time-Voice-Cloning-Multilang it is adjusted to train the bilingual ru+en model and is easily adjustable for adding new languages

Originally posted by @neonsecret in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/707#issuecomment-1100036701**

!!! I had to create this thread here because I don't see an Issues tab on his fork page !!!

Sir, that's exactly what I'm looking for. I want to fix some incorrect voiceover in an old game, but since I can't get in touch with the actor, I want to simulate his voice.

The original tool (this repo) works, but it can't do a Russian voice: https://youtu.be/lDbpoaaBJSo Your fork gives me errors:

PS C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master> python demo_toolbox.py
Traceback (most recent call last):
  File "demo_toolbox.py", line 7, in <module>
    from utils.default_models import ensure_default_models
ModuleNotFoundError: No module named 'utils.default_models'

My knowledge of all this Python stuff is low, so I just copy and paste commands and sometimes try to understand the errors, but this looks unsolvable at my level of knowledge.

I want a simple thing: launch the GUI, point the program to WAV files with the actor's voice, enter text, and get voiceover files :)

I also tried python demo_cli.py and got lots of output, but in the end it was this:

FileNotFoundError: [Errno 2] No such file or directory: 'saved_models\rusmodeltweaked\synthesizer.pt'


Okay, I managed to start the toolbox by copying some files from the original build. Now when I add a WAV and try synthesize + vocode, I get this error:

size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([194, 512]).

vorob1 avatar May 27 '22 10:05 vorob1

Okay, I will look into it, but don't copy files from the original repo; it doesn't work that way.

neonsecret avatar May 27 '22 10:05 neonsecret

By the way, the error you had about no such file or directory means you have to download a pretrained model. I will update the README to add a link.

neonsecret avatar May 27 '22 10:05 neonsecret
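
For readers hitting the same FileNotFoundError, a small pre-flight check can show which model files are missing before the demo is launched. This is only a sketch: the three relative paths are the ones printed in the argument dumps in this thread, and the helper name `missing_models` is made up for illustration; it is not part of the fork.

```python
from pathlib import Path

# Relative paths as printed in the "Arguments:" dump in this thread.
# They are assumptions about your checkout; adjust them if yours differs.
EXPECTED_MODELS = {
    "encoder": Path("saved_models") / "default" / "encoder.pt",
    "synthesizer": Path("saved_models") / "rusmodeltweaked" / "synthesizer.pt",
    "vocoder": Path("saved_models") / "default" / "vocoder.pt",
}


def missing_models(repo_root, expected=EXPECTED_MODELS):
    """Return the names of expected model files that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name, rel in expected.items() if not (root / rel).is_file()]
```

Calling `missing_models(".")` from the repository folder returns an empty list when all three files are in place, and otherwise names the ones that still need to be downloaded.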

I've redownloaded the zip file. Now the GUI launch file is missing: there is no demo_toolbox.py.

I've tried demo_cli, but it shows the same size error I previously had in the GUI:

PS C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master> python demo_cli.py
Arguments:
    enc_model_fpath:   saved_models\default\encoder.pt
    syn_model_fpath:   saved_models\rusmodeltweaked\synthesizer.pt
    voc_model_fpath:   saved_models\default\vocoder.pt
    cpu:               False
    no_sound:          False
    seed:              None

Running a test of your configuration...

Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.

Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models\default\vocoder.pt
Testing your configuration with small inputs.
        Testing the encoder...
        Testing the synthesizer...
Trainable Parameters: 30.936M
Traceback (most recent call last):
  File "demo_cli.py", line 95, in <module>
    mels = synthesizer.synthesize_spectrograms(texts, embeds)
  File "C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master\synthesizer\models\tacotron\inference.py", line 88, in synthesize_spectrograms
    self.load()
  File "C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master\synthesizer\models\tacotron\inference.py", line 65, in load
    self.model.load(self.model_fpath)
  File "C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master\synthesizer\models\tacotron\tacotron.py", line 506, in load
    self.load_state_dict(checkpoint["model_state"])
  File "C:\Users\babud\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 1498, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Tacotron:
        size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([194, 512]).

P.S. I was downloading the pretrained models from the README (https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models), but aren't they from the English-only release? Maybe you can upload the Russian one? Thanks!

vorob1 avatar May 27 '22 19:05 vorob1

1. Delete your repository and clone it again.
2. Only the synthesizer matters for language.
3. The synthesizer I've provided might not have the best audio quality; it takes a lot of trial and error to train a good model.

neonsecret avatar May 27 '22 19:05 neonsecret
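
The size mismatch in the traceback above says that the checkpoint's `encoder.embedding.weight` has 66 rows while the model built by the fork expects 194, which fits a synthesizer file from the English-only release being loaded into the multilingual model. The sketch below shows one way to list such mismatches by comparing two name-to-shape mappings. The helper `shape_mismatches` is hypothetical, and the shapes are copied from the error message rather than read from real files.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    return {
        name: (tuple(ckpt_shape), tuple(model_shapes[name]))
        for name, ckpt_shape in checkpoint_shapes.items()
        if name in model_shapes and tuple(ckpt_shape) != tuple(model_shapes[name])
    }


# Shapes taken from the error message in this thread, not from actual files.
checkpoint_shapes = {"encoder.embedding.weight": (66, 512)}
model_shapes = {"encoder.embedding.weight": (194, 512)}

for name, (ckpt, model) in shape_mismatches(checkpoint_shapes, model_shapes).items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

With PyTorch installed, the two mappings could be built from real state dicts with `{k: tuple(v.shape) for k, v in state_dict.items()}`; the traceback suggests the checkpoint keeps its weights under the `"model_state"` key.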

(screenshot) It looks like the "here" for the synth is not a hyperlink. And demo_toolbox.py is still missing.

vorob1 avatar May 27 '22 19:05 vorob1

I checked the README file and the link was there, solved :) https://drive.google.com/file/d/1qtGH8JzoY_v3h1v_zQyTSWiAW4bWYXsU/view?usp=sharing

vorob1 avatar May 27 '22 19:05 vorob1

sadly demo_cli.py gives me this error:

PS C:\Users\babud\Downloads\RRUS> python .\demo_cli.py
Arguments:
    enc_model_fpath:   saved_models\default\encoder.pt
    syn_model_fpath:   saved_models\rusmodeltweaked\synthesizer.pt
    voc_model_fpath:   saved_models\default\vocoder.pt
    cpu:               False
    no_sound:          False
    seed:              None

Running a test of your configuration...

Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.

Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models\default\vocoder.pt
Testing your configuration with small inputs.
        Testing the encoder...
        Testing the synthesizer...
Trainable Parameters: 30.936M
Loaded synthesizer "synthesizer.pt" trained to step 68800

| Generating 1/1
Traceback (most recent call last):
  File ".\demo_cli.py", line 95, in <module>
    mels = synthesizer.synthesize_spectrograms(texts, embeds)
  File "C:\Users\babud\Downloads\RRUS\synthesizer\models\tacotron\inference.py", line 109, in synthesize_spectrograms
    text_lens = [len(text) for text in batch]
  File "C:\Users\babud\Downloads\RRUS\synthesizer\models\tacotron\inference.py", line 109, in <listcomp>
    text_lens = [len(text) for text in batch]
TypeError: object of type 'int' has no len()

vorob1 avatar May 27 '22 19:05 vorob1

In line 109: convert it to a string.

   self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)

try converting it to (len(str

len() function cannot be called with an integer, try dir or range if it does not work.

TrycsPublic avatar May 30 '22 11:05 TrycsPublic

No, it's not that. I'm working on the project now; I will reply in this issue when I'm done and have run all the tests.

neonsecret avatar May 30 '22 11:05 neonsecret
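
For context on the TypeError itself: `[len(text) for text in batch]` fails when the items of `batch` are plain integers, which happens if a single encoded sentence (a flat list of symbol ids) is passed where a list of sentences is expected. The thread does not say what the actual bug in demo_cli was, so the sketch below only illustrates this one possible cause; `as_batch` is a hypothetical helper, not code from the fork.

```python
def as_batch(texts):
    """Return a list of sequences, wrapping a single flat sequence of ids if needed.

    Accepts either one encoded sentence ([3, 7, 1]) or a list of encoded
    sentences ([[3, 7, 1], [4, 2]]).
    """
    if texts and all(isinstance(t, int) for t in texts):
        return [list(texts)]
    return [list(t) for t in texts]


single = [3, 7, 1]  # one encoded sentence: iterating over it yields ints
# [len(t) for t in single] would raise: object of type 'int' has no len()
text_lens = [len(t) for t in as_batch(single)]
```

Wrapping `len(...)` in `str(...)`, as suggested above, would silence the error but measure the wrong thing, since the length that matters is the number of symbols in each sentence.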

Running a test of your configuration...

Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.

It's recommended to use a cloud GPU provider; training on a laptop is slow, and slower still on a compact one.

TrycsPublic avatar May 31 '22 23:05 TrycsPublic

Did you add all the russian letters in the newly cloned repo?

TrycsPublic avatar May 31 '22 23:05 TrycsPublic

What do you mean? Where do I need to add Russian letters, and why?

vorob1 avatar Jun 03 '22 08:06 vorob1
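
Some background on why the symbol set matters: a text-to-speech synthesizer of this kind looks up one embedding row per input symbol, so the number of rows in `encoder.embedding.weight` (66 versus 194 in the errors above) follows the size of the symbol list the model was built with. The sketch below illustrates that relationship with tiny made-up symbol sets. The fork's real symbol list is not shown in this thread; the few phoneme-style tokens are borrowed from the CLI output posted later, and both helper names are hypothetical.

```python
def build_symbol_table(symbols):
    """Map each symbol to its row index in the embedding table."""
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(tokens, table):
    """Convert tokens to ids, raising a clear error for symbols the table lacks."""
    unknown = sorted({t for t in tokens if t not in table})
    if unknown:
        raise KeyError(f"symbols missing from the table: {unknown}")
    return [table[t] for t in tokens]


# Hypothetical, deliberately tiny symbol sets for illustration only.
english_only = list("abcdefghijklmnopqrstuvwxyz ,.")
with_russian = english_only + ["e1", "o0", "sh", "dj", "nj", "lj"]

# One embedding row per symbol: a larger symbol set means a larger table.
print(len(build_symbol_table(english_only)), len(build_symbol_table(with_russian)))
```

A checkpoint trained with the smaller set cannot be loaded into a model built for the larger one, which is the mismatch reported earlier in the thread.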

Okay, the README is updated, the files are updated, and the model is trained. Everything should work fine except for the audio quality; the model still needs fine-tuning. Feel free to delete everything from the directory and clone from scratch.

neonsecret avatar Jun 04 '22 19:06 neonsecret

demo_cli is also updated; I found the bug.

neonsecret avatar Jun 05 '22 06:06 neonsecret

sound quality is very poor

davidhhh123 avatar Jun 05 '22 10:06 davidhhh123

sound quality is very poor

Yes, the model isn't trained that well yet. I'm working on it.

neonsecret avatar Jun 05 '22 10:06 neonsecret

And can it synthesize Russian speech?

davidhhh123 avatar Jun 05 '22 10:06 davidhhh123

Sadly it still gives me this error:

PS C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master> python .\demo_cli.py
Arguments:
    enc_model_fpath:   saved_models\default\encoder.pt
    syn_model_fpath:   saved_models\default\synthesizer.pt
    voc_model_fpath:   saved_models\default\vocoder.pt
    cpu:               False
    no_sound:          False
    seed:              None

Running a test of your configuration...

Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.

Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models\default\vocoder.pt
Testing your configuration with small inputs.
        Testing the encoder...
        Testing the synthesizer...
Trainable Parameters: 30.936M
Traceback (most recent call last):
  File ".\demo_cli.py", line 95, in <module>
    mels = synthesizer.synthesize_spectrograms(texts, embeds)
  File "C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master\synthesizer\models\tacotron_tweaked\inference.py", line 87, in synthesize_spectrograms
    self.load()
  File "C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master\synthesizer\models\tacotron_tweaked\inference.py", line 65, in load
    self._model.load(self.model_fpath)
  File "C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master\synthesizer\models\tacotron_tweaked\tacotron.py", line 499, in load
    self.load_state_dict(checkpoint["model_state"])
  File "C:\Users\babud\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 1498, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Tacotron:
        size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([194, 512]).

vorob1 avatar Jun 06 '22 08:06 vorob1

You downloaded the wrong model. Be careful: download the synthesizer from the Google Drive link in the README and overwrite the file. Also, did you clone the repository from scratch after the recent fixes?

neonsecret avatar Jun 06 '22 10:06 neonsecret

And put it into the "default" folder inside the "saved_models" folder.

neonsecret avatar Jun 06 '22 10:06 neonsecret

Sorry, for some reason the synth was from the original release, even though I had downloaded the Russian one. Now there are no errors.

But it's not reading the Russian letters for some reason. This is what I get: https://www.dropbox.com/s/72xgcr17oisrzln/demo_output_00.wav?dl=0

Reference voice: enter an audio filepath of a voice to be cloned (mp3, wav, m4a, flac, ...):
C:\Games\Thief Voice\WEBCALL\Garrett\english\gar0112.wav
Loaded file succesfully
Created the embedding
Write a sentence (+-20 words) to be synthesized:
Это хороший день для ограбления, попробуем
['e1', 't', 'o0', '', 'h', 'o0', 'r', 'o1', 'sh', 'i0', 'j', '', 'dj', 'e1', 'nj', '', 'd', 'lj', 'a1', '', 'o0', 'g', 'r', 'a0', 'b', 'lj', 'e1', 'nj', 'i0', 'j', 'a0', '', ',', 'p', 'o0', 'p', 'r', 'o0', 'b', 'u1', 'j', 'e0', 'm', '']

| Generating 1/1

Done.

Created the mel spectrogram
Synthesizing the waveform:
{| ████████████████ 437000/441600 | Batch Size: 46 | Gen Rate: 30.0kHz | }float64

Saved output as demo_output_00.wav

vorob1 avatar Jun 06 '22 11:06 vorob1

It's working as expected; the audio quality is just too low for now. Also, the sample audio snippet you provided might be too short to capture the voice. Just wait for updates; the quality will improve with time.

neonsecret avatar Jun 06 '22 11:06 neonsecret

I tried the recent version and now it produces proper speech, thanks!

The problem is that out of 10 words in a sentence it speaks only 1-2. Is there any chance this can be fixed so that longer sequences can be generated?

vorob1 avatar Jul 06 '22 13:07 vorob1

I will look into it; it's an attention problem of some kind.

neonsecret avatar Jul 06 '22 14:07 neonsecret
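
Until the attention problem is fixed in the model, one possible workaround is to feed the synthesizer short pieces of text and join the resulting audio. Nothing in this thread confirms that this helps with the fork's model, so treat the sketch below as an assumption to try, not a fix. The helper `split_into_chunks` and its `max_words` parameter are made up for illustration.

```python
import re


def split_into_chunks(text, max_words=5):
    """Split text at punctuation, then cap each piece at max_words words."""
    chunks = []
    for piece in re.split(r"[,.;:!?]+", text):
        words = piece.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


sentence = "Это хороший день для ограбления, попробуем"
for chunk in split_into_chunks(sentence, max_words=3):
    print(chunk)
```

Each chunk would then be synthesized separately and the waveforms concatenated, at the cost of less natural intonation across chunk boundaries.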

@neonsecret the pretrained synth model is gone from Google Drive; any chance you could reupload it?

chayleaf avatar Sep 15 '22 10:09 chayleaf

@neonsecret Hey! The problem that only a few words out of a dozen are voiced is still relevant. Is there a solution for this?

bitnooob avatar Jan 31 '23 01:01 bitnooob