Real-Time-Voice-Cloning
Fixing neonsecret fork with RUS support (newbie questions)
**see my fork https://github.com/neonsecret/Real-Time-Voice-Cloning-Multilang it is adjusted to train the bilingual ru+en model and is easily adjustable for adding new languages
Originally posted by @neonsecret in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/707#issuecomment-1100036701**
!!! Had to create a thread here because I don't see an Issues tab on his fork page !!!
Sir, that's exactly what I'm looking for. I want to correct some wrong voiceover in an old game, but since I can't get in touch with the actor, I want to simulate his voice.
The original tool works, but it can't do a Russian voice: https://youtu.be/lDbpoaaBJSo Your fork gives me errors:
PS C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master> python demo_toolbox.py
Traceback (most recent call last):
File "demo_toolbox.py", line 7, in
My knowledge of all this Python stuff is low, so I just copy-paste commands and sometimes try to understand the errors, but this looks unsolvable at my level of knowledge.
I want a simple thing: launch the GUI, point the program to WAV files with the actor's voice, enter text, and get voiceover files :)
I also tried python demo_cli.py and got lots of output, but in the end it was this:
FileNotFoundError: [Errno 2] No such file or directory: 'saved_models\rusmodeltweaked\synthesizer.pt'
Okay, I managed to start the toolbox by copying some files from the original build. Now when I add a WAV and try synthesize + vocode, I get this error:
size mismatch for encoder.embedding.weight: copying a param with shape torch.Size([66, 512]) from checkpoint, the shape in current model is torch.Size([194, 512]).
Okay, I will look into it, but don't copy files from the original repo; it doesn't work that way.
By the way, the "no such file or directory" error means you have to download a pretrained model. I will update the README to add a link.
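For readers hitting the same `size mismatch`: the message says the embedding stored in the checkpoint has 66 rows while the model built by the fork expects 194, which is what you would see when the checkpoint and the model code come from different releases. Below is a minimal sketch for listing such mismatches. It assumes the parameter shapes have already been collected into plain dicts (with PyTorch, something like `{k: tuple(v.shape) for k, v in state_dict.items()}`); the function name `find_shape_mismatches` is hypothetical and not part of either repository.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters
    that exist in both dicts but have different shapes."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes taken from the error message in this thread.
checkpoint_shapes = {"encoder.embedding.weight": (66, 512)}
model_shapes = {"encoder.embedding.weight": (194, 512)}

for name, (ckpt, model) in find_shape_mismatches(checkpoint_shapes, model_shapes).items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

If the only mismatch is the embedding's first dimension, the checkpoint was most likely trained with a different symbol set than the code expects, so the fix is a matching checkpoint rather than copied files.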
I've redownloaded the zip file. Now the GUI launch file is missing: there is no demo_toolbox.py.
I've tried demo_cli, but it shows the same size error I previously had in the GUI:
PS C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master> python demo_cli.py
Arguments:
enc_model_fpath: saved_models\default\encoder.pt
syn_model_fpath: saved_models\rusmodeltweaked\synthesizer.pt
voc_model_fpath: saved_models\default\vocoder.pt
cpu: False
no_sound: False
seed: None
Running a test of your configuration...
Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.
Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models\default\vocoder.pt
Testing your configuration with small inputs.
Testing the encoder...
Testing the synthesizer...
Trainable Parameters: 30.936M
Traceback (most recent call last):
File "demo_cli.py", line 95, in
P.S. I was downloading pretrained models from the README (https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models), but aren't they from the English-only release? Maybe you can upload a Russian one? Thanks!
1. Delete your repository and clone it again.
2. Only the synthesizer matters for language.
3. The synthesizer I've provided might not have the best audio quality; it takes a lot of trial and error to train a good model.
It looks like "here" for the synthesizer is not a hyperlink. And demo_toolbox.py is still missing.
I checked the README file and the link was there, solved :) https://drive.google.com/file/d/1qtGH8JzoY_v3h1v_zQyTSWiAW4bWYXsU/view?usp=sharing
Sadly, demo_cli.py gives me this error:
PS C:\Users\babud\Downloads\RRUS> python .\demo_cli.py
Arguments:
enc_model_fpath: saved_models\default\encoder.pt
syn_model_fpath: saved_models\rusmodeltweaked\synthesizer.pt
voc_model_fpath: saved_models\default\vocoder.pt
cpu: False
no_sound: False
seed: None
Running a test of your configuration...
Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.
Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models\default\vocoder.pt
Testing your configuration with small inputs.
Testing the encoder...
Testing the synthesizer...
Trainable Parameters: 30.936M
Loaded synthesizer "synthesizer.pt" trained to step 68800
| Generating 1/1
Traceback (most recent call last):
File ".\demo_cli.py", line 95, in <module>
mels = synthesizer.synthesize_spectrograms(texts, embeds)
File "C:\Users\babud\Downloads\RRUS\synthesizer\models\tacotron\inference.py", line 109, in synthesize_spectrograms
text_lens = [len(text) for text in batch]
File "C:\Users\babud\Downloads\RRUS\synthesizer\models\tacotron\inference.py", line 109, in <listcomp>
text_lens = [len(text) for text in batch]
TypeError: object of type 'int' has no len()
In line 109: convert it to a string.
self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
try converting it to (len(str
len() function cannot be called with an integer, try dir or range if it does not work.
No, it's not that. I'm working on the project now; I will reply in this issue when I'm done and have run all the tests.
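For context on the `TypeError` in the traceback: `len()` is being applied to each element of `batch`, so the error appears whenever an element is a bare integer instead of a sequence. The thread does not say what the actual bug in the fork was, so the sketch below only reproduces the symptom with made-up data; `text_lengths` is a hypothetical wrapper around the expression from line 109 of inference.py.

```python
def text_lengths(batch):
    # Same expression as the failing line in the traceback.
    return [len(text) for text in batch]


# Expected input: a list of sequences, one list of symbol ids per sentence.
good_batch = [[12, 7, 33], [5, 9]]
print(text_lengths(good_batch))  # [3, 2]

# A flat list of ids (one level of nesting missing) triggers the error.
bad_batch = [12, 7, 33]
try:
    text_lengths(bad_batch)
except TypeError as err:
    print(err)  # object of type 'int' has no len()
```

Wrapping the value in `str()` would silence the error but would measure the digits of an id rather than the length of a sentence, which is consistent with the maintainer's reply that this is not the fix.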
Running a test of your configuration... Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.
It's recommended to use a cloud GPU provider; training on a laptop is slow, and slower still with a compact one.
Did you add all the Russian letters in the newly cloned repo?
What do you mean? Where do I need to add Russian letters, and why?
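On the "why": a text-to-speech synthesizer usually maps every input symbol to an integer id and looks that id up in an embedding table, so the table has one row per known symbol. The sketch below illustrates only that idea. The symbol lists and the `text_to_ids` helper are hypothetical and are not the fork's actual code, which may well work on phonemes rather than letters.

```python
import string

# Hypothetical symbol sets, for illustration only.
english_symbols = ["_", " "] + list(string.ascii_lowercase)
russian_letters = [chr(c) for c in range(ord("а"), ord("я") + 1)] + ["ё"]
bilingual_symbols = english_symbols + russian_letters


def text_to_ids(text, symbols):
    """Map each known character to its index; unknown characters are dropped."""
    index = {s: i for i, s in enumerate(symbols)}
    return [index[ch] for ch in text.lower() if ch in index]


# The embedding table needs one row per symbol, so the two sets differ in size.
print(len(english_symbols), len(bilingual_symbols))

print(text_to_ids("Это", english_symbols))    # empty: no Cyrillic symbols known
print(text_to_ids("Это", bilingual_symbols))  # one id per letter
```

With a symbol set that lacks Cyrillic, Russian text is either dropped or rejected before it reaches the model, and a model trained on a different symbol count cannot load the other's weights.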
Okay: README updated, files updated, model trained. Everything should work fine except for the audio quality; the model still needs fine-tuning. Feel free to delete everything from the directory and clone from scratch.
demo_cli is also updated; I found the bug.
sound quality is very poor
Yes, the model isn't trained that well yet; I'm working on it.
And can it synthesize Russian speech?
Sadly it still gives me this error:
PS C:\Users\babud\Downloads\Real-Time-Voice-Cloning-Multilang-master> python .\demo_cli.py
Arguments:
enc_model_fpath: saved_models\default\encoder.pt
syn_model_fpath: saved_models\default\synthesizer.pt
voc_model_fpath: saved_models\default\vocoder.pt
cpu: False
no_sound: False
seed: None
Running a test of your configuration...
Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 2080 with Max-Q Design) of compute capability 7.5 with 8.6Gb total memory.
Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models\default\vocoder.pt
Testing your configuration with small inputs.
Testing the encoder...
Testing the synthesizer...
Trainable Parameters: 30.936M
Traceback (most recent call last):
File ".\demo_cli.py", line 95, in
You downloaded the wrong model. Be careful: download the synthesizer from the Google Drive link in the README and overwrite the file. Also, did you clone the repository from scratch after the recent fixes?
And put it into the "default" folder inside the "saved_models" folder.
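A quick way to confirm the files are where demo_cli.py looks for them is sketched below. The file names and the `saved_models/default` folder come from the "Arguments" output in this thread; the helper `missing_models` is hypothetical. Note that this only checks that the files exist, not that the synthesizer is the Russian one.

```python
from pathlib import Path

# File names as they appear in the demo_cli.py "Arguments" output.
EXPECTED_MODELS = ("encoder.pt", "synthesizer.pt", "vocoder.pt")


def missing_models(repo_root, model_dir="saved_models/default"):
    """Return the expected model files that are absent under repo_root."""
    base = Path(repo_root) / model_dir
    return [name for name in EXPECTED_MODELS if not (base / name).is_file()]


if __name__ == "__main__":
    # Run from the root of the cloned repository.
    missing = missing_models(".")
    print("All model files found." if not missing else f"Missing: {missing}")
```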
Sorry, for some reason the synthesizer was from the original release, even though I had downloaded the Russian one. Now there are no errors.
But it's not reading the Russian letters for some reason. This is what I get: https://www.dropbox.com/s/72xgcr17oisrzln/demo_output_00.wav?dl=0
Reference voice: enter an audio filepath of a voice to be cloned (mp3, wav, m4a, flac, ...):
C:\Games\Thief Voice\WEBCALL\Garrett\english\gar0112.wav
Loaded file succesfully
Created the embedding
Write a sentence (+-20 words) to be synthesized:
Это хороший день для ограбления, попробуем
['e1', 't', 'o0', '
| Generating 1/1
Done.
Created the mel spectrogram
Synthesizing the waveform:
{| ████████████████ 437000/441600 | Batch Size: 46 | Gen Rate: 30.0kHz | }float64
Saved output as demo_output_00.wav
It's working as expected; the audio quality is just too low for now. Also, the sample audio snippet you provided might be too short to capture the voice. Just wait for updates; the quality will improve with time.
Tried the recent version and now it produces proper speech, thanks!
The problem is that out of 10 words in a sentence it speaks only 1-2. Any chance this can be fixed so that longer sequences can be generated?
I will look into it; it's an attention problem of some kind.
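Until the attention issue is addressed in the model, one possible workaround (not confirmed anywhere in this thread) is to synthesize short pieces separately and join the resulting audio. The traceback earlier shows `synthesize_spectrograms(texts, embeds)` taking a list of texts, so each chunk could be passed as its own entry. The splitter below is a minimal sketch; `split_into_chunks` and the chunk size are assumptions, and chunk boundaries may sound unnatural.

```python
def split_into_chunks(text, max_words=4):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Sentence used earlier in this thread.
sentence = "Это хороший день для ограбления, попробуем"
for chunk in split_into_chunks(sentence, max_words=3):
    print(chunk)
```

Splitting on punctuation instead of a fixed word count would likely give more natural pauses, at the cost of uneven chunk lengths.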
@neonsecret The pretrained synth model is gone from Google Drive. Any chance you could re-upload it?
@neonsecret Hey! The problem where only a few words out of a dozen are voiced is still relevant. Is there a solution for this?