VoiceCraft
Added gradio app
Replacing MFA with whisper made it faster than the jupyter notebook version. It supports TTS, long-form TTS and speech editing. Everything is on a single page and very simple to use.
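Roughly, the single page wires a handful of gradio components to one callback. The sketch below is only illustrative (component names, modes and the run() body are placeholders, not the actual gradio_app.py code):

```python
import gradio as gr

def run(mode, audio_path, transcript, target_text):
    # placeholder: the real app loads the models, transcribes with whisper,
    # then does TTS / long-form TTS / speech editing depending on `mode`;
    # here we just echo the input audio back
    return audio_path

with gr.Blocks() as demo:
    mode = gr.Radio(["TTS", "Long TTS", "Edit"], value="TTS", label="Mode")
    input_audio = gr.Audio(label="Input Audio", type="filepath")
    transcript = gr.Textbox(label="Original transcript")
    target_text = gr.Textbox(label="Text to generate")
    output_audio = gr.Audio(label="Output Audio")
    gr.Button("Run").click(run, [mode, input_audio, transcript, target_text], output_audio)

demo.launch()
```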
Thank you so much for this. I tried to change input_audio.change to input_audio.upload in gradio_app.py so that it supports user-uploaded audio, but after hitting the Run button it then takes forever. I tried debugging but couldn't really figure it out. Could you help with that? (With the default audio everything works just fine; with uploaded audio the 'Transcribe' button also works just fine.)
Please provide more details (audio, settings). I changed input_audio.change to input_audio.upload and tried several files, but could not reproduce the issue.
It worked for me but I'm not sure why it was locked to only the demo audio. I just made it editable and then the UI works.
> It worked for me but I'm not sure why it was locked to only the demo audio. I just made it editable and then the UI works.
By making it editable, what did you change?
```python
with gr.Row():
    with gr.Column(scale=2):
        input_audio = gr.Audio(value="./demo/84_121550_000074_000000.wav", label="Input Audio", type="filepath", interactive=False)
        with gr.Group():
```
I just changed interactive to True.
Are you sure? You probably also need to change input_audio.change to input_audio.upload, right? Otherwise it will give an error when you clear the original audio.
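Put together, the two changes being discussed would look roughly like this; a minimal sketch with a hypothetical on_audio_uploaded handler, not the actual callbacks in gradio_app.py:

```python
import gradio as gr

def on_audio_uploaded(audio_path):
    # hypothetical handler: clear the transcript when a new file is uploaded
    return ""

with gr.Blocks() as demo:
    # interactive=True lets the user upload/replace the audio instead of
    # locking the component to the bundled demo file
    input_audio = gr.Audio(value="./demo/84_121550_000074_000000.wav",
                           label="Input Audio", type="filepath", interactive=True)
    transcript = gr.Textbox(label="Transcript")
    # .upload fires only when a new file is uploaded; .change also fires when
    # the component is cleared, which is what triggers the error mentioned above
    input_audio.upload(on_audio_uploaded, inputs=input_audio, outputs=transcript)

demo.launch()
```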
It worked like this. I didn't check the console; sometimes there are exceptions, and usually reloading fixes it. I was using it all day yesterday after making it listen on more than just localhost. This UI doesn't have the seed implemented though, and for some reason it has an old torch version in the requirements. Previously I just turned the notebooks into python scripts. Very thankful to not have to use MFA, it is a pain to make it work.
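For reference, making it listen on more than just localhost comes down to the launch call; a minimal sketch, assuming the app ends with a demo.launch():

```python
# bind to all interfaces instead of 127.0.0.1 so other machines on the
# network can reach the UI; 7860 is the gradio default port
demo.launch(server_name="0.0.0.0", server_port=7860)
```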
Ah, I see. I unlocked the audio only after the models are loaded, but it was locking again after a page refresh; now it's fixed. Also fixed the errors when you clear the original audio, added seed support, and improved the demo and UI.
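Seed support boils down to a helper along these lines; a sketch only, the actual function and where it is called in gradio_app.py may differ:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # make generation reproducible across python, numpy and torch
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```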
New version is great. The voices sound a lot better. The only thing is that the output includes a bit of the last word of the prompt it's continuing from, so I have been cutting it out.
I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well?
https://github.com/Sewlell/VoiceCraft-gradio-colab
> New version is great. The voices sound a lot better. The only thing is that the output includes a bit of the last word of the prompt it's continuing from, so I have been cutting it out.
The reason for that is that Whisper's timestamps are not very accurate (it can cut a word in half). Forced alignment (i.e. getting timestamps) is a solved problem and MFA does a perfect job (but it's slow and in some cases difficult to install), so I'm still figuring out an equally accurate replacement.
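For context, the timestamps in question are the word-level ones openai-whisper can return during transcription; a minimal sketch (model size and file path are placeholders):

```python
import whisper

model = whisper.load_model("base.en")
result = model.transcribe("input.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        # word boundaries from whisper; if these are off, the cut point can
        # land in the middle of a word, which is the artifact described above
        print(word["word"], word["start"], word["end"])
```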
Added whisperX; it's more precise, faster, and supports forced alignment.
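The whisperX flow is transcription followed by a forced-alignment pass over the same audio; a sketch following the whisperX README (model name, path and exact arguments may differ between versions):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("input.wav")

# transcribe first
model = whisperx.load_model("base.en", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# then force-align the transcript for much more precise word timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    for word in segment["words"]:
        print(word["word"], word["start"], word["end"])
```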
whisperX is missing some arguments so I haven't been able to try it. Using larger whisper models also helped with the timestamps. I guess the only other foible I noticed is that the audio is a little quieter than the source; I have to open it in Audacity and increase the gain.
Plus you don't want this stuff:
```python
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```
No need to re-arrange people's multi-GPU systems or force-select their GPU 0. If you have only one GPU it will be found as long as you have CUDA installed. I don't think torch will select an iGPU or something in place of CUDA either.
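In other words, just let torch pick the device instead of setting those environment variables; a minimal sketch:

```python
import torch

# no need to touch CUDA_DEVICE_ORDER / CUDA_VISIBLE_DEVICES; torch uses the
# default CUDA device if one is available and falls back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
```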
Thanks for the amazing work! One thing I realized is that once you hit Run, if you want to change something, e.g. upload a different audio file or change the seed, you need to refresh the page, otherwise running the models will just hang. Is that expected?
> Thanks for the amazing work! One thing I realized is that once you hit Run, if you want to change something, e.g. upload a different audio file or change the seed, you need to refresh the page, otherwise running the models will just hang. Is that expected?
That never happened in my environment; try running it with debug logging to see what went wrong.
I got it all working; I had to use the correct version of whisperX. The one from pip didn't work, so I had to use the git version. Sometimes I get a bug that jumbles part of the prompt, and then it says phonemizer: words count mismatch on xxx% of lines; the copied part of the prompt loses all the spaces. I will retest after https://github.com/jasonppy/VoiceCraft/pull/54/commits/6f71fa65fb8d6efaf54cde474009e9d78bebfe94 and see if it still does it.
edit: I still get the words count mismatch phonemizer warning, but I no longer get a scrambled transcript. The best model for timestamps is medium.en. Also not sure if it works better with or without alignment, it's hard to tell.
> words count mismatch phonemizer warning
The 'words count mismatch' phonemizer warning is completely fine.
I can take a look at this.
BTW, I had already used the new model this morning. The output is fairly similar. I have not tried with fewer batches. Any ideas on why the output is sometimes really quiet, even if the source audio is loud?
> BTW, I had already used the new model this morning. The output is fairly similar. I have not tried with fewer batches. Any ideas on why the output is sometimes really quiet, even if the source audio is loud?
The new model should work well with small batch sizes, and therefore requires less VRAM and inference time.
But I'm running a larger-scale job to support longer utterances, which will take longer to finish.
It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
> The new model should work well with small batch sizes, and therefore requires less VRAM and inference time.
It still gives better results with 4. It gave OK results at 1 and 2.
> It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
This might happen if gradio can't access /tmp/gradio
> It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
> This might happen if gradio can't access /tmp/gradio
I have changed $GRADIO_TEMP_DIR and $AUDIOCRAFT_DORA_DIR to a local directory, but it doesn't work.
> I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well?
> https://github.com/Sewlell/VoiceCraft-gradio-colab
Thanks! I have tested @zuev-stepan's fork and your colab, and I'm ready to merge. Could you push to this PR so I can incorporate your contribution?
> I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well? https://github.com/Sewlell/VoiceCraft-gradio-colab
> Thanks! I have tested @zuev-stepan's fork and your colab, and I'm ready to merge. Could you push to this PR so I can incorporate your contribution?
Done
> I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well? https://github.com/Sewlell/VoiceCraft-gradio-colab
> Thanks! I have tested @zuev-stepan's fork and your colab, and I'm ready to merge. Could you push to this PR so I can incorporate your contribution?
I added the colab notebook to my repo. After the merge you should probably change the link to the colab notebook in README.md, and the link to the repo and the paths in voicecraft-gradio-colab.ipynb.
> It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
> This might happen if gradio can't access /tmp/gradio
> I have changed $GRADIO_TEMP_DIR and $AUDIOCRAFT_DORA_DIR to a local directory, but it doesn't work.
I finally solved this problem... Before, my $GRADIO_TEMP_DIR and $AUDIOCRAFT_DORA_DIR were /xxx/.cache/xxxx, then I changed .cache to tmp, and it works.
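In other words, both temp dirs have to point at a location the app can actually write to before gradio starts; a minimal sketch of the same fix done from Python (the paths below are placeholders, matching the /xxx/ pattern above):

```python
import os

# point both temp dirs at a writable location before launching the app
os.environ["GRADIO_TEMP_DIR"] = "/xxx/tmp/gradio"
os.environ["AUDIOCRAFT_DORA_DIR"] = "/xxx/tmp/audiocraft"
```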
