VoiceCraft
Added gradio app
Replacing MFA with whisper made it faster than the jupyter notebook version. It supports TTS, long-form TTS and speech editing. Everything is on a single page and very simple to use.
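Roughly, the single page wires a handful of gradio components to one callback. The sketch below is only illustrative (component names, modes and the run() body are placeholders, not the actual gradio_app.py code):

```python
import gradio as gr

def run(mode, audio_path, transcript, target_text):
    # placeholder: the real app loads the models, transcribes with whisper,
    # then does TTS / long-form TTS / speech editing depending on `mode`;
    # here we just echo the input audio back
    return audio_path

with gr.Blocks() as demo:
    mode = gr.Radio(["TTS", "Long TTS", "Edit"], value="TTS", label="Mode")
    input_audio = gr.Audio(label="Input Audio", type="filepath")
    transcript = gr.Textbox(label="Original transcript")
    target_text = gr.Textbox(label="Text to generate")
    output_audio = gr.Audio(label="Output Audio")
    gr.Button("Run").click(run, [mode, input_audio, transcript, target_text], output_audio)

demo.launch()
```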
Thank you so much for this. I tried to change input_audio.change to input_audio.upload in gradio_app.py so that it supports user-uploaded audio, but after hitting the Run button it then takes forever. I tried debugging but couldn't really figure it out. Could you help with that? (With the default audio everything works just fine; with uploaded audio the 'Transcribe' button also works just fine.)
Please provide more details (audio, settings). I changed input_audio.change to input_audio.upload and tried several files, but could not reproduce the issue.
It worked for me but I'm not sure why it was locked to only the demo audio. I just made it editable and then the UI works.
> It worked for me but I'm not sure why it was locked to only the demo audio. I just made it editable and then the UI works.
By making it editable, what did you change?
```python
with gr.Row():
    with gr.Column(scale=2):
        input_audio = gr.Audio(value="./demo/84_121550_000074_000000.wav", label="Input Audio", type="filepath", interactive=False)
        with gr.Group():
```
I just changed interactive to True.
Are you sure? You probably also need to change input_audio.change to input_audio.upload, right? Otherwise it will give an error when you clear the original audio.
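Put together, the two changes being discussed would look roughly like this; a minimal sketch with a hypothetical on_audio_uploaded handler, not the actual callbacks in gradio_app.py:

```python
import gradio as gr

def on_audio_uploaded(audio_path):
    # hypothetical handler: clear the transcript when a new file is uploaded
    return ""

with gr.Blocks() as demo:
    # interactive=True lets the user upload/replace the audio instead of
    # locking the component to the bundled demo file
    input_audio = gr.Audio(value="./demo/84_121550_000074_000000.wav",
                           label="Input Audio", type="filepath", interactive=True)
    transcript = gr.Textbox(label="Transcript")
    # .upload fires only when a new file is uploaded; .change also fires when
    # the component is cleared, which is what triggers the error mentioned above
    input_audio.upload(on_audio_uploaded, inputs=input_audio, outputs=transcript)

demo.launch()
```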
It worked like this. I didn't check the console; sometimes there are exceptions, and usually reloading fixes it. I was using it all day yesterday after making it listen on more than just localhost. This UI doesn't have the seed implemented though, and for some reason it has an old torch version in the requirements. Previously I just turned the notebooks into python scripts. Very thankful to not have to use MFA, it is a pain to make it work.
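For reference, making it listen on more than just localhost comes down to the launch call; a minimal sketch, assuming the app ends with a demo.launch():

```python
# bind to all interfaces instead of 127.0.0.1 so other machines on the
# network can reach the UI; 7860 is the gradio default port
demo.launch(server_name="0.0.0.0", server_port=7860)
```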
Ah, I see. I unlocked the audio only after the models are loaded, but it was locking again after a page refresh; now it's fixed. Also fixed the errors when you clear the original audio, added seed support, and improved the demo and UI.
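Seed support boils down to a helper along these lines; a sketch only, the actual function and where it is called in gradio_app.py may differ:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # make generation reproducible across python, numpy and torch
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```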
New version is great. The voices sound a lot better. The only thing is that the output includes a bit of the last word of the prompt it's continuing from, so I have been cutting it out.
I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well?
https://github.com/Sewlell/VoiceCraft-gradio-colab
> New version is great. The voices sound a lot better. The only thing is that the output includes a bit of the last word of the prompt it's continuing from, so I have been cutting it out.
The reason for that is that Whisper's timestamps are not very accurate (it can cut a word in half). Forced alignment (i.e. getting timestamps) is a solved problem and MFA does a perfect job (but it's slow and in some cases difficult to install), so I'm still figuring out an equally accurate replacement.
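For context, the timestamps in question are the word-level ones openai-whisper can return during transcription; a minimal sketch (model size and file path are placeholders):

```python
import whisper

model = whisper.load_model("base.en")
result = model.transcribe("input.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        # word boundaries from whisper; if these are off, the cut point can
        # land in the middle of a word, which is the artifact described above
        print(word["word"], word["start"], word["end"])
```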
Added whisperX; it's more precise, faster, and supports forced alignment.
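The whisperX flow is transcription followed by a forced-alignment pass over the same audio; a sketch following the whisperX README (model name, path and exact arguments may differ between versions):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("input.wav")

# transcribe first
model = whisperx.load_model("base.en", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# then force-align the transcript for much more precise word timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    for word in segment["words"]:
        print(word["word"], word["start"], word["end"])
```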
whisperX is missing some arguments so I haven't been able to try it. Using larger whisper models also helped with the timestamps. I guess the only other foible I noticed is that the audio is a little quieter than the source; I have to open it in Audacity and increase the gain.
Plus you don't want this stuff:
```python
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```
No need to re-arrange people's multi-GPU systems or force-select their GPU 0. If you have only one GPU it will be found as long as you have CUDA installed. I don't think torch will select an iGPU or something in place of CUDA either.
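In other words, just let torch pick the device instead of setting those environment variables; a minimal sketch:

```python
import torch

# no need to touch CUDA_DEVICE_ORDER / CUDA_VISIBLE_DEVICES; torch uses the
# default CUDA device if one is available and falls back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
```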
Thanks for the amazing work! One thing I realized is that once you hit Run, if you want to change something, e.g. upload a different audio file or change the seed, you need to refresh the page, otherwise running the models will just hang. Is that expected?
> Thanks for the amazing work! One thing I realized is that once you hit Run, if you want to change something, e.g. upload a different audio file or change the seed, you need to refresh the page, otherwise running the models will just hang. Is that expected?
That never happened in my environment; try running it with debug logging to see what went wrong.
I got it all working; I had to use the correct version of whisperX. The one from pip didn't work, so I had to use the git version. Sometimes I get a bug that jumbles part of the prompt, and then it says phonemizer: words count mismatch on xxx% of lines; the copied part of the prompt loses all the spaces. I will retest after https://github.com/jasonppy/VoiceCraft/pull/54/commits/6f71fa65fb8d6efaf54cde474009e9d78bebfe94 and see if it still does it.
edit: I still get the words count mismatch phonemizer warning, but I no longer get a scrambled transcript. The best model for timestamps is medium.en. Also not sure if it works better with or without alignment, it's hard to tell.
> words count mismatch phonemizer warning
The 'words count mismatch' phonemizer warning is completely fine.
I can take a look at this.
BTW, I had already used the new model this morning. The output is fairly similar. I have not tried with fewer batches. Any ideas on why the output is sometimes really quiet, even if the source audio is loud?
> BTW, I had already used the new model this morning. The output is fairly similar. I have not tried with fewer batches. Any ideas on why the output is sometimes really quiet, even if the source audio is loud?
The new model should work well with small batch sizes, and therefore requires less VRAM and inference time.
But I'm running a larger-scale job to support longer utterances, which will take longer to finish.
It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
> The new model should work well with small batch sizes, and therefore requires less VRAM and inference time.
It still gives better results with 4. It gave OK results at 1 and 2.
> It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
This might happen if gradio can't access /tmp/gradio
> It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
> This might happen if gradio can't access /tmp/gradio
I have changed $GRADIO_TEMP_DIR and $AUDIOCRAFT_DORA_DIR to a local directory, but it doesn't work.
> I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well?
> https://github.com/Sewlell/VoiceCraft-gradio-colab
Thanks! I have tested @zuev-stepan's fork and your colab, and I'm ready to merge. Could you push to this PR so I can incorporate your contribution?
> I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well? https://github.com/Sewlell/VoiceCraft-gradio-colab
> Thanks! I have tested @zuev-stepan's fork and your colab, and I'm ready to merge. Could you push to this PR so I can incorporate your contribution?
Done
> I've made a Colab version of @zuev-stepan's VoiceCraft fork. I think it should be part of the merge as well? https://github.com/Sewlell/VoiceCraft-gradio-colab
> Thanks! I have tested @zuev-stepan's fork and your colab, and I'm ready to merge. Could you push to this PR so I can incorporate your contribution?
I added the colab notebook to my repo. After the merge you should probably change the link to the colab notebook in README.md, and the link to the repo and the paths in voicecraft-gradio-colab.ipynb.
> It seems output_audio cannot be displayed after a successful run. Do you have any idea? @zuev-stepan @Sewlell
> This might happen if gradio can't access /tmp/gradio
> I have changed $GRADIO_TEMP_DIR and $AUDIOCRAFT_DORA_DIR to a local directory, but it doesn't work.
I finally solved this problem... Before, my $GRADIO_TEMP_DIR and $AUDIOCRAFT_DORA_DIR were /xxx/.cache/xxxx, then I changed .cache to tmp, and it works.
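In other words, both temp dirs have to point at a location the app can actually write to before gradio starts; a minimal sketch of the same fix done from Python (the paths below are placeholders, matching the /xxx/ pattern above):

```python
import os

# point both temp dirs at a writable location before launching the app
os.environ["GRADIO_TEMP_DIR"] = "/xxx/tmp/gradio"
os.environ["AUDIOCRAFT_DORA_DIR"] = "/xxx/tmp/audiocraft"
```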
