WhisperSpeech Create a Huggingface demo page

The output quality is good enough that it would be useful to allow more folks to test our model out on Huggingface.

Jan 10 '24 10:01 jpc

Hi @jpc, I'm very new to open source and want to start contributing. I came across this repo and saw the good first issue label. It would be really helpful if you could tell me how to work on this issue and where to start. I'm excited to work on it.

Jan 20 '24 14:01 harshitrwt

Hey, @Josephrp started working on this one just yesterday. Maybe you can reach out to him on the LAION Discord (link in the README, his username: Tonic) and ask if he needs any help.

Jan 20 '24 17:01 jpc

hey, what is the status of this feature? I can make it work this week. Can I help @Josephrp ?

Jan 24 '24 22:01 gongouveia

hey there @gongouveia @jpc @itsharshitrwt, i'm so glad i finally caught these, with all my apologies. i'll basically be finalizing the demo today, but help is totally welcome and encouraged! how fun :-)

outstanding tasks are simple:

review and improve the demo branding (and model card?) for whisperspeech: there's a lot of important information we want folks to have, and it should also be in the demo code :-)
we'll simplify the interface for sure: one last, thought-through refactor to make it super easy to do a multilingual voice print and multilingual text to voice
we'll create a simple text parser and audio parser to allow the user to input a multilingual string, like so: <pl> bobber kurkuma <fr> oh mon dieu qu'est-ce-que c'est ? <hin>ये महत्वाकांक्षी लक्ष्य हैं and get multilingual output, with emotion depending on whether or not an audio file is provided for the voice.
for V2 i think we can improve the text parser with autodetect, can't we?
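
A minimal sketch of what such a tag-based text parser could look like, assuming tags are two or three lowercase letters in angle brackets as in the example string above. The function name `parse_multilingual` and the `default_lang` fallback are hypothetical and not taken from the demo code.

```python
import re

# Assumed tag format, based on the example string: <pl>, <fr>, <hin>, ...
TAG_PATTERN = re.compile(r"<([a-z]{2,3})>")


def parse_multilingual(text, default_lang="en"):
    """Split a tagged string into (language, text) segments.

    Text before the first tag is assigned to default_lang (an assumption).
    """
    segments = []
    lang = default_lang
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((lang, chunk))
        lang = match.group(1)
        pos = match.end()
    # Flush whatever follows the final tag so the last segment is not lost.
    tail = text[pos:].strip()
    if tail:
        segments.append((lang, tail))
    return segments


example = "<pl> bobber kurkuma <fr> oh mon dieu qu'est-ce-que c'est ? <hin>ये महत्वाकांक्षी लक्ष्य हैं"
segments = parse_multilingual(example)  # one (language, text) pair per tagged passage
```

Each pair could then be passed to the synthesis step in order, so the language switches exactly where the user placed a tag.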

to summarize :

a simpler, more powerful and better looking single-screen app, with a little backend flair to show off the whisperspeech library in a good way = today's goal :-)

Jan 25 '24 08:01 Josephrp

@jpc @gongouveia @itsharshitrwt , oops forgot to share the demo link:

https://huggingface.co/spaces/Tonic/whisperspeech/

it's already been up for some days, and i've been sharing it around while gearing up, but not here yet, so sorry about that. please take a look there: you can fork that one and make PRs directly. the easiest is to coordinate on issues and make PRs accordingly, even if it takes a tiny bit more time :-)

Jan 25 '24 08:01 Josephrp

@itsharshitrwt @gongouveia , i've updated the space name here : https://huggingface.co/spaces/Tonic/whisperspeech/

Jan 25 '24 09:01 Josephrp

join me here : https://discord.gg/4r9akpvF?event=1200011410638393404

Jan 25 '24 09:01 Josephrp

Will join as soon as possible. I believe the space is already good enough to showcase. However, regarding the plan to simplify the interface, my personal preference is inputs on the left and outputs on the right. The Generate speech button is too big now, so I will propose a new interface configuration :)

Jan 25 '24 09:01 gongouveia

that's great, don't launch forth just yet because i'm pushing the next version

Jan 25 '24 09:01 Josephrp

i need help with that interface stuff :-) @gongouveia

Jan 25 '24 09:01 Josephrp

hey there, i refactored the functional demo code into better example code, so i'll just make a fork for that and push it to the readme or something (?) @jpc. a big thanks to jose, who helped out too :-)

Jan 25 '24 13:01 Josephrp

so i made a fork here : https://github.com/Josephrp/WhisperSpeech/tree/main with the gradio demo added

Jan 25 '24 13:01 Josephrp

or maybe it should go into a folder of its own with all the right attributes? it's just slightly different from the actual hosted demo for... reasons.

Jan 25 '24 13:01 Josephrp

@gongouveia @itsharshitrwt @jpc , open issues are in the discussions tab here : https://huggingface.co/spaces/Tonic/whisperspeech/discussions

reformat the language list (and therefore the interface layout - one last time :-) )
adding cool examples
stretch goal: a slightly better parser for handling punctuation and common user input formats ;-)

Jan 26 '24 12:01 Josephrp

@jpc just starting the final changes to the layout namely :

display the language tags inside
small reformat to clean up the "tabs": put the language tags at the bottom with display=false by default, rename that section Language Tags or help, plus small improvements for the user

afterwards i will port these changes to the demo app code here : https://github.com/collabora/WhisperSpeech/issues/39#issuecomment-1910207333

Jan 27 '24 09:01 Josephrp

so, i do prefer the new layout and language list, just pushing to my fork and standing by for your instructions for how to proceed @jpc

Jan 27 '24 10:01 Josephrp

@itsharshitrwt @gongouveia : see the PR here : https://github.com/collabora/WhisperSpeech/pull/63 it would be useful to work together on the outstanding issues

Jan 27 '24 11:01 Josephrp

I will play around with it a bit today

Jan 27 '24 12:01 gongouveia

@Josephrp I'm looking at it now. Will work on:

New examples.
Caching for examples.
Adding a queue (never tried it with ZeroGPU).
Rewriting the text, it has some errors.
Redoing the layout: it should be inputs, then outputs. Outputs above doesn't make sense.
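
A rough sketch of how these items could fit together in Gradio, assuming a Blocks layout with inputs in the left column and outputs in the right. `EXAMPLES`, `synthesize` and `build_demo` are hypothetical placeholders, not the code of the hosted demo, and the `<en>` tag is assumed to be valid.

```python
# Hypothetical example rows: [tagged text, optional path to a speaker audio file]
EXAMPLES = [
    ["<en> Hello there <fr> bonjour tout le monde", None],
    ["<pl> bobber kurkuma", None],
]


def synthesize(multilingual_text, speaker_audio):
    """Placeholder for the real text-to-speech call; returns no audio."""
    return None


def build_demo():
    import gradio as gr  # imported lazily so this module loads without Gradio

    with gr.Blocks() as demo:
        with gr.Row():
            with gr.Column():  # inputs on the left
                text = gr.Textbox(label="Multilingual text")
                voice = gr.Audio(label="Speaker audio (optional)", type="filepath")
                # Collapsed by default, as suggested for the language tags
                with gr.Accordion("Language Tags", open=False):
                    gr.Markdown("Prefix each passage with a tag such as `<en>` or `<fr>`.")
                button = gr.Button("Generate speech")
            with gr.Column():  # outputs on the right
                output = gr.Audio(label="Generated speech")
        button.click(synthesize, inputs=[text, voice], outputs=output)
        # cache_examples can be switched to True once synthesize returns real audio
        gr.Examples(examples=EXAMPLES, inputs=[text, voice], outputs=output,
                    fn=synthesize, cache_examples=False)
    demo.queue()  # queue incoming requests instead of running them all at once
    return demo
```

Calling `build_demo().launch()` would start the app locally. Whether example caching and the queue behave well on ZeroGPU hardware is untested here.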

Jan 27 '24 22:01 gongouveia

sorry... outputs above is standard, here are three examples with outputs above :

github issues
ableton
every ai chat app ever.

just check any of these ;-)

hope you have fun, let's connect whenever you want, you can find me :-)

Jan 27 '24 23:01 Josephrp

Got it. I fancy it like this, what's your opinion?

Btw, in the current version the mic/upload audio button is messed up. I will add examples with text/voice pairs tomorrow.

Jan 27 '24 23:01 gongouveia

i also noticed that about the mic and upload icons on my 32-bit windows replacement computer, but i thought it was just my OS :-)

i simply cannot imagine what i need to fix about it because it's... a gradio element i guess, nothing custom... it's a headscratcher, so i'll think with you about it.

the new layout is much cleaner: the smaller dropdowns are easier to see, and having them with the text makes more sense. if we're quickly experimenting, we could even switch it with the audio upload, i mean, just to see if it's better/same/worse.

one suggestion, maybe, for the audio+text examples is to also put some examples on the audio upload side, more clicking on easy things is always great. (just my random thoughts! sorry if it's wrong :-) )

final thing :

check the demo actually, i think it could be you, or maybe not ? but @mrfakename basically made a really cool PR , eg :

@spaces.GPU(enable_queue=True)
def whisper_speech_demo(multilingual_text, speaker_audio):

this actually fixed some performance issues i was seeing with a lot of languages together. i'm very happy, so check it out in case it's not your push :-)

Jan 27 '24 23:01 Josephrp

It is not my push. I also checked whether this gradio element was just working badly in Microsoft Edge, but it has the same behavior in Chrome. I also recently found that the audio element behaves differently in different browsers: in Chrome it does some weird audio normalization.

Tell me, how many seconds long must the audio samples be? 30 seconds, as in Whisper?

Jan 27 '24 23:01 gongouveia

i didn't realize there was an upper limit, but normally more is better: 20 seconds or more works best.

btw i'm having performance issues now on huggingface, but i think it's just huggingface, since uhm the demo was working not too long ago... might revert the changes, but i'll check it tomorrow :-)

Jan 28 '24 00:01 Josephrp

https://huggingface.co/spaces/Tonic/whisperspeech/discussions/4

check this discussion for the logs from an error that... is strange. i'm investigating it in the context of the recent collabora updates; help there is also more than welcome.

Jan 28 '24 10:01 Josephrp

The way the Whisper architecture transcribes long audio is by chunking it into 30-second samples with a little overlap between chunks.
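
The chunking idea can be sketched as follows. This is a simplified fixed-stride illustration, not Whisper's actual long-form decoding logic; the function name `chunk_audio` and the one-second overlap are assumptions, while 16 kHz is the sample rate Whisper expects.

```python
def chunk_audio(samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=1.0):
    """Split a sequence of samples into fixed-length chunks that overlap slightly."""
    chunk_len = int(chunk_seconds * sample_rate)
    step = chunk_len - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break  # this chunk already reaches the end of the audio
        start += step
    return chunks


# 70 seconds of silence at 16 kHz; the final chunk is shorter than 30 seconds
audio = [0.0] * (70 * 16000)
chunks = chunk_audio(audio)
```

The last chunk is simply left shorter here. A real pipeline would usually pad it to the full window length before passing it to the model.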

I'm lacking the creativity to find voice samples, plus I don't know what may be copyrighted, though. Maybe I will use some LibriSpeech dataset samples.

Jan 29 '24 09:01 gongouveia

interesting! so maybe a little session of peer coding is in order to wrap up the outstanding issues. normally chunking in 30 seconds should be easy and possible, plus it seems like it's quite hard to record a whole 30 seconds :-) honestly i was just going to upload my own voice saying goofy and feel-good stuff :-) i think it's already too late to protect my voice print...

Jan 29 '24 11:01 Josephrp

hmmm @gongouveia, tonight maybe it would be nice to also reorganise the interface: currently there's audio upload and text in a row below the output... but... you're right, it doesn't make sense... maybe it could also look a bit more slick imho, by moving the upload to its own row, perhaps "on top of" the text input. i'm thinking it could be combined with the language tags for extra information in a good way, perhaps as rows inside a single accordion.

Jan 29 '24 13:01 Josephrp

You mean peer coding on Discord? If that's the case, I believe our time zones are very different.

Jan 29 '24 13:01 gongouveia

so basically i experimented with the interface a little bit, i think it's maybe better than before, open to changes.

just one issue outstanding with the parser: it could be that the last parsed string is not getting passed, and a workaround could be to append a sort of end-of-string tag before parsing it... so that it parses correctly. would you want to help me take a look?
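
One way this kind of bug can arise, sketched under assumptions: if the parser only emits a segment when it meets the next tag, whatever follows the final tag is never emitted. The code below is a guess at the failure mode, not the demo's actual parser; `split_on_tags`, `split_with_sentinel` and the `<end>` tag are hypothetical names.

```python
import re

TAG = re.compile(r"<([a-z]{2,3})>")  # assumed tag format; "<end>" acts as a sentinel


def split_on_tags(text):
    """Emit a (lang, text) pair only when the next tag is encountered."""
    segments, lang, pos = [], None, 0
    for match in TAG.finditer(text):
        if lang is not None:
            segments.append((lang, text[pos:match.start()].strip()))
        # The sentinel closes the previous segment without opening a new one.
        lang = None if match.group(1) == "end" else match.group(1)
        pos = match.end()
    return segments  # text after the final tag is never emitted


def split_with_sentinel(text):
    """Workaround: append an end tag so the final segment is closed too."""
    return split_on_tags(text + " <end>")


text = "<pl> bobber kurkuma <fr> oh mon dieu"
dropped = split_on_tags(text)      # the <fr> segment is missing
fixed = split_with_sentinel(text)  # both segments are present
```

An alternative to the sentinel is to flush the remaining text once after the loop, which avoids reserving a tag name.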

Jan 29 '24 17:01 Josephrp