Create a Huggingface demo page
The output quality is good enough that it would be useful to allow more folks to test our model out on Huggingface.
Hi @jpc, I'm very new to open source and want to start contributing. I came across this repo and saw the good first issue label. It would be really helpful if you could tell me how to work on this issue and where to start. I'm excited to work on it.
Hey, @Josephrp started working on this one just yesterday. Maybe you can reach out to him on the LAION Discord (link in the README, his username: Tonic) and ask if he needs any help.
Hey, what is the status of this feature? I can work on it this week. Can I help, @Josephrp?
hey there @gongouveia @jpc @itsharshitrwt, i'm so glad i finally caught these. with all my apologies, i'll basically be finalizing the demo today, but help is totally welcome and encouraged! how fun :-)
outstanding tasks are simple :
- review and improve the demo branding (and model card ?) for whisperspeech , there's a lot of important information we want folks to have and also have that in the demo code :-)
- we'll simplify the interface for sure: one last, thought-through refactor to make it super easy to do a multilingual voice print and multilingual text to voice
- we'll create a simple text parser and audio parser to basically allow the user to input a multilingual string, like so:
<pl> bobber kurkuma <fr> oh mon dieu qu'est-ce-que c'est ? <hin>ये महत्वाकांक्षी लक्ष्य हैं
and get multilingual output with emotion, depending on whether or not an audio file is provided for the voice.
- for V2 i think we can improve the text parser with autodetect, can't we?
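A minimal sketch of what such a text parser could look like, assuming tags are short lowercase codes in angle brackets as in the example string. The function name `parse_multilingual`, the `default_lang` fallback and the tag pattern are all hypothetical and not taken from the demo code:

```python
import re

# Assumed tag format: 2-3 lowercase letters in angle brackets, e.g. <pl>, <fr>, <hin>.
TAG_PATTERN = re.compile(r"<([a-z]{2,3})>")

def parse_multilingual(text, default_lang="en"):
    """Split a tagged string into (language, text) pairs."""
    # With one capture group, re.split returns
    # [text_before_first_tag, lang1, text1, lang2, text2, ...].
    parts = TAG_PATTERN.split(text)
    segments = []
    # Text before the first tag gets the (assumed) default language.
    leading = parts[0].strip()
    if leading:
        segments.append((default_lang, leading))
    # Pair every captured language code with the text that follows it.
    for lang, chunk in zip(parts[1::2], parts[2::2]):
        chunk = chunk.strip()
        if chunk:
            segments.append((lang, chunk))
    return segments

if __name__ == "__main__":
    for lang, chunk in parse_multilingual("<pl> bobber kurkuma <fr> oh mon dieu"):
        print(lang, chunk)
```

Each pair could then be synthesized separately and the audio concatenated; how the real demo does that is not shown here.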
to summarize :
a simpler, more powerful and better-looking single-screen app with a little backend flair to show off the whisperspeech library in a good way = today's goal :-)
@jpc @gongouveia @itsharshitrwt , oops forgot to share the demo link:
https://huggingface.co/spaces/Tonic/whisperspeech/
it's already been up for some days... and i've just been kinda sharing it around as i've been gearing up, but not here yet, so sorry about that. please take a look there. you can fork that one and make PRs directly; the easiest is to coordinate on issues and make PRs accordingly, even if it takes a tiny bit more time :-)
@itsharshitrwt @gongouveia , i've updated the space name here : https://huggingface.co/spaces/Tonic/whisperspeech/
join me here : https://discord.gg/4r9akpvF?event=1200011410638393404
Will join as soon as possible. I believe the space is already good enough to showcase. However, regarding "we'll simplify the interface for sure one last and thought-through refactor to make super easy to do a multilingual voice print and a multilingual text to voice", my personal preference is inputs on the left and outputs on the right. The Generate speech button is too big now; I will propose a new interface configuration :)
that's great, don't launch forth just yet because i'm pushing the next version
i need help with that interface stuff :-) @gongouveia
hey there, i refactored the functional demo code into better example code, so i'll just make a fork for that and push to the readme or something (?) @jpc with a big thanks to jose who helped out too :-)
so i made a fork here : https://github.com/Josephrp/WhisperSpeech/tree/main with the gradio demo added
or maybe it should go into a folder of its own with all the right attributes? it's just slightly different from the actual hosted demo for... reasons.
@gongouveia @itsharshitrwt @jpc , open issues are in the discussions tab here : https://huggingface.co/spaces/Tonic/whisperspeech/discussions
- reformat the language list (and therefore the interface layout - one last time :-) )
- adding cool examples
- stretch goal : slightly better parser for handling punctuation and common user input formats ;-)
@jpc just starting the final changes to the layout namely :
- display the language tags inside
- small reformat to clean up the "tabs": put the language tags on the bottom, display=false by default; it should be renamed Language Tags or Help; plus small improvements for the user
afterwards i will port these changes to the demo app code here : https://github.com/collabora/WhisperSpeech/issues/39#issuecomment-1910207333
so, i do prefer the new layout and language list, just pushing to my fork and standing by for your instructions for how to proceed @jpc
@itsharshitrwt @gongouveia : see the PR here : https://github.com/collabora/WhisperSpeech/pull/63 it would be useful to work together on the outstanding issues
I will play around with it a bit today
@Josephrp I'm looking at it now. Will work on:
- New examples
- Caching for examples
- Adding a queue (never tried it with ZeroGPU)
- Rewriting the text, it has some errors
- Redoing the layout: it should be inputs, then outputs. Outputs above doesn't make sense.
sorry... outputs above is standard, here are three examples with outputs above :
- github issues
- ableton
- every ai chat app ever.
just check any of these ;-)
hope you have fun, let's connect whenever you want you can find me :-)
Got it,
I fancy it like this, what's your opinion?
Btw, in the current version the mic/upload audio button is messed up. Will add examples with text/voice pairs tomorrow.
i also noticed that about the mic and upload icons on my replacement 32-bit windows computer, but i thought it was just my OS :-)
i simply cannot imagine what i need to fix about it because it's... a gradio element i guess, nothing custom... it's a headscratcher , so i'll think with you about it .
the new layout is much cleaner; the smaller dropdowns are easier to see, and having them with the text makes more sense. if we're quickly experimenting, we could even try switching it with the audio upload, i mean, just to see if it's better/same/worse.
one suggestion, maybe, for the audio+text examples is to also put some examples on the audio upload side; more clicking on easy things is always great. (just my random thoughts! sorry if it's wrong :-) )
final thing :
check the demo actually, i think it could be you, or maybe not ? but @mrfakename basically made a really cool PR , eg :
@spaces.GPU(enable_queue=True)
def whisper_speech_demo(multilingual_text, speaker_audio):
this actually fixed some performance issues i was seeing with a lot of languages together. i'm very happy, so check it out in case it's not your push :-)
It is not my push. I also checked whether this gradio element was just working badly in Microsoft Edge, but it has the same behavior in Chrome. I also recently found that the audio element behaves differently in different browsers; in Chrome it does some weird audio normalization.
Tell me, how many seconds long must audio samples be? 30 seconds, as in Whisper?
i didn't realize there was an upper limit, but normally more is better: 20 seconds or more.
btw i'm having performance issues now on huggingface, but i think it's just huggingface, since uhm the demo was working not too long ago... might revert the changes, but i'll check it tomorrow :-)
https://huggingface.co/spaces/Tonic/whisperspeech/discussions/4
check this discussion for the logs from a strange error. i'm investigating it in the context of the recent collabora updates; help there is more than welcome too.
The way the Whisper architecture transcribes long audio is by chunking it into 30-second samples with a little overlap between chunks.
I am lacking the creativity to find voice samples, plus I don't know what may be copyrighted. Maybe I will use some LibriSpeech dataset samples.
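The 30-second chunking with overlap mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `chunk_audio` and the 5-second overlap are hypothetical, and it is not taken from Whisper or WhisperSpeech code. The 16 kHz default is the sample rate Whisper works with.

```python
def chunk_audio(samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=5.0):
    """Split a sequence of samples into fixed-length chunks that overlap."""
    chunk_len = int(chunk_seconds * sample_rate)
    # Each new chunk starts this many samples after the previous one.
    step = chunk_len - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    chunks = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + chunk_len])
        # Stop once the current chunk has reached the end of the audio.
        if start + chunk_len >= len(samples):
            break
        start += step
    return chunks

if __name__ == "__main__":
    audio = [0.0] * (16000 * 70)  # 70 seconds of silence as stand-in audio
    print([len(c) / 16000 for c in chunk_audio(audio)])
```

The last chunk is simply shorter than the rest; padding it to the full length would be a separate design choice.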
interesting! so maybe a little session of peer coding is in order to wrap up the outstanding issues. normally chunking in 30 seconds should be easy and possible + it seems like it's quite hard to record a whole 30 seconds :-) honestly i was just going to upload my own voice saying goofy and feel-good stuff :-) i think it's too late to protect my voice print already...
hmmm @gongouveia, tonight maybe it would be nice to also reorganise the interface: currently there's audio upload and text in a row below the output... but... you're right, it doesn't make sense... maybe it could also look a bit more slick imho, so perhaps by moving the upload to its own row, and perhaps "on top of" the text input, thinking that it could be combined with the language tags for extra information in a good way, perhaps as rows inside a single accordion.
You mean peer coding on Discord? If that's the case, I believe our time zones are very different.
so basically i experimented with the interface a little bit, i think it's maybe better than before, open to changes.
just one issue outstanding with the parser: it could be that the last parsed string is not getting passed, and a workaround could be to append a sort of end-of-string tag before parsing it, so that it gets parsed correctly. would you want to help me take a look?
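A rough sketch of the kind of bug described above and of the end-of-string tag workaround. It assumes a parser that only emits a segment when it meets the next tag. The names `parse_segments` and `SENTINEL` and the tag format are hypothetical, and the real demo parser may work differently:

```python
import re

# Assumed tag format: 2-3 lowercase letters in angle brackets, e.g. <pl>, <fr>.
TAG_PATTERN = re.compile(r"<([a-z]{2,3})>")
# Hypothetical end-of-string tag; it would clash with a real language code "end".
SENTINEL = "<end>"

def parse_segments(text, add_sentinel=True):
    """Emit (language, text) each time a tag is reached.

    Without the sentinel, the text after the last tag is never emitted,
    because no further tag arrives to trigger it.
    """
    if add_sentinel:
        text = text + SENTINEL
    segments = []
    lang, pos = None, 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        # Text before the first tag has no language yet and is skipped here.
        if lang is not None and chunk:
            segments.append((lang, chunk))
        lang = match.group(1)
        pos = match.end()
    return segments

if __name__ == "__main__":
    example = "<pl> bobber kurkuma <fr> oh mon dieu"
    print(parse_segments(example, add_sentinel=False))  # last segment missing
    print(parse_segments(example))                      # last segment present
```

An alternative to the sentinel would be to flush the remaining text once after the loop, which avoids reserving a tag name.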