IMS-Toucan HF Gradio demo: sudden gender flip for slider

I've added Toucan to the TTS Arena fork by using the MassivelyMultilingualTTS space. Arena: https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena TTS Space: https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS

After some time, the values of the "Gender of artificial Voice" slider flip. I had assumed that -10 always meant the lowest average pitch and +10 the highest, which would make it a male-to-female slider in that order. Yet the direction sometimes reverses.

Is something in the model reconfiguring?

Right now, a positive value means male gender on the space.

Oct 24 '24 11:10 Pendrokar

Hi, thanks for including Toucan in the Arena!

The gender slider is not related to the pitch. It specifies a rotation around a principal component axis in the latent space of the speaker embedding generator.

If no voice reference is given, the system will use an artificial speaker embedding that is not linked to any real human, but is instead generated by a GAN that learned to match the distribution of speaker embeddings. This generation process can be manipulated by this rotation. The direction of the rotation is not always the same, since a generated artificial speaker embedding might be flipped upside-down through a rotation on another axis. So the slider does not have a static direction: we can never know whether masculine or feminine is to the left or to the right. It is different for every speaker embedding, and a new set of speaker embeddings is generated with every restart of the space. So every day there are new voices.
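To illustrate why the direction can flip, here is a toy sketch in three dimensions. It is not the Toucan code: the 3-D vectors, the function names, and the choice of the z coordinate as a stand-in for perceived gender are all assumptions made for illustration. The same "slider" rotation about the x axis is applied to an embedding and to a copy of it that has been turned by 180 degrees about another axis.

```python
import math

def rotate_about_x(v, theta):
    """Rotate a 3-D vector v = (x, y, z) by theta radians about the x axis."""
    x, y, z = v
    c, s = math.cos(theta), math.sin(theta)
    return (x, c * y - s * z, s * y + c * z)

def flip_about_z(v):
    """Rotate v by 180 degrees about the z axis (hypothetical 'upside-down' flip)."""
    x, y, z = v
    return (-x, -y, z)

def slider_effect(v, theta):
    """Change in the z coordinate (assumed stand-in for perceived gender)."""
    return rotate_about_x(v, theta)[2] - v[2]

# Two toy "speaker embeddings" that differ only by a flip on another axis.
embedding_a = (0.0, 1.0, 0.0)
embedding_b = flip_about_z(embedding_a)

theta = 0.3  # the same slider setting for both embeddings
effect_a = slider_effect(embedding_a, theta)
effect_b = slider_effect(embedding_b, theta)
print(effect_a, effect_b)  # same magnitude, opposite signs
```

In this toy setting the identical slider value pushes the two embeddings in opposite directions, which mirrors the behaviour described above.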

For the arena, it's probably a good idea to keep the speaker always the same, right? I can make the random seed static, then we always have the same voices. Or, since the arena only supports English, I can make a separate space from which you can use the API that uses the real default embedding and not a generated artificial one.
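A minimal sketch of what a static seed buys, assuming the artificial voices are derived from random latent vectors. `sample_latents` and `LATENT_DIM` are hypothetical names, and the standard-library `random` module stands in for whatever generator the space actually uses.

```python
import random

LATENT_DIM = 16  # hypothetical latent size, chosen only for illustration

def sample_latents(num_voices, seed=None):
    """Draw one Gaussian latent vector per voice; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # seed=None falls back to a system-derived seed
    return [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            for _ in range(num_voices)]

# Two "restarts" with the same seed yield identical latents, hence identical voices.
run_1 = sample_latents(3, seed=42)
run_2 = sample_latents(3, seed=42)
print(run_1 == run_2)  # True
```

With the seed fixed, a given slider value would keep pointing in the same direction for a given voice across restarts.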

Oct 24 '24 14:10 Flux9665

Ok, so I am not going crazy.

Also, cloning never works for me. It still seems to use the generated artificial speaker.

I am thinking of using multiple voices and languages for the arena in the future. But for now it is a single female American-English voice.

So I would still need a more deterministic outcome.

[edit] Toucan is being rejected even in favor of the lowest-ranked models, such as OpenVoice2 and WhisperSpeech: https://huggingface.co/datasets/Pendrokar/TTS_Arena/viewer/default/train?f[rejected][value]=%27Flux9665/MassivelyMultilingualTTS%27

Oct 25 '24 10:10 Pendrokar

I made a space that you can use for this. It features just a female American English voice, and the inputs are greatly simplified: it's just the text and nothing else.

https://huggingface.co/spaces/Flux9665/EnglishToucan

Without the artificial speaker embeddings, I'm expecting much better and much more consistent results that more accurately reflect what the model is capable of.

Oct 25 '24 11:10 Flux9665