Could this theoretically be retrained from scratch to generate singing vocals?
Given a 10k hour dataset of singing vocals (instead of the current audiobook reading content), could this model be ported to be able to sing/generate vocals?
@Saltb0xApps I was thinking the same thing, and also about adding another conditioning input, such as background music for the generated audio to follow. I am not exactly an ML engineer, but here is my rough thinking:
- Text Encoder (Unchanged): Continues to map text descriptions to a sequence of hidden-state representations using a frozen text encoder initialized from Flan-T5.
- Music Encoder: A new component that takes background music as input and generates a music-conditioned representation using a pretrained music autoencoder (DAC or EnCodec). This encoder analyses the background music to extract features such as tempo, key, mood, and rhythm, which are then used to condition the generated speech.
- Parler-TTS Decoder (Modified): The decoder now auto-regressively generates audio tokens conditioned not only on the encoder hidden-state representations (from text) but also on the music-conditioned representation. To incorporate the music-conditioned representation, you could either (a rough sketch follows this list):
  - Concatenate: Directly concatenate the music-conditioned representation with the text-conditioned hidden states before feeding them into the decoder.
  - Cross-Attention Modification: Integrate the music-conditioned representation into the cross-attention layers of the decoder, allowing the decoder to attend to both text and music features simultaneously.
- Audio Codec (Unchanged): Continues to recover the audio waveform from the audio tokens predicted by the decoder, using the DAC model or EnCodec as preferred.
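To make the decoder change a bit more concrete, here is a minimal PyTorch sketch of the "Concatenate" option. This is not code from the Parler-TTS repo, just an illustration; the module name, dimensions, and the projection layer are all my assumptions.

```python
import torch
import torch.nn as nn


class MusicConditioner(nn.Module):
    """Hypothetical adapter: projects music-encoder features to the text hidden
    size and prepends them, so the decoder cross-attends to both modalities."""

    def __init__(self, music_dim: int, text_dim: int):
        super().__init__()
        self.music_proj = nn.Linear(music_dim, text_dim)

    def forward(self, text_hidden_states: torch.Tensor, music_features: torch.Tensor) -> torch.Tensor:
        # text_hidden_states: (batch, text_len, text_dim) from the frozen Flan-T5 encoder
        # music_features:     (batch, music_len, music_dim) from the music encoder
        music_tokens = self.music_proj(music_features)
        # "Concatenate" option: one combined sequence for the decoder's cross-attention
        return torch.cat([music_tokens, text_hidden_states], dim=1)


# Shapes are illustrative only
conditioner = MusicConditioner(music_dim=768, text_dim=1024)
text_states = torch.randn(2, 64, 1024)
music_feats = torch.randn(2, 200, 768)
encoder_states = conditioner(text_states, music_feats)  # (2, 264, 1024), fed to the decoder
```

The cross-attention variant would instead add a second attention block (or a second set of key/value projections) inside each decoder layer, which is more invasive but keeps the two modalities separate.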
@sanchit-gandhi Does this sound feasible?
Hey @Saltb0xApps @adamfils - this sounds like it would work. The only change that I would make would be using a more powerful audio encoder to extract more meaningful representations from the music conditioning (e.g. warm starting an audio encoder from the HuBERT model to extract music embedding representations). Using DAC or EnCodec alone is only going to provide you with a down-sampled version of the music inputs, rather than any features that encode tempo, key, mood, and rhythm, etc. This is then analogous to what the Flan-T5 encoder does for the text conditioning.
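For illustration, extracting such music embeddings from a warm-started HuBERT checkpoint could look roughly like the sketch below, using the facebook/hubert-base-ls960 checkpoint from transformers. The file path is a placeholder, and whether these frame-level features capture tempo/key/mood well enough for this use case is something you'd have to verify.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

# "background_music.wav" is a placeholder path
waveform, sr = torchaudio.load("background_music.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # HuBERT expects 16 kHz

inputs = feature_extractor(
    waveform.mean(dim=0).numpy(), sampling_rate=16_000, return_tensors="pt"
)
with torch.no_grad():
    # (1, num_frames, 768): one embedding every ~20 ms, usable as music conditioning
    music_embeddings = hubert(**inputs).last_hidden_state
```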
Note that you could use something similar to train a TTS model that has text and voice conditioning as well (just replace the music conditioning with a voice sample in the flowchart above). You could then give it a 2-second voice prompt to control the style of generated voice, and then control how fast/slow or animated/monotonous the speech is using the text prompt.
@sanchit-gandhi / @adamfils Thank you for providing a detailed response! I have a large dataset of around 1k hours of vocals only (separated out of SoundCloud songs using Demucs), with their lyrics/transcriptions obtained using Whisper. I was wondering if I could take that vocals-only dataset, combine it with dataspeech info for style, and retrain Parler-TTS to output only singing vocals.
The idea is to create a robust singing-vocals version of Parler-TTS that generates only singing vocals instead of regular speech.
Would this require code-level changes as mentioned above, or would simply retraining Parler-TTS on this singing-vocals dataset be a good enough starting point for something that can generate vocals only?
Hey @Saltb0xApps, this would totally work, as you only need 3 things from your dataset for training:
- Audio samples
- Transcriptions
- Text conditioning
Parler-TTS is agnostic to the text conditioning and audio samples you're using!
Also, 1k hours should be enough to get a good-enough model from scratch. You can also explore fine-tuning the current model, as it has already learned some acoustic features and how to associate text tokens with acoustic sounds!
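To make those three requirements concrete, here is roughly what a row of such a singing dataset could look like when built with the datasets library. The column names (audio, text, description) and the example values are purely illustrative; the training config just has to point at whatever column names you actually use.

```python
from datasets import Audio, Dataset

# Hypothetical column names: Parler-TTS training only needs an audio column,
# a transcription column, and a free-form description column to condition on.
rows = {
    "audio": ["vocals_0001.wav"],                       # placeholder file path
    "text": ["I've been up all night chasing echoes"],  # Whisper transcription of the sung lyrics
    "description": [
        "A female singer delivers an energetic pop vocal at a fast tempo, "
        "with very clear audio quality."
    ],
}

dataset = Dataset.from_dict(rows).cast_column("audio", Audio(sampling_rate=44_100))
print(dataset.features)  # audio is decoded lazily; the descriptions drive the conditioning
```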
@ylacombe I curated the dataset of 1,000 hours of vocals (mostly English). Would love to hear your thoughts on whether this dataset could work:
- audio + transcriptions - https://huggingface.co/datasets/AkhilTolani/vocals
- transcriptions + dataspeech tags - https://huggingface.co/datasets/AkhilTolani/vocals-stripped-pitch-text-bins-descriptions
I just started a fine-tune run here on 2x A100 GPUs. Would love to know if my fine-tune parameters are adjusted correctly for the difference in hardware/dataset size - https://wandb.ai/akhiltolani/parler-speech/runs/wkh5eor3/overview?nw=nwuserakhiltolani
@Saltb0xApps Awesome, please share some of your audio outputs as the training proceeds!
Hey @Saltb0xApps, wow thanks for sharing this! How did you create this dataset out of curiosity?
A few remarks:
- I'm pretty sure the model can learn from your samples, but it would surely benefit from more precise tags: for now it only uses the dataspeech tags, whereas you would probably need more signals about the singing voices (for example, the singing style, the notes, etc.). Maybe you can get some additional features from the way you created your dataset; would you like to describe this a bit more?
- You should also probably modify the prompt used to create the descriptions a bit. The current one is about delivering speech, whereas you want to deliver singing.
- I'll take a proper look at the HPs ASAP.
Thank you also for sharing your logs, they bring a lot of value to the community! I really like your initiative! If you're okay with this, we can probably make a big splash from your model once the results are what we expect. What do you think?
I've listened to some samples, the model seems to get a sense of singing, which is a good sign! It'd probably need some better hyper-parameters though!
Re: Hyper-parameters
- You've got 1k hours of audio, so you should raise your global batch size to get closer to the one we used for training (196); you can do this by setting gradient_accumulation_steps to something around 6.
- You might have to experiment with the learning rate; for now you can leave it as it is, but it's a bit too high IMO.
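For reference, the back-of-the-envelope calculation behind the gradient_accumulation_steps suggestion, assuming a per-device batch size of 16 on the 2x A100 setup (the per-device value is my assumption; plug in whatever the run actually uses):

```python
num_gpus = 2                      # the 2x A100 setup above
per_device_batch_size = 16        # assumption: substitute the run's actual value
gradient_accumulation_steps = 6   # suggested value

global_batch_size = num_gpus * per_device_batch_size * gradient_accumulation_steps
print(global_batch_size)          # 192, close to the ~196 used for the original training
```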
> Hey @Saltb0xApps, wow thanks for sharing this! How did you create this dataset out of curiosity?
> A few remarks:
> - I'm pretty sure the model can learn from your samples, but it would surely benefit from more precise tags: for now it only uses the dataspeech tags, whereas you would probably need more signals about the singing voices (for example, the singing style, the notes, etc.). Maybe you can get some additional features from the way you created your dataset; would you like to describe this a bit more?
> - You should also probably modify the prompt used to create the descriptions a bit. The current one is about delivering speech, whereas you want to deliver singing.
> - I'll take a proper look at the HPs ASAP.
Thank you! The dataset is just music sourced online. I separated the vocals out using Demucs, used pydub silence detection to chunk them, and then transcribed them with Whisper large/medium.
I'm happy to add more info about the singing, but we'll need to come up with some automatic techniques to detect signals like singing style/notes. I was thinking LP-MusicCaps (https://github.com/seungheondoh/lp-music-caps), but open to other ideas for this.
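For the note/pitch side specifically, one cheap automatic option (a rough sketch, not something dataspeech provides out of the box) would be an f0 tracker such as librosa's pyin, mapping the voiced frames to note names that could then be turned into tags. The file path and the note range are placeholders.

```python
import librosa
import numpy as np

y, sr = librosa.load("vocal_chunk.wav", sr=None)  # placeholder path
f0, voiced_flag, _ = librosa.pyin(
    y, sr=sr, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6")
)

voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
if voiced_f0.size:
    median_note = librosa.hz_to_note(np.median(voiced_f0))
    lowest, highest = librosa.hz_to_note(voiced_f0.min()), librosa.hz_to_note(voiced_f0.max())
    # e.g. turn this into a tag like "sings around A3, spanning E3 to C5"
    print(median_note, lowest, highest)
```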
I decided to keep the prompt the same and not introduce additional tags, since I'm fine-tuning the base model instead of training one from scratch. The idea is that the model will basically "sing" instead of "talk", with everything else staying the same.
> Thank you also for sharing your logs, they bring a lot of value to the community! I really like your initiative! If you're okay with this, we can probably make a big splash from your model once the results are what we expect. What do you think?
Absolutely! I'd love to connect and discuss a plan over Discord/email. Let me know if that works for you!
> I've listened to some samples, the model seems to get a sense of singing, which is a good sign! It'd probably need some better hyper-parameters though!
> Re: Hyper-parameters
> - You've got 1k hours of audio, so you should raise your global batch size to get closer to the one we used for training (196); you can do this by setting gradient_accumulation_steps to something around 6.
> - You might have to experiment with the learning rate; for now you can leave it as it is, but it's a bit too high IMO.
@ylacombe Ah thank you for sharing this! I definitely agree 1e-4 is too high. I'll do another run later today with 4e-5 as the lr and gradient_accumulation_steps set to 6.
@ylacombe I believe Parler-TTS is going to release a larger model very soon, if I remember correctly? 600M params is comparable to MusicGen small, and I believe there is a major qualitative difference between 3B+ params and 600M params (at least in MusicGen)! I'd imagine fine-tuning the Parler-TTS large model would most likely also give much better singing results!
Would really appreciate it if you could share any details about the large model (param size, dataset size, estimated launch date, etc.) if possible :)
Fine-tune run 2, with a less aggressive learning rate and gradient_accumulation_steps = 6: https://wandb.ai/akhiltolani/parler-speech/runs/mv9dd4hz/overview?nw=nwuserakhiltolani
Here is a Hugging Face Space to try out the singing-vocals fine-tune of Parler-TTS! https://huggingface.co/spaces/AkhilTolani/vocals
- The model is having a very hard time differentiating between male and female vocals. Maybe it's my training dataset?
- Ideas on how to generate consistently long vocals based on speaker_id in chunks? (2 min+ for practical use?)
- Need to figure out how to determine `min_length` or `min_new_tokens` based on the input text so that the model doesn't miss a few words. Could do something very rudimentary like `input_prompt_word_count * 0.8` (or any simple formula that almost gets it right); a rough sketch follows this list.
- The model occasionally just generates screeching noises with no coherent words. Need to determine why.
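For the `min_length`/`min_new_tokens` point, a rough heuristic could be to estimate a duration from the word count and convert it to codec frames. The words-per-second and frame-rate numbers below are assumptions that would need tuning against the dataset.

```python
def estimate_min_new_tokens(prompt: str,
                            words_per_second: float = 2.0,   # assumption: sung lyrics are slower than speech
                            codec_frame_rate: float = 86.0,  # assumption: DAC at 44.1 kHz, ~86 frames/s
                            safety_factor: float = 0.8) -> int:
    """Crude lower bound on the number of audio tokens so generation doesn't stop early."""
    word_count = len(prompt.split())
    estimated_seconds = word_count / words_per_second
    return int(estimated_seconds * codec_frame_rate * safety_factor)

min_new_tokens = estimate_min_new_tokens("I've been up all night chasing echoes in the rain")
# then pass min_new_tokens=min_new_tokens to model.generate(...)
```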
Hey @Saltb0xApps, you can reach out to me by mail at yoach [at] huggingface.co! This run is definitely better, and I really love the Space you've created. Let's discuss offline how we can make this even better. I believe you could use a mix of automatic features and generated features to create better singing-voice descriptions.
Also, some of the dataspeech features might not be well suited here.
> Ideas on how to generate consistently long vocals based on speaker_id in chunks? (2 min+ for practical use?)
This needs a bit of hacking around the model, e.g. by adding speaker names to the descriptions to indicate to the model that it should keep the voice consistent.
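As a rough illustration of that trick at inference time, using the standard Parler-TTS generation API with the same name in every description chunk. The checkpoint below is the base mini model (you'd point it at the singing fine-tune instead), the lyric chunks and the name "Jenny" are hypothetical, and this only helps if the model was actually trained with such names in its descriptions.

```python
import numpy as np
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

repo = "parler-tts/parler_tts_mini_v0.1"  # swap in the singing fine-tune checkpoint
model = ParlerTTSForConditionalGeneration.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Hypothetical lyric chunks; the key point is reusing the same name in every description.
lyric_chunks = ["First verse of the song goes here", "Second verse of the song goes here"]
description = "Jenny sings a slow, emotional melody with very clear audio."

audio_chunks = []
for chunk in lyric_chunks:
    input_ids = tokenizer(description, return_tensors="pt").input_ids
    prompt_input_ids = tokenizer(chunk, return_tensors="pt").input_ids
    with torch.no_grad():
        generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_chunks.append(generation.cpu().numpy().squeeze())

sf.write("song.wav", np.concatenate(audio_chunks), model.config.sampling_rate)
```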