Samples of finetuned LoRAs
If possible, could you share audio samples of the LoRA finetunes?
Thank you for your time!
Sure, I keep forgetting to provide samples / a demo page for them.
However, I do need to find some speakers the model has not trained against at all, since most of the speakers I do have already perform decently-ish with the base model alone.
I should have a pool of speakers I culled from my dataset ages ago that I can probably make some LoRAs of. It shouldn't be too much of a pain to transcribe + process them tonight.
Quite a few unforeseen problems:
- Under my 7900XTX, I can't transcribe things with WhisperX; something to do with CTranslate2 needing a ROCm variant, and that variant throws errors about missing Intel OpenMP libs I can't seem to source (a possible CPU fallback is sketched after this list). So I'm pretty much stuck with having to dig up remnants from my old dataset from a year ago. I have transcriptions of some Cyberpunk 2077 characters, so I'm using those to demonstrate LoRAs for speakers the model has not seen.
- Creating demo pages is actually quite the chore, so I settled for just grabbing evaluation / validation outputs during LoRA training (I forgot I had done this before anyway), so these outputs are from that. This really only benefits the Cyberpunk 2077 LoRAs I trained earlier, as their validation outputs do demonstrate typical use, where the input transcriptions are outside the trained dataset. The other LoRAs are either underbaked and lack the "any voice as input will output the target voice the LoRA was trained on" property, or don't sound as good as they could during normal inference use.
- As an aside, I do need to delve more into that property, since I want to see whether it only ignores how the input audio prompt's voice sounds while trying to retain prosody and what-not, or whether it ignores the prompt completely.
- The LoRAs are a bit sloppy. The Cyberpunk 2077 LoRAs are flawed with improperly trimmed utterances (they were sourced from an hour+ long YouTube video, and the timestamps I sliced against a year ago were too tight), so utterances for those get cut off a bit at the end (a slicing fix is also sketched below). It's strange, since I would have expected that not to be an issue: it's just a LoRA, and the base model is still intact.
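
For reference, a minimal sketch of the CPU fallback I could try for WhisperX, sidestepping the ROCm CTranslate2 build entirely; the model size, compute type, and filename here are assumptions, not what I actually ran:

```python
# Rough sketch: run WhisperX on CPU with int8 quantization so the
# ROCm CTranslate2 variant is never needed. Slower, but it transcribes.
import whisperx

model = whisperx.load_model("large-v2", "cpu", compute_type="int8")
audio = whisperx.load_audio("utterance.wav")  # placeholder filename
result = model.transcribe(audio, batch_size=4)

for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} -> {segment["end"]:.2f}] {segment["text"]}')
```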
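
And a sketch of fixing the too-tight timestamps by padding the end of each slice; the pad length and the use of torchaudio are assumptions, not tuned values:

```python
# Sketch: slice utterances with a little slack at the end so trailing
# phonemes aren't clipped. The 0.25s pad is a guess, not a tuned number.
import torchaudio

waveform, sample_rate = torchaudio.load("source_audio.wav")  # placeholder
END_PAD = 0.25  # seconds of slack appended to each slice

def slice_utterance(start_sec: float, end_sec: float):
    start = int(start_sec * sample_rate)
    end = min(int((end_sec + END_PAD) * sample_rate), waveform.shape[-1])
    return waveform[..., start:end]
```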
In the meantime, I'll provide the sample outputs for the Cyberpunk 2077 LoRAs, as they demonstrate LoRAs the best despite being half-baked: Cyberpunk.zip
It probably would've been better for me to just play around in the web UI and pick some outputs that sound fine.
Tomorrow (or Friday) I'll see about:
- adding a "LoRA" mode for `vall_e.demo` and `vall_e.train --eval`, where both will source input text transcriptions from the validation dataset but input audio prompts from the training dataset, to give a better representation of real-world use (a rough sketch follows this list).
- `vall_e.demo` will also handle comparing between un-LoRA'd output and LoRA'd output.
- the evaluation / validation pass during training would also benefit from logging its loss better, but that actually wouldn't work, as there's no reference waveform to naively compare against (I would need a different metric instead; one candidate is also sketched below).
- getting better sources for LoRAs to demonstrate against.
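
As a rough illustration of that "LoRA" mode, the pairing would look something like this; the dataset accessors here are hypothetical, not the actual `vall_e` internals:

```python
# Hypothetical sketch of the pairing: text from the validation split,
# audio prompts from the training split, so outputs reflect real-world
# use (unseen transcriptions, in-domain voices). Accessors are assumptions.
import random

def build_demo_pairs(train_samples, val_samples, count=16):
    pairs = []
    for sample in random.sample(val_samples, count):
        prompt = random.choice(train_samples)
        pairs.append({
            "text": sample["text"],           # transcription the model hasn't trained on
            "audio_prompt": prompt["audio"],  # voice the LoRA has trained on
        })
    return pairs
```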
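
And one candidate for that different metric, sketched with Resemblyzer speaker embeddings (the library choice is my assumption): since there's no reference waveform, compare the generated output against the target speaker's clips instead.

```python
# Sketch: reference-free speaker similarity via Resemblyzer embeddings.
# Embeddings are L2-normalized, so the dot product is cosine similarity.
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np

encoder = VoiceEncoder()
target = encoder.embed_utterance(preprocess_wav("target_speaker_clip.wav"))
output = encoder.embed_utterance(preprocess_wav("lora_output.wav"))
print(f"speaker similarity: {float(np.dot(target, output)):.3f}")
```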
Alright, much better workflow now. I don't feel like I'm losing my sanity nearly as much as yesterday while trying to get samples cobbled together. Although:
- it does take a bit of time to generate the demo utterances, since unlike `vall_e.train --eval`, the demo page doesn't batch.
- combining demo pages is a bit of a chore, but not the end of the world.
- re-processing the original demo page samples seems to have introduced regressions, which is strange, since it shouldn't be performing this poorly.
- I still can't get the System Shock 2 / SHODAN LoRA to perform decently, despite getting a decent output the other day without much trouble.
Samples for LoRAs I have should be available here: https://vall-e-demo.ecker.tech/loras.html
I'll see if I can get more speakers trained and sampled, although I don't really have ideas for speakers that aren't already in the training dataset.
Thank you, as always I appreciate your detailed documentation.