
Problem reproducing the quality of the speech from the web UI

Open frank613 opened this issue 6 months ago • 3 comments

Hi Ecker,

I managed to generate very decent TTS samples from your web UI (https://huggingface.co/spaces/ecker/vall-e). At first I assumed the space was running the v2 reference models (nemo-smaller/larger-44khz-llama-8). However, I failed to generate comparable samples locally with either v2 model. I tried many different combinations of sampling parameters, but the generated speech is still either silent, low volume, or incomplete. Now I suspect it might actually be the v1 model running behind the web UI, since I can generate good samples locally with that one using a similar script. Could you please tell me which model the web UI uses?

If the web UI is indeed running a v2 model, then I must have made some mistake in my script, which I'll paste here. The function I use to generate speech is analogous to the corresponding part of the "inference.py" file.

from pathlib import Path
import logging

from vall_e.emb import qnt  # assumed: vall_e's codec helpers, providing decode_to_file

_logger = logging.getLogger(__name__)

def get_tts_results(model, text_in, prop_in, lang, device, reps_in, score_masked_only=False, out_path=None):
    # first pass: predict the output duration via the "len" task
    input_kwargs = dict(
        phns_list=[text_in],
        raw_text_list=None,
        proms_list=[prop_in],
        lang_list=[lang],
    )
    len_list = model(**input_kwargs, task_list=["len"])

    # second pass: generate codes via the "tts" task, conditioned on that length
    input_kwargs = dict(
        phns_list=[text_in],
        proms_list=[prop_in],
        lang_list=[lang],
        task_list=["tts"],
        len_list=len_list,
        disable_tqdm=False,
        use_lora=True,
    )
    for i in range(3):
        sampling_kwargs = {"temperature": 1, "cfg_strength": 3, "max_steps": 50}
        resps_list_out = model(**input_kwargs, **sampling_kwargs)
        # decode each returned code sequence back to a waveform
        # (out_path is expected to end with a path separator here)
        Path(out_path).mkdir(parents=True, exist_ok=True)
        for idx, reps_out in enumerate(resps_list_out):
            wav, sr = qnt.decode_to_file(reps_out, out_path + f"test-full-{i}.wav", device=device)
    _logger.info("decoding done")

I checked the inputs to those models and they are really in good shape (at least they work for the v1 model). Please point out anything you can spot going wrong here for v2.

Thanks, Xinwei

frank613 avatar May 21 '25 16:05 frank613

I managed to generate very decent TTS samples from your web-UI (https://huggingface.co/spaces/ecker/vall-e)

Huh, I could have sworn the HF space was in a degraded state. Last I checked there was some error about the ZeroGPU runtime or some such. I guess it fixed itself.

At first I assumed the space was running the v2 reference models (nemo-smaller/larger-44khz-llama-8). Could you please tell me which model the web UI uses?

By default, if no model is requested, it'll download and load the "reference" ar+nar-len-llama-8 24KHz EnCodec-based model. I think the simplest solution is for me to spin up a second space that defaults to the nvidia/audio-codec-44khz-branded model, since HuggingFace Spaces are a bit agonizing to get to cooperate.

However, I failed to generate comparable samples locally with either v2 model. I tried many different combinations of sampling parameters, but the generated speech is still either silent, low volume, or incomplete.

This is something that I can't really elucidate as I can't recall much on PyTorch-setup-specific problems beyond it seeming to be tied to the specific PyTorch/ROCm version I had set up. I think it was some flavor of stable 2.6.x that had a broken attention mechanism(s), and flash_(sdpa) was the only one that "worked". If I recall right, the nightly version at the time had a different set of problems. Setting up a fresh venv for PyTorch/ROCm 2.7.0 (stable) seemed to magically fix whatever issues I remember having.

  • I can't recall if the typical PyTorch/CUDA releases had any quirks but I wouldn't be surprised if I did have to chase some gremlins pertaining to it.
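
For anyone wanting to rule out a broken attention backend on their own setup, plain PyTorch can exercise each scaled-dot-product-attention kernel in isolation. This is a generic sketch (assuming PyTorch 2.3+ for torch.nn.attention), not anything vall-e-specific:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

device = "cuda" if torch.cuda.is_available() else "cpu"
q = k = v = torch.randn(1, 8, 128, 64, device=device, dtype=torch.bfloat16)

# a kernel that is broken on a given PyTorch/ROCm build tends to either
# raise here or return non-finite/garbage values
for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH):
    try:
        with sdpa_kernel(backend):
            out = F.scaled_dot_product_attention(q, k, v)
        print(backend, "finite output:", torch.isfinite(out).all().item())
    except RuntimeError as err:
        print(backend, "unavailable:", err)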

However, I think the latest weights for the larger variant are inherently broken from doing some experimental training that I think I erroneously uploaded. The prior commit I think worked better in my brief LoRA testing last night, but the revision prior to that might be better.

The smaller variant is probably botched since it was erroneously trained under the aforementioned PyTorch/ROCm setup with the bugged attention mechanism. When I get the time and mental faculties I'll evaluate it and see about "fixing" it.

My "good-enough" settings yesterday seemed to be just loading the model under bfloat16 with sdpa attention. Sampler settings to their defaults in the web UI, but sometimes checking masked tokens only + remask (or whatever I called them).

Please point out anything you can spot going wrong here for v2.

At a glance it seems right. The "interface" for the v1/v2 models shouldn't change.

I would say that the only problem(s) would be:

  • ensuring the inputs you pass through are properly tokenized
  • for loading the v2, ensure the model is loaded through --model="./path/to/the/fp32.sft".
    • to shortcut this, ~~some nasty injection into sys.argv before importing anything from vall_e is easier than remembering to manually pass this when running the script.~~ export VALLE_DEFAULT_MODEL_NAME=nemo-larger-44khz-llama-8.sft (see the sketch below)
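
As a side note on that shortcut: the environment variable can also be set from inside a script, as long as it happens before anything from vall_e is imported. A minimal sketch, assuming the config reads the variable at import/load time (the weight name is just the example from above):

import os

# must be set before any vall_e import so the default-model override is picked up
os.environ["VALLE_DEFAULT_MODEL_NAME"] = "nemo-larger-44khz-llama-8.sft"

import vall_e  # noqa: E402 -- deliberately imported after the override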

Whenever I get a chance I can see about validating the model works when interfaced per your script. I typically stick with using the TTS class exposed through vall_e.inference since it handles the finer details.
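
For reference, going through that class looks roughly like the sketch below. The constructor/inference keyword names here are from memory and should be treated as assumptions; the repo's CLI entry point is the authoritative example of how the class is driven:

from vall_e.inference import TTS

# assumption: TTS accepts a path to the weights/config plus device/dtype overrides
tts = TTS(config="./path/to/the/fp32.sft", device="cuda", dtype="bfloat16")

# assumption: inference() takes the text, reference clip(s), and an output path
tts.inference(
    text="Hello world.",
    references=["./path/to/reference.wav"],
    out_path="./output.wav",
)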


I've spun up another space that will default to the nvidia/audio-codec-44khz larger branded model: https://huggingface.co/spaces/ecker/vall-e-44khz although the caveat is it's using the possibly worse model. I'll need to do some evals again on what I feel is the better revision.

e-c-k-e-r avatar May 21 '25 17:05 e-c-k-e-r

Hi,

I think I got better results locally now with "masked tokens only + remask" checked. Thanks a lot for the suggestions.

But the quality is still not comparable to the v1 model. That's fine, because there is always room to improve 👍. I am actually using your model to do some speech assessment research. The idea of FSQ is quite attractive to me for my research because of the way they split the mel-spectrogram. I will keep watching this great work!

Some notes:

bfloat16 with sdpa attention

I load the config from the reference model "fp32.sft", and I checked that cfg.weights_name="fp32" indeed. Do you mean one needs to do a weights conversion or something?

ensuring your inputs passed through are properly tokenized

Actually, I ran into a trap a few weeks ago when I used all-lowercase text input. The reason seems to be that for lowercase input (with no punctuation), the tokenizer produces no "<unk>" tokens against its vocabulary, and as a result the "g2p" step is skipped. It is the same with the web UI as well. I think it's worth adding a warning or something.

def encode_text(text, language="auto", precheck=True, phonemize=True ):
    # already a tensor, return it
    if isinstance( text, torch.Tensor ):
        return text
    # check if the text tokenizes without any unks (for example, if already-phonemized text is passed)
    if precheck and "<unk>" in phn_symmap:
        tokens = tokenize( text )
        if phn_symmap["<unk>"] not in tokens:
            return torch.tensor( tokens )

    if not phonemize:
        return torch.tensor( text_tokenize( text ) )

    return torch.tensor( tokenize( g2p.encode(text, language=language) ) )
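
To make that failure mode concrete, here is a self-contained toy version of the precheck above, with stand-in phn_symmap/tokenize (not the real ones): if every character of an all-lowercase input happens to be in the phoneme vocabulary, no <unk> is produced and the g2p branch is never reached.

# toy stand-ins for phn_symmap / tokenize, purely to illustrate the control flow
phn_symmap = {"<unk>": 0}
phn_symmap.update({ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")})

def tokenize(text):
    # anything outside the vocab (uppercase, punctuation, etc.) becomes <unk>
    return [phn_symmap.get(ch, phn_symmap["<unk>"]) for ch in text]

def precheck_skips_g2p(text):
    # mirrors the precheck: no <unk> in the tokens means the raw tokens are returned and g2p never runs
    return phn_symmap["<unk>"] not in tokenize(text)

print(precheck_skips_g2p("hello world"))   # True  -> lowercase text slips past, g2p is skipped
print(precheck_skips_g2p("Hello world!"))  # False -> <unk> appears, so g2p would run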

Thanks!!

frank613 avatar May 22 '25 17:05 frank613

But the quality is still not comparable to the v1 model.

Correct. I think the "reference" model will always have an inherent advantage simply from having a gargantuan amount of training time.

There's also a slight possibility that nvidia/audio-codec-44khz has some inherent quality problems. During my LoRA tests, I noticed the decoded reference clips didn't sound as crisp as the original clips. However, I didn't get to look into it much further, and it could be some weird issue with it being from my ROCm system (as it's dreadfully slow when doing anything through the codec model).

That's fine, because there is always room to improve 👍

I hope. I just worry that there's a problem inherent to the codec that will make further investment in it all for naught. Descript Audio Codec at least reproduced its inherent flaw quickly enough for me to shelve it, but nvidia/audio-codec-44khz seems to let me down more and more.

Although I suppose it shouldn't hurt to pivot back to EnCodec, even if it does hurt having to give up on the 44KHz dream. EnCodec has practically proven itself at least, but I feel it's a bit silly to train a model on the new implementation for it when the reference model on the old implementation is adequate.

  • or if there's an FSQ 24KHz codec. I'm starting to believe 44KHz is too dense for it to serve as the medium for a language model. To my knowledge there still isn't a viable codec-based language model that operates on 44KHz.

I am actually using your model to do some speech assessment research. The idea of FSQ is quite attractive to me for my research because of the way they split the mel-spectrogram. I will keep watching this great work!

As I think I mentioned before, feel free to raise any questions. Despite the model(s) not directly performing to my expectations, I feel at the very least this repo can have some merit through sharing my notes and observations on it. I haven't kept up with much on the literature side of it, but I wouldn't be surprised if it's still lacking.

I load the config from the reference model "fp32.sft", and I checked that cfg.weights_name="fp32" indeed. Do you mean one needs to do a weights conversion or something?

Oh right, the misnomer filenames. Despite the filename being fp32.sft, it doesn't really have any bearing. If I remember right, cfg.weights_name just governs what filename to try and load implicitly and export as. It shouldn't matter when explicitly setting a model path, as long as the extension is either .sft or .pth.

The weights themselves on HuggingFace should already be bfloat16 tensors, but you can specify what dtype to cast to and operate under either:

  • web UI: the Settings tab when loading a model
  • from a YAML/cfg: through cfg.inference.weight_dtype (string) or cfg.inference.dtype (torch.dtype, like torch.bfloat16)
  • CLI argument/sys.argv: --dtype=bfloat16
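
For the YAML/cfg route in particular, a minimal Python sketch; the import path of the global cfg object (vall_e.config) and the weight_dtype-to-dtype relationship are assumptions here:

from vall_e.config import cfg  # assumption: the global config object referenced above

cfg.inference.weight_dtype = "bfloat16"  # string form from the list above
print(cfg.inference.dtype)               # assumption: resolves to torch.bfloat16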

It shouldn't really matter for inferencing, since there doesn't seem to be any consistent difference when inferencing under bfloat16 and float16, but training seems to slowly degrade under bfloat16, for who knows what reason at this point.

when I used all-lowercase text input. The reason seems to be that for lowercase input (with no punctuation), the tokenizer produces no "<unk>" tokens against its vocabulary, and as a result the "g2p" step is skipped

Oh, I see the issue. I should be able to replace that check now that I have a vocab of all expected un-phonemized text. I'll try and push a fix when I get the chance later today.

Somewhat related, but I do recall ages ago there being some degradation in the reference model's output if you don't terminate a sentence with punctuation (blah blah something with the attention heads expecting a punctuating token before the EOS token blah blah). I'm not sure if it still does that for either model.

e-c-k-e-r avatar May 22 '25 18:05 e-c-k-e-r