vall-e
Hello. A question about training: is forced alignment of phonemes to audio necessary before audio encoding?
Or does the LM handle alignment through self-attention? I read in the VALL-E paper that they use forced-alignment tools, but I don't see anything in the code.