Align Larger Audio File
Hi @cschaefer26, you have done a nice job; I'm using your repo. While aligning a larger audio file (> 1 minute) with its character (phone) sequence at inference time, the number of predicted values in the duration file (.npy file) does not match the number of characters (phones) that I input with the audio file. What is the problem here? I want to use the pretrained model (trained on a Bangla dataset [audio, phoneme sequence]) for phoneme duration prediction, so accuracy is a major concern for me.
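As a quick sanity check before digging further, it helps to confirm the length mismatch explicitly. A minimal sketch (the function name `durations_match` and the example values are my own, not from the repo):

```python
import numpy as np

def durations_match(durations, phonemes):
    """Return True if the model emitted exactly one duration per input phoneme."""
    return len(durations) == len(phonemes)

# hypothetical case: 5 input phones but only 4 predicted durations
durs = np.array([3, 7, 2, 5])
phones = ['a', 'b', 'c', 'd', 'e']
print(durations_match(durs, phones))  # → False
```

If this prints False for your file, some input symbols were dropped somewhere between the text and the model.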
Note: during training I used 10-15 second audio files with their corresponding transcriptions (phoneme sequences), and I customized your code (preprocess.py and extract_durations.py) to run inference on a single audio file and its transcription.
Hi, did you ensure that all the audio files were preprocessed before training? The preprocessing builds up a phoneme set from the training data, so I'd suspect you are applying the model to new files containing unknown phonemes that get filtered out (that's just a guess).
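You can test this guess directly by comparing the new transcription against the symbol set built during preprocessing. A hedged sketch (the helper `find_unknown` and the example symbols are hypothetical, not part of the repo's API):

```python
def find_unknown(phonemes, known_set):
    """List the phonemes that a tokenizer built on `known_set` would silently drop."""
    return [p for p in phonemes if p not in known_set]

# hypothetical phoneme set collected from the training data
known = {'a', 'b', 'k'}
seq = ['a', 'x', 'b', 'q']
print(find_unknown(seq, known))  # → ['x', 'q']
```

Any symbols reported here are exactly the ones missing from the predicted duration array.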
Hi @cschaefer26, your guess is correct: I applied the model to new files containing unknown phonemes. Thanks for your reply. However, when I align an audio file that contains intermediate silences (which are actually inherent) with its phoneme sequence, the accuracy of the predicted phone durations is quite low, as the intermediate silence gets merged into the neighbouring phones' durations. Any suggestions, please?
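One common workaround (a sketch of a general technique, not something the repo necessarily provides) is to put an explicit silence symbol into the phoneme sequence, e.g. between words, so the aligner can assign silent frames to that token instead of stretching the surrounding phones. The `<sil>` symbol here is hypothetical and would also have to be present in the training phoneme set:

```python
PAUSE = '<sil>'  # hypothetical silence token; must exist in the training symbol set

def add_pauses(words):
    """Insert an explicit silence token between words so silent frames
    have a symbol of their own to be aligned to."""
    seq = []
    for i, word in enumerate(words):
        seq.extend(word)
        if i < len(words) - 1:
            seq.append(PAUSE)
    return seq

print(add_pauses([['b', 'a'], ['k', 'a']]))  # → ['b', 'a', '<sil>', 'k', 'a']
```

For this to help, the training transcriptions would need the same silence markup, so the model learns to predict durations for the pause token.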