charsiu
How does the type of audio affect the Charsiu predictive aligner? My audio is in WAV format.
The training data is mostly clean speech, so non-speech audio effects will probably make it more error-prone. You can test it yourself.
When I use my own audio files, I get a sampling rate assertion error and a size error: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 1046752, 2]
The sampling rate should be 16 kHz and the audio should be mono. The trailing 2 in the input size suggests that your audio has two channels.
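To confirm this before running the aligner, you can read the channel count and sampling rate from the WAV header. This is a minimal sketch using only Python's standard `wave` module; the helper names `wav_format` and `meets_requirements` are made up for illustration and are not part of Charsiu. Note that `wave` only reads PCM-encoded WAV files.

```python
import wave


def wav_format(path):
    """Return (num_channels, sample_rate) read from a PCM WAV file header."""
    with wave.open(path, "rb") as f:
        return f.getnchannels(), f.getframerate()


def meets_requirements(path, expected_sr=16000):
    """Return True if the file is mono and sampled at expected_sr Hz."""
    channels, sr = wav_format(path)
    return channels == 1 and sr == expected_sr
```

A stereo recording will report 2 channels here, which matches the trailing 2 in the size from the error message.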
The audio is multi-channel because the data was not prepared in a lab; it was recorded directly with a microphone. How can I convert it to mono with a transformation?
You can check out librosa: https://librosa.org/doc/main/generated/librosa.load.html
When it predicts phonemes, do I need to add print statements to print the corresponding words in the audio?
It only gives you phonemes, not words.