How does the type of audio affect the charsiu predictive aligner? The audio is in WAV format.

Open DIKSHAAGARWAL2015 opened this issue 2 years ago • 7 comments

Nov 02 '22 23:11 DIKSHAAGARWAL2015

The training data is mostly clean speech, so non-speech audio effects will probably make the aligner even more error-prone. You can test it yourself.

Nov 03 '22 02:11 lingjzhu

When I use my own audio files, I get a sampling-rate assertion and a size error: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 1046752, 2]

Nov 03 '22 14:11 DIKSHAAGARWAL2015

The sampling rate should be 16k and the audio should be mono channel. The input size seems to suggest that your audio is a multi-channel one.
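A quick way to confirm both properties before running the aligner is to read the WAV header. This is only a sketch using Python's standard wave module, which handles uncompressed PCM WAV files; the function names describe_wav and is_aligner_ready are made up for illustration and are not part of charsiu.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate()


def is_aligner_ready(path, expected_rate=16000):
    """True if the file is mono and sampled at expected_rate (16 kHz here)."""
    channels, rate = describe_wav(path)
    return channels == 1 and rate == expected_rate
```

A file recorded in stereo at 44.1 kHz would report (2, 44100) and fail the check, which would be consistent with the trailing 2 in the size error above.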

Nov 03 '22 14:11 lingjzhu

The audio is multi-channel because the data was not prepared in a lab; it was recorded directly with a microphone. How can I ensure it is mono by applying a transformation?

Nov 03 '22 14:11 DIKSHAAGARWAL2015

You can check out librosa: https://librosa.org/doc/main/generated/librosa.load.html
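As a rough sketch of what that could look like: librosa.load accepts sr and mono arguments, so it can resample to 16 kHz and downmix in one call. The helper names to_mono and load_for_aligner are hypothetical, and to_mono assumes an array shaped (samples, channels), which is what the trailing 2 in the error suggests.

```python
import numpy as np

TARGET_SR = 16000  # sampling rate the aligner is said to expect in this thread


def to_mono(samples):
    """Average the channels of a (n_samples, n_channels) array into 1-D mono.

    Assumption: channels are on the last axis, as in the reported
    input size [1, 1, 1046752, 2].
    """
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 1:
        return samples  # already mono
    return samples.mean(axis=1)


def load_for_aligner(path):
    """Load a file as mono float audio resampled to TARGET_SR with librosa."""
    import librosa  # imported here so to_mono can be used without librosa

    audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    return audio
```

load_for_aligner returns a 1-D array; to_mono is only needed if the audio was already loaded some other way as a multi-channel array.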

Nov 03 '22 14:11 lingjzhu

When it predicts phonemes, do I need to add print statements to print the corresponding words in the audio?

Nov 09 '22 18:11 DIKSHAAGARWAL2015

It only gives you phonemes, not words.

Nov 09 '22 21:11 lingjzhu
