
How does the type of audio affect the predictions of the charsiu aligner? The audio is in WAV format.

Open DIKSHAAGARWAL2015 opened this issue 2 years ago • 7 comments

DIKSHAAGARWAL2015 avatar Nov 02 '22 23:11 DIKSHAAGARWAL2015

The training data is mostly clean speech, so non-speech audio effects will probably make the aligner even more error-prone. You can test this yourself.

lingjzhu avatar Nov 03 '22 02:11 lingjzhu

When I use my own audio files, I get a sampling assertion and a size error: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 1046752, 2]

DIKSHAAGARWAL2015 avatar Nov 03 '22 14:11 DIKSHAAGARWAL2015

The sampling rate should be 16 kHz and the audio should be mono (single-channel). The input size, with its trailing dimension of 2, seems to suggest that your audio is multi-channel.
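To check a file against these two requirements before running the aligner, a minimal sketch using only Python's standard-library `wave` module could look like the following. The helper names `describe_wav` and `meets_requirements` are hypothetical, and the sketch assumes an uncompressed PCM WAV file, which is the format `wave` reads.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnchannels(), wav_file.getframerate()


def meets_requirements(path, expected_rate=16000, expected_channels=1):
    """Check for 16 kHz mono, the format described in the comment above."""
    channels, rate = describe_wav(path)
    return channels == expected_channels and rate == expected_rate
```

A stereo recording at 44.1 kHz would come back from `describe_wav` as `(2, 44100)`, which points to the file needing conversion.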

lingjzhu avatar Nov 03 '22 14:11 lingjzhu

The audio is multi-channel because the data was not prepared in a lab; it was recorded directly with a microphone. How can I convert it to mono using transformations?

DIKSHAAGARWAL2015 avatar Nov 03 '22 14:11 DIKSHAAGARWAL2015

You can check out librosa: https://librosa.org/doc/main/generated/librosa.load.html

lingjzhu avatar Nov 03 '22 14:11 lingjzhu

When it predicts phonemes, do I need to add print statements to print the corresponding words in the audio?

DIKSHAAGARWAL2015 avatar Nov 09 '22 18:11 DIKSHAAGARWAL2015

It only gives you phonemes, not words.

lingjzhu avatar Nov 09 '22 21:11 lingjzhu