charsiu Bug in phoneme to word conversion -- duplicate words

Something seems off in how [SIL] is handled in the word-level transcriptions.

This is the first example in the LibriSpeech Test set.

Here is the true transcript:

HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT

Here is the forced-aligned word transcript. Note that "wizard" appears twice (2.01-2.41 and 2.56-2.57), with a [SIL] in between:

array([['0.0', '0.23', '[SIL]'],
       ['0.23', '0.33', 'he'],
       ['0.33', '0.65', 'began'],
       ['0.65', '0.69', 'a'],
       ['0.69', '1.21', 'confused'],
       ['1.21', '1.62', 'complaint'],
       ['1.62', '1.93', 'against'],
       ['1.93', '2.01', 'the'],
       ['2.01', '2.41', 'wizard'],
       ['2.41', '2.56', '[SIL]'],
       ['2.56', '2.57', 'wizard'],
       ['2.57', '2.63', '[SIL]'],
       ['2.63', '2.75', 'who'],
       ['2.75', '2.84', 'had'],
       ['2.84', '3.26', 'vanished'],
       ['3.26', '3.59', 'behind'],
       ['3.59', '3.66', 'the'],
       ['3.66', '4.02', 'curtain'],
       ['4.02', '4.15', 'on'],
       ['4.15', '4.23', 'the'],
       ['4.23', '4.66', 'left'],
       ['4.66', '4.89', '[SIL]']], dtype='<U32')

Here is the forced-aligned phonetic transcript. The final D of "wizard" is likewise split around a [SIL] (2.34-2.41 and 2.56-2.57):

array([['0.0', '0.23', '[SIL]'],
       ['0.23', '0.3', 'HH'],
       ['0.3', '0.33', 'IY'],
       ['0.33', '0.39', 'B'],
       ['0.39', '0.44', 'IH'],
       ['0.44', '0.53', 'G'],
       ['0.53', '0.6', 'AE'],
       ['0.6', '0.65', 'N'],
       ['0.65', '0.69', 'AH'],
       ['0.69', '0.77', 'K'],
       ['0.77', '0.81', 'AH'],
       ['0.81', '0.86', 'N'],
       ['0.86', '0.97', 'F'],
       ['0.97', '1.02', 'Y'],
       ['1.02', '1.1', 'UW'],
       ['1.1', '1.16', 'Z'],
       ['1.16', '1.21', 'D'],
       ['1.21', '1.26', 'K'],
       ['1.26', '1.3', 'AH'],
       ['1.3', '1.34', 'M'],
       ['1.34', '1.44', 'P'],
       ['1.44', '1.49', 'L'],
       ['1.49', '1.55', 'EY'],
       ['1.55', '1.58', 'N'],
       ['1.58', '1.62', 'T'],
       ['1.62', '1.66', 'AH'],
       ['1.66', '1.74', 'G'],
       ['1.74', '1.78', 'EH'],
       ['1.78', '1.84', 'N'],
       ['1.84', '1.9', 'S'],
       ['1.9', '1.93', 'T'],
       ['1.93', '1.96', 'DH'],
       ['1.96', '2.01', 'AH'],
       ['2.01', '2.1', 'W'],
       ['2.1', '2.15', 'IH'],
       ['2.15', '2.26', 'Z'],
       ['2.26', '2.34', 'ER'],
       ['2.34', '2.41', 'D'],
       ['2.41', '2.56', '[SIL]'],
       ['2.56', '2.57', 'D'],
       ['2.57', '2.63', '[SIL]'],
       ['2.63', '2.7', 'HH'],
       ['2.7', '2.75', 'UW'],
       ['2.75', '2.78', 'HH'],
       ['2.78', '2.8', 'AE'],
       ['2.8', '2.84', 'D'],
       ['2.84', '2.95', 'V'],
       ['2.95', '3.04', 'AE'],
       ['3.04', '3.09', 'N'],
       ['3.09', '3.15', 'IH'],
       ['3.15', '3.23', 'SH'],
       ['3.23', '3.26', 'T'],
       ['3.26', '3.3', 'B'],
       ['3.3', '3.35', 'IH'],
       ['3.35', '3.43', 'HH'],
       ['3.43', '3.53', 'AY'],
       ['3.53', '3.56', 'N'],
       ['3.56', '3.59', 'D'],
       ['3.59', '3.62', 'DH'],
       ['3.62', '3.66', 'AH'],
       ['3.66', '3.78', 'K'],
       ['3.78', '3.9', 'ER'],
       ['3.9', '3.93', 'T'],
       ['3.93', '3.96', 'AH'],
       ['3.96', '4.02', 'N'],
       ['4.02', '4.09', 'AA'],
       ['4.09', '4.15', 'N'],
       ['4.15', '4.19', 'DH'],
       ['4.19', '4.23', 'AH'],
       ['4.23', '4.36', 'L'],
       ['4.36', '4.47', 'EH'],
       ['4.47', '4.58', 'F'],
       ['4.58', '4.66', 'T'],
       ['4.66', '4.89', '[SIL]']], dtype='<U32')

I suspect this points to a general problem with the phoneme-to-word conversion.
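A quick way to check other utterances for the same symptom is to compare the non-silence words in the alignment against the reference transcript. The sketch below is not part of charsiu; the function name `compare_words` is hypothetical, and it assumes the alignment is a sequence of `[start, end, label]` rows as printed above, with words lowercased and silence labelled `[SIL]`.

```python
def compare_words(transcript, word_alignment, sil="[SIL]"):
    """Return (expected, aligned, match) for a transcript and a word alignment.

    Hypothetical helper: assumes rows of [start, end, label] and that the
    aligner lowercases words, as in the output shown above.
    """
    expected = transcript.lower().split()
    # Drop silence rows and keep only the word labels, in order.
    aligned = [label for _, _, label in word_alignment if label != sil]
    return expected, aligned, expected == aligned


if __name__ == "__main__":
    rows = [
        ["1.93", "2.01", "the"],
        ["2.01", "2.41", "wizard"],
        ["2.41", "2.56", "[SIL]"],
        ["2.56", "2.57", "wizard"],
        ["2.57", "2.63", "[SIL]"],
        ["2.63", "2.75", "who"],
    ]
    expected, aligned, match = compare_words("THE WIZARD WHO", rows)
    print(expected, aligned, match)
```

A mismatch only flags the utterance; it does not say which row is the spurious one.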

Jul 19 '22 23:07 jhkonan

Hi,

Thank you! This is indeed a bug in the phoneme-to-word conversion. The model performs silence detection and alignment at the same time: phones and words are first aligned to the non-silent audio frames and then merged with the silent frames. The error occurs when silence is detected within a word, which breaks the merge of non-silent and silent frames. It did not seem to occur very frequently in my earlier tests. I will try to improve it when I have more time. Sorry for the bug.
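In the meantime, the word-level output could be cleaned up after the fact. The following is a minimal post-processing sketch, not charsiu's own code and not the planned fix. The function name `merge_split_words` is hypothetical. It assumes `[start, end, label]` rows, a lowercased reference transcript, and that a repeated word which does not match the next expected transcript word is a fragment of the previous word.

```python
def merge_split_words(word_alignment, transcript, sil="[SIL]"):
    """Merge word fragments created by a silence detected inside a word.

    Hypothetical workaround: walks the aligned words against the transcript.
    A word that repeats the previous word but is not the next expected
    transcript word is folded into the previous word's interval.
    """
    expected = transcript.lower().split()
    merged = []            # output rows: [start (float), end (float), label]
    last_word_idx = None   # index in `merged` of the most recent word row
    next_expected = 0      # position of the next unmatched transcript word

    for start, end, label in word_alignment:
        start, end = float(start), float(end)
        if label == sil:
            merged.append([start, end, label])
            continue

        is_expected = (next_expected < len(expected)
                       and label == expected[next_expected])
        is_repeat = (last_word_idx is not None
                     and merged[last_word_idx][2] == label)

        if is_repeat and not is_expected:
            # Fragment of the previous word: absorb the in-word silence
            # rows and extend the previous word to this fragment's end.
            del merged[last_word_idx + 1:]
            merged[last_word_idx][1] = end
            continue

        merged.append([start, end, label])
        last_word_idx = len(merged) - 1
        next_expected += 1

    return merged
```

On the example above this turns the two "wizard" rows into a single row spanning 2.01 to 2.57 and leaves the following [SIL] and "who" untouched. Absorbing the in-word silence into the word is a design choice; genuinely repeated words in the transcript are kept because they match the next expected word.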

Jul 20 '22 02:07 lingjzhu

Any updates?

Nov 12 '23 11:11 teinhonglo
