charsiu
Bug in phoneme to word conversion -- duplicate words
Something seems to be wrong with how [SIL] is handled in the word transcriptions: a word is duplicated when silence is detected inside it.
This is the first example in the LibriSpeech Test set.
Here is the true transcript:
HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT
Here is the forced aligned word transcript:
array([['0.0', '0.23', '[SIL]'],
['0.23', '0.33', 'he'],
['0.33', '0.65', 'began'],
['0.65', '0.69', 'a'],
['0.69', '1.21', 'confused'],
['1.21', '1.62', 'complaint'],
['1.62', '1.93', 'against'],
['1.93', '2.01', 'the'],
['2.01', '2.41', 'wizard'],
['2.41', '2.56', '[SIL]'],
['2.56', '2.57', 'wizard'],
['2.57', '2.63', '[SIL]'],
['2.63', '2.75', 'who'],
['2.75', '2.84', 'had'],
['2.84', '3.26', 'vanished'],
['3.26', '3.59', 'behind'],
['3.59', '3.66', 'the'],
['3.66', '4.02', 'curtain'],
['4.02', '4.15', 'on'],
['4.15', '4.23', 'the'],
['4.23', '4.66', 'left'],
['4.66', '4.89', '[SIL]']], dtype='<U32')
Here is the forced aligned phonetic transcript:
array([['0.0', '0.23', '[SIL]'],
['0.23', '0.3', 'HH'],
['0.3', '0.33', 'IY'],
['0.33', '0.39', 'B'],
['0.39', '0.44', 'IH'],
['0.44', '0.53', 'G'],
['0.53', '0.6', 'AE'],
['0.6', '0.65', 'N'],
['0.65', '0.69', 'AH'],
['0.69', '0.77', 'K'],
['0.77', '0.81', 'AH'],
['0.81', '0.86', 'N'],
['0.86', '0.97', 'F'],
['0.97', '1.02', 'Y'],
['1.02', '1.1', 'UW'],
['1.1', '1.16', 'Z'],
['1.16', '1.21', 'D'],
['1.21', '1.26', 'K'],
['1.26', '1.3', 'AH'],
['1.3', '1.34', 'M'],
['1.34', '1.44', 'P'],
['1.44', '1.49', 'L'],
['1.49', '1.55', 'EY'],
['1.55', '1.58', 'N'],
['1.58', '1.62', 'T'],
['1.62', '1.66', 'AH'],
['1.66', '1.74', 'G'],
['1.74', '1.78', 'EH'],
['1.78', '1.84', 'N'],
['1.84', '1.9', 'S'],
['1.9', '1.93', 'T'],
['1.93', '1.96', 'DH'],
['1.96', '2.01', 'AH'],
['2.01', '2.1', 'W'],
['2.1', '2.15', 'IH'],
['2.15', '2.26', 'Z'],
['2.26', '2.34', 'ER'],
['2.34', '2.41', 'D'],
['2.41', '2.56', '[SIL]'],
['2.56', '2.57', 'D'],
['2.57', '2.63', '[SIL]'],
['2.63', '2.7', 'HH'],
['2.7', '2.75', 'UW'],
['2.75', '2.78', 'HH'],
['2.78', '2.8', 'AE'],
['2.8', '2.84', 'D'],
['2.84', '2.95', 'V'],
['2.95', '3.04', 'AE'],
['3.04', '3.09', 'N'],
['3.09', '3.15', 'IH'],
['3.15', '3.23', 'SH'],
['3.23', '3.26', 'T'],
['3.26', '3.3', 'B'],
['3.3', '3.35', 'IH'],
['3.35', '3.43', 'HH'],
['3.43', '3.53', 'AY'],
['3.53', '3.56', 'N'],
['3.56', '3.59', 'D'],
['3.59', '3.62', 'DH'],
['3.62', '3.66', 'AH'],
['3.66', '3.78', 'K'],
['3.78', '3.9', 'ER'],
['3.9', '3.93', 'T'],
['3.93', '3.96', 'AH'],
['3.96', '4.02', 'N'],
['4.02', '4.09', 'AA'],
['4.09', '4.15', 'N'],
['4.15', '4.19', 'DH'],
['4.19', '4.23', 'AH'],
['4.23', '4.36', 'L'],
['4.36', '4.47', 'EH'],
['4.47', '4.58', 'F'],
['4.58', '4.66', 'T'],
['4.66', '4.89', '[SIL]']], dtype='<U32')
I suspect this may indicate a general problem with the phoneme to word conversion.
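To check how often this happens across a dataset, a scan like the one below could flag the pattern. It is only a sketch: `find_split_words` is a hypothetical helper, not part of charsiu, and it assumes the word alignment is a sequence of (start, end, label) rows like the arrays printed above.

```python
def find_split_words(word_alignment, sil="[SIL]"):
    """Return each index i where rows i and i+2 carry the same word
    and row i+1 is silence (the pattern seen in this issue).

    Hypothetical helper, not a charsiu function.
    """
    hits = []
    for i in range(len(word_alignment) - 2):
        first, middle, last = (row[2] for row in word_alignment[i:i + 3])
        if first != sil and middle == sil and first == last:
            hits.append(i)
    return hits


# Rows copied from the word alignment in this issue.
words = [
    ("2.01", "2.41", "wizard"),
    ("2.41", "2.56", "[SIL]"),
    ("2.56", "2.57", "wizard"),
    ("2.57", "2.63", "[SIL]"),
    ("2.63", "2.75", "who"),
]
flagged = find_split_words(words)
print(flagged)
```

A genuine repetition separated by a pause would also be flagged, so the hits would still need a check against the reference transcript.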
Hi,
Thank you! This is indeed a bug in the phoneme-to-word conversion. The model performs silence detection and alignment at the same time: the phones and words are first aligned to the non-silent audio frames and then merged with the silent frames. The error occurs when silence is detected within a word, which breaks the merge of non-silent and silent frames. It did not seem to occur very frequently in my earlier tests, but it is a real problem. I will try to fix it when I have more time. Sorry for the bug.
Any updates?
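Until the merge step itself is fixed, a post-processing pass on the word output might serve as a stopgap. The sketch below is not charsiu's code: `merge_split_words` is a hypothetical helper, and it assumes rows of (start, end, label) and that a word followed by [SIL] and then the same word is a split word rather than a real repetition.

```python
def merge_split_words(word_alignment, sil="[SIL]"):
    """Collapse 'word, [SIL], same word' runs into a single word row.

    Hypothetical workaround, not a charsiu function. The merged row
    spans from the start of the first piece to the end of the last,
    so the silence inside the word is absorbed into the word.
    """
    rows = [tuple(row) for row in word_alignment]
    merged = []
    i = 0
    while i < len(rows):
        start, end, label = rows[i]
        # Absorb every following "[SIL], same word" pair.
        while (
            label != sil
            and i + 2 < len(rows)
            and rows[i + 1][2] == sil
            and rows[i + 2][2] == label
        ):
            end = rows[i + 2][1]
            i += 2
        merged.append((start, end, label))
        i += 1
    return merged


# Rows copied from the word alignment in this issue.
split_rows = [
    ("2.01", "2.41", "wizard"),
    ("2.41", "2.56", "[SIL]"),
    ("2.56", "2.57", "wizard"),
    ("2.57", "2.63", "[SIL]"),
    ("2.63", "2.75", "who"),
]
for row in merge_split_words(split_rows):
    print(row)
```

On these rows the two "wizard" pieces become one row from 2.01 to 2.57. A speaker who really does repeat a word after a pause would be merged as well, so comparing the word count with the reference transcript is a sensible guard.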