NeuFA
phone boundary between continuous vowels
@petronny Hi, I have trained the model on a Chinese dataset successfully. But I met a problem: the boundaries between consecutive vowels are not as accurate as for other phones. For example, in "我安心的点点头", the phone boundary between "我" and "安", i.e. between "o3" and "an1", is wrong. And this kind of problem happens frequently.
For syllables like "yun1" (云) I can split into "y vn1", where "y" gets a certain duration, and for "wu2" (无) I can split into "w u2", where "w" gets a duration. But for some vowel-only syllables, for example "安/an" and "阿/a", there is really no consonant at all.
Have you found problems like this, how did you solve the problem?
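The splitting I use is roughly like this (a simplified sketch; the initials list below is partial and illustrative, and finals such as "un" vs "vn" are not rewritten here):

```python
# Partial, illustrative set of pinyin initials; multi-letter ones come first
# so the longest match wins ("zh" before "z").
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable: str):
    """Split a toned pinyin syllable into [initial, final] when possible."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]  # vowel-only syllables like "an1" have no initial

print(split_pinyin("yun1"))  # -> ['y', 'un1']
print(split_pinyin("an1"))   # -> ['an1']
```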
Well, it wouldn't be a surprise to me if NeuFA (or any other FA model) predicted some insane boundaries.
As the paper says, the 50 ms tolerance accuracy of NeuFA is 95% at the word level. That seems high, but in practice, for a sentence with, say, 20 phonemes, the probability that at least one phoneme has a predicted boundary more than 50 ms off from the ground truth is 1 - 0.95^20 = 64.15%. Similarly, the probability that at least one boundary is more than 100 ms off is 1 - 0.98^20 = 33.24%.
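As a sanity check of those numbers, assuming each boundary is within tolerance independently with probability p:

```python
# If each boundary is within tolerance with probability p, and boundaries are
# treated as independent, the chance that at least one of n boundaries is
# off is 1 - p**n.
def prob_any_bad(p_good: float, n_boundaries: int) -> float:
    return 1 - p_good ** n_boundaries

print(f"{prob_any_bad(0.95, 20):.2%}")  # 50 ms tolerance -> 64.15%
print(f"{prob_any_bad(0.98, 20):.2%}")  # 100 ms tolerance -> 33.24%
```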
Also, NeuFA currently doesn't constrain the predicted boundaries to be non-overlapping (we are working on this in NeuFA 2), which makes the situation even worse.
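As a stopgap, one naive post-processing step (not part of NeuFA, just a sketch) is to clamp each predicted interval so it starts no earlier than the previous one ends:

```python
# Naive fix-up for independently predicted (start, end) pairs: force the
# sequence to be monotonic and non-overlapping by clamping each interval
# to begin no earlier than the previous interval's end.
def make_monotonic(intervals):
    """intervals: list of (start, end) in seconds, one per phone, in order."""
    fixed = []
    prev_end = 0.0
    for start, end in intervals:
        start = max(start, prev_end)
        end = max(end, start)  # guard against end < start
        fixed.append((start, end))
        prev_end = end
    return fixed

print(make_monotonic([(0.0, 0.12), (0.10, 0.25), (0.30, 0.28)]))
# -> [(0.0, 0.12), (0.12, 0.25), (0.3, 0.3)]
```

This throws away information (a phone whose prediction is swallowed by its neighbor ends up with zero duration), so it is only a band-aid until the model itself enforces the constraint.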
So my opinion is that NeuFA is not ready for production environments yet. But NeuFA can be used as a "soft" FA model that extracts the attention weights between text and speech to map information between them. This is exactly why we proposed NeuFA and how we use it in our other research.
Hope this will answer your question.
@petronny Thank you for your reply!
- The lack of a non-overlapping constraint and the fixed threshold ("thred=0.5") make the boundaries not very clear, and the results are hard to use even though most of them are really good.
- Can you share the code "extracts the attention weights between the text and speech to map the information between them"?
the results are hard to use even though most of the results are really good.
I agree with that. We are working on the nonoverlapping issue.
Can you share the code to "extract the attention weights between the text and speech to map the information between them"?
See https://github.com/thuhcsi/NeuFA/blob/master/inference.py#L112. I mainly use the attention weights from the ASR direction.
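The general idea (a simplified sketch, not the actual code at that line; the hop length here is an assumed value) is to assign each speech frame to the text token with the highest attention weight and count frames per token:

```python
import numpy as np

# Given an ASR-direction attention matrix of shape (n_frames, n_phonemes),
# take the argmax phoneme per frame and count the frames assigned to each
# phoneme; multiplying by the hop length gives a duration per phoneme.
def durations_from_attention(w_asr: np.ndarray, hop_seconds: float = 0.0125):
    assignment = w_asr.argmax(axis=1)  # phoneme index per frame
    frames = np.bincount(assignment, minlength=w_asr.shape[1])
    return frames * hop_seconds

w = np.array([[0.9, 0.1, 0.0],
              [0.7, 0.3, 0.0],
              [0.2, 0.8, 0.0],
              [0.1, 0.4, 0.5]])
print(durations_from_attention(w))  # frame counts [2, 1, 1] times the hop
```

Note this per-frame argmax is not guaranteed to be monotonic either, which is why the discussion below moves to a monotonic extraction.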
Get it! Thank you again! @petronny
I tried w_tts and w_asr at the phone level, but the results were both bad, since the result for the first phone ("silence") of each sentence differs a lot from the ground truth; I don't know why. Then I tried weight = boundary_left - boundary_right for each phone (the weight values are about 1 in the middle of the phone and about 0 at its borders) and used the functions in https://github.com/as-ideas/DeepForcedAligner/blob/main/dfa/duration_extraction.py to extract durations. This gives a continuous, non-overlapping alignment.
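The dynamic program behind that kind of duration extraction can be sketched like this (my own simplified version of the monotonic-path idea, not the DeepForcedAligner code itself):

```python
import numpy as np

# Given a score matrix (n_frames, n_phones) where scores[t, p] is high when
# frame t lies inside phone p (e.g. the boundary_left - boundary_right weight
# described above), find the monotonic frame-to-phone assignment with maximal
# total score. The result is continuous and non-overlapping by construction.
def monotonic_durations(scores: np.ndarray) -> np.ndarray:
    n_frames, n_phones = scores.shape
    dp = np.full((n_frames, n_phones), -np.inf)
    dp[0, 0] = scores[0, 0]
    for t in range(1, n_frames):
        for p in range(n_phones):
            stay = dp[t - 1, p]                              # same phone
            advance = dp[t - 1, p - 1] if p > 0 else -np.inf  # next phone
            dp[t, p] = scores[t, p] + max(stay, advance)
    # Backtrack the best monotonic path from the last phone.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = n_phones - 1
    for t in range(n_frames - 2, -1, -1):
        p = path[t + 1]
        path[t] = p if p == 0 or dp[t, p] >= dp[t, p - 1] else p - 1
    return np.bincount(path, minlength=n_phones)  # frames per phone
```

For example, `monotonic_durations(np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]]))` assigns two frames to each of the two phones.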
Well, in fact I met a similar problem. In my experiment, the alignment is not even monotonic, which means the end time of a word can be earlier than the start time of that word. This makes it hard to use this great work in real scenarios, I think. Your idea may work; thank you.
The bad case is like this
intervals [18]:
xmin = 7.36
xmax = 7.48
text = "the"
intervals [19]:
xmin = 7.48
xmax = 7.26 # watch here
text = "assassination"
intervals [20]:
xmin = 7.71
xmax = 7.93
text = "of"
intervals [21]:
xmin = 7.95
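A quick check to flag such non-monotonic intervals (assuming they have already been parsed from the TextGrid into (xmin, xmax, text) tuples):

```python
# Flag intervals that are internally inverted (xmax < xmin) or that start
# before an earlier interval ended.
def find_bad_intervals(intervals):
    bad = []
    prev_xmax = float("-inf")
    for i, (xmin, xmax, text) in enumerate(intervals):
        if xmax < xmin or xmin < prev_xmax:
            bad.append((i, text))
        prev_xmax = max(prev_xmax, xmax)
    return bad

intervals = [(7.36, 7.48, "the"),
             (7.48, 7.26, "assassination"),  # the inverted case above
             (7.71, 7.93, "of")]
print(find_bad_intervals(intervals))  # -> [(1, 'assassination')]
```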
@panxin801 Maybe you can try weight/attn = boundary_left - boundary_right. @petronny I also found that, in terms of the statistics reported in the paper, NeuFA is much better than MFA in my experiment on a Chinese dataset. But in some cases the phone boundaries deviate very far from the ground truth. Very large errors, for example larger than 5 frames, happen more often than with MFA.
@Liujingxiu23 Yeah, I reached the same conclusion as you: the Chinese results are better than the English ones on average. And thank you for your advice.