FastSpeech2
Text and Pitch Matrices of Different Shapes
Hi,
I'm trying to train FastSpeech2 on my own data, but I'm getting the following error:
```
Traceback (most recent call last):
  File "../../code/FastSpeech2.git/model/modules.py", line 121, in forward
    x = x + pitch_embedding
RuntimeError: The size of tensor a (69) must match the size of tensor b (130) at non-singleton dimension 1
```
I've been trying to understand why I'm getting this error. If I look at FastSpeech2's `forward()`, it calls `VarianceAdaptor`'s `forward()`, which embeds the pitch and then tries to add those embeddings to the text embeddings. Here are the shapes of the tensors before and after embedding the pitch, and the shapes just before the RuntimeError:
```python
x.shape, pitch_target.shape, src_mask.shape, p_control
# (torch.Size([16, 102, 256]), torch.Size([16, 201]), torch.Size([16, 102]), 1.0)

pitch_prediction, pitch_embedding = self.get_pitch_embedding(
    x, pitch_target, src_mask, p_control
)
pitch_prediction.shape, pitch_embedding.shape
# (torch.Size([16, 102]), torch.Size([16, 201, 256]))

x.shape, pitch_embedding.shape
# (torch.Size([16, 102, 256]), torch.Size([16, 201, 256]))
x = x + pitch_embedding  # RuntimeError raised here
```
We can clearly see that my text embeddings are not the same shape as the pitch embeddings, which causes the error.
How come my text and pitch are of different shapes? I mean, those are loaded from disk and were created by FastSpeech2 and MFA.
What do I need to change, probably in my preprocessing, to get the correct shapes?
@SamuelLarkin It seems that the length of your pitch sequence is longer than the length of the phoneme sequence. Did you set `preprocessing.pitch.feature` in `preprocess.yaml` to "frame-level" while preprocessing the audio files? If not, you should check whether the length mismatch is caused by incorrect padding.
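For phoneme-level training, the pitch section of `preprocess.yaml` would look roughly like this (the value strings may differ between versions of the repo, so check the sample configs shipped with it):

```yaml
preprocessing:
  pitch:
    feature: "phoneme_level"   # per-phoneme pitch; "frame_level" gives per-frame values
    normalization: True
  energy:
    feature: "phoneme_level"
    normalization: True
```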
The best solution is to run the `mfa train xxx` command to generate the TextGrid files again and then run preprocess.py (even running the `mfa align xxx` command with an alignment model trained on another dataset with the same lexicon may not work).
I should document my solution.
It turns out that my input is not ARPAbet, but when I call preprocess.py it marks my text in curly braces, signifying to later steps that it is ARPAbet, even if it isn't.
The problem lies in the dataset. It reads a pitch and text that have the right sizes, but then it does processing on the text that changes its length.

- I hacked `train.txt` and `val.txt` by removing the curly braces.
- I augmented `symbols` with my own symbols/phones.
- I changed line #37 of `text/__init__.py` in `def text_to_sequence(text, cleaner_names):` to `sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names).split())`, i.e. I added the `.split()`.
> It turns out that my input is not ARPAbet, but when I call preprocess.py it marks my text in curly braces, signifying to later steps that it is ARPAbet, even if it isn't.

Does this mean you are training at the frame level, not the phoneme level?
> I augmented `symbols` with my own symbols/phones.

I don't define my own symbols/phones list; I used the pretrained MFA model from their website. What should I put in the `symbols.py` and `cmudict.py` files?
> I changed line #37 of `text/__init__.py` in `def text_to_sequence(text, cleaner_names):` to `sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names).split())`, i.e. I added the `.split()`.

What is the point of adding this `.split()`?
FYI, I would like to point out that I'm not an expert. I'm new to TTS and I'm still learning.

- I still want to train on phonemes.
- I trained my own `mfa` model. I also manually added my own phones to `symbols`. If you are already using `mfa`'s phones, you probably don't need to add any symbols.
- Why `.split()`? Because my input looks like `a n: k t on:`, so some of my phones are represented by more than one character. If I don't split on spaces, my input is handled as an array of characters, so instead of processing `n:` the function will handle the two characters separately: `n` followed by `:`. In my case, `len(text) != len(text.split())`. My pitch matrices have length `len(text.split())`, not `len(text)` (see the sketch after this list). I haven't tested this yet, but I could probably solve my problem by converting my current alphabet into one-letter phones.
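A minimal sketch of that length mismatch, using the example input above:

```python
# Multi-character phones such as "n:" collapse to single symbols only after
# splitting on whitespace; iterating over the raw string yields one entry
# per character instead.
text = "a n: k t on:"

print(len(text))          # 12 -> one entry per character (spaces included)
print(len(text.split()))  # 5  -> ['a', 'n:', 'k', 't', 'on:']
```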
My goal was to get FastSpeech2 to train even if it meant hacking the pipeline. I don't like my current solution; it is not generic enough for my liking. I'm still working on finding a good solution.
I ran into the same issue.
I found out why the shape is wrong: the original file text/symbols.py has too few symbols. I fixed it like this:
""" from https://github.com/keithito/tacotron """
"""
Defines the set of symbols used in text input to the model.
The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. """
from text import cmudict, pinyin
_pad = "_"
_punctuation = "!'(),.:;? "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_silences = ["@sp", "@spn", "@sil"]
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ["@" + s for s in cmudict.valid_symbols]
#==========加上自己字典的特殊pinyin符号.
with open('madarin_lexicon.txt') as f:
tmp=f.readlines()
tmp=[i.strip().split(' ')[1:] for i in tmp]
tmp2=[]
for i in tmp:
tmp2+=i
print(tmp2)
tmp2=list(set(tmp2))
print(len(tmp2))
_pinyin = ["@" + s for s in pinyin.valid_symbols]#===========这个地方要自己添加.
print('old',len(_pinyin))
print(_pinyin)
_pinyin += ["@" + s for s in tmp2]#===========这个地方要自己添加.
print(_pinyin)
print('new',len(_pinyin))
pass
print(1)
# Export all symbols:
symbols = (
[_pad]
+ list(_special)
+ list(_punctuation)
+ list(_letters)
+ _arpabet
+ _pinyin
+ _silences
)
# print("打印全的不symbols",symbols)
with open("当前使用的symbols是",'w')as f :
f.write(str(symbols))
#=============symbols要自己手动加入自己需要的汉语拼音才行!!!!!!!!
and download https://github.com/Jackiexiao/MTTS/blob/master/misc/mandarin-for-montreal-forced-aligner-pre-trained-model.lexicon
https://github.com/zhangbo2008/fastSpeeck2_chinese_train — here is my fixed version. It comes with some small sample data; just running main is enough.
@zhangbo2008 Hi, I applied your fix with our Vietnamese dataset, but it still shows `RuntimeError: The size of tensor a (92) must match the size of tensor b (95) at non-singleton dimension 1`. This is my code after the change:
```python
from text import cmudict, pinyin

_pad = "_"
_punctuation = "!'(),.:;? "
_special = "-"
_letters = "aàáảãạăằắẳẵặâầấẩẫậbcdđeèéẻẽẹêềếểễệghiìíỉĩịklmnoòóỏõọôồốổỗộơờớởỡợpqrstuùúủũụưừứửữựvxyỳýỷỹỵ"
_silences = ["@sp", "@spn", "@sil"]

# Added from the Chinese fix in https://github.com/ming024/FastSpeech2/issues/66
with open("/mnt/9365f469-af3c-437f-9a58-546628b1869a/fastspeech2/FastSpeech2/lexicon/vietnamese_lexicon.txt") as f:
    tmp = f.readlines()
tmp = [i.strip().split(" ")[1:] for i in tmp]
tmp2 = []
for i in tmp:
    tmp2 += i
tmp2 = list(set(tmp2))
print(len(tmp2))

_pinyin = ["@" + s for s in pinyin.valid_symbols]
print("old", len(_pinyin))
_pinyin += ["@" + s for s in tmp2]
print("new", len(_pinyin))

# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ["@" + s for s in cmudict.valid_symbols]

# Export all symbols:
symbols = (
    [_pad]
    + list(_special)
    + list(_punctuation)
    + list(_letters)
    + _silences
    + _pinyin
    + _arpabet
)
```
Besides editing symbols.py, are there any other files that you edited? Thank you so much.
I didn't edit any other files. You can debug your code and check whether the phoneme size == pitch size; that is how I solved this problem for the Mandarin training. You should check whether all the phonemes you get are in the `symbols` variable. You can see all your phonemes in the *.TextGrid files.
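For reference, a minimal sketch of such a check, assuming the pipe-separated `train.txt` ("basename|speaker|text|raw_text") and the per-utterance `{speaker}-pitch-{basename}.npy` files produced by preprocess.py; the dataset path and cleaner name below are placeholders to adjust to your setup:

```python
# Hypothetical consistency check: compare phoneme-sequence length against
# pitch length for every utterance in the preprocessed training list.
import os
import numpy as np
from text import text_to_sequence

preprocessed_path = "./preprocessed_data/MyDataset"  # placeholder path

with open(os.path.join(preprocessed_path, "train.txt"), encoding="utf-8") as f:
    for line in f:
        basename, speaker, text, raw_text = line.strip("\n").split("|")
        phonemes = text_to_sequence(text, ["english_cleaners"])  # use your cleaners
        pitch = np.load(os.path.join(
            preprocessed_path, "pitch", f"{speaker}-pitch-{basename}.npy"))
        if len(phonemes) != len(pitch):
            print(basename, len(phonemes), len(pitch))
```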

> I didn't edit any other files. You can debug your code and check whether the phoneme size == pitch size; that is how I solved this problem for the Mandarin training. You should check whether all the phonemes you get are in the `symbols` variable. You can see all your phonemes in the *.TextGrid files.
@zhangbo2008 How can I check the phoneme size and pitch size? I guess we can't just read the .TextGrid files one by one.
By the way, can you sum up all the steps you took to train this model on Mandarin? Maybe it will help one way or another.
It seems the pitch embedding is not working properly. I printed out x and pitch_embedding; they do not have the same shape, so they can't be added, which causes this error: https://github.com/ming024/FastSpeech2/blob/master/model/modules.py#L121
Please take a look at this issue @ming024
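For context, the shapes quoted in the original post point at the cause: when a ground-truth pitch target is passed in, the embedding is looked up per target value, so it inherits the pitch target's length rather than the phoneme length. A paraphrased sketch of that lookup (not the repo's exact code; the bin edges here are made up):

```python
import torch
import torch.nn as nn

n_bins, hidden = 256, 256
embedding_table = nn.Embedding(n_bins, hidden)
pitch_bins = torch.linspace(-1.0, 1.0, n_bins - 1)  # hypothetical bin edges

pitch_target = torch.zeros(16, 201)  # (batch, pitch_len), e.g. frame-level pitch
pitch_embedding = embedding_table(torch.bucketize(pitch_target, pitch_bins))
print(pitch_embedding.shape)  # torch.Size([16, 201, 256]) -- follows the target
                              # length, not the phoneme length (102), so
                              # `x + pitch_embedding` fails
```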
https://github.com/zhangbo2008/fastSpeeck2_chinese_train — you can check my project. That is the Mandarin version that can run.

> https://github.com/zhangbo2008/fastSpeeck2_chinese_train — you can check my project. That is the Mandarin version that can run.
I can't run your repo. Please write a guideline.
main.py

> main.py

I can't just run that file; there are no .lab, .TextGrid, pitch, or energy files...
I saved the data here, my friend: https://github.com/zhangbo2008/fastSpeeck2_chinese_train/tree/main/raw_path/AISHELL-3-Sample/SSB1711
@zhangbo2008 I trained the model successfully, but testing the model goes wrong; I find that the text does not match the text in the audio. Is there any file that needs to be modified?

> @zhangbo2008 I trained the model successfully, but testing the model goes wrong; I find that the text does not match the text in the audio. Is there any file that needs to be modified?
I haven't trained it; it is too time-consuming. You can use the provided pretrained weights instead. I just made my code run correctly.
> I found out why the shape is wrong: the original file text/symbols.py has too few symbols. I fixed it like this:
Thank you! This script solved my problem.