FastSpeech2
Text and Pitch Matrices of Different Shapes
Hi,
I'm trying to train FastSpeech2 on my own data, but I'm getting the following error:
```
Traceback (most recent call last):
  File "../../code/FastSpeech2.git/model/modules.py", line 121, in forward
    x = x + pitch_embedding
RuntimeError: The size of tensor a (69) must match the size of tensor b (130) at non-singleton dimension 1
```
I've been trying to understand why I'm getting this error. If I look at FastSpeech2's `forward()`, it calls `VarianceAdaptor`'s `forward()`, which embeds the pitch and then tries to add those embeddings to the text embeddings. Here are the shapes of the tensors before and after embedding the pitch, and the shapes just before the RuntimeError:
```python
x.shape, pitch_target.shape, src_mask.shape, p_control
# (torch.Size([16, 102, 256]), torch.Size([16, 201]), torch.Size([16, 102]), 1.0)

pitch_prediction, pitch_embedding = self.get_pitch_embedding(
    x, pitch_target, src_mask, p_control
)
pitch_prediction.shape, pitch_embedding.shape
# (torch.Size([16, 102]), torch.Size([16, 201, 256]))

x.shape, pitch_embedding.shape
# (torch.Size([16, 102, 256]), torch.Size([16, 201, 256]))
x = x + pitch_embedding  # RuntimeError raised here
```
We can clearly see that my text embeddings are not the same shape as the pitch embeddings, which causes the error.
How come my text and pitch are of different shapes? I mean, those are loaded from disk and were created by FastSpeech2 and MFA.
What do I need to change, probably in my preprocessing, to get the correct shapes?
@SamuelLarkin It seems that the length of your pitch sequence is longer than the length of the phoneme sequence. Did you set `preprocessing.pitch.feature` in `preprocess.yaml` to "frame-level" while preprocessing the audio files? If not, you should check whether the length mismatch is caused by incorrect padding.
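For phoneme-level training, the pitch section of `preprocess.yaml` would look roughly like this (the value strings may differ between versions of the repo, so check the sample configs shipped with it):

```yaml
preprocessing:
  pitch:
    feature: "phoneme_level"   # per-phoneme pitch; "frame_level" gives per-frame values
    normalization: True
  energy:
    feature: "phoneme_level"
    normalization: True
```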
The best solution is to run the `mfa train xxx` command to generate the TextGrid files again and then run preprocess.py (even running the `mfa align xxx` command with an alignment model trained on another dataset with the same lexicon may not work).
I should document my solution.
It turns out that my input is not ARPAbet, but when I call preprocess.py it marks my text in curly braces, signifying to later steps that it is ARPAbet, even if it isn't.
The problem lies in the dataset. It reads a pitch and text that have the right sizes, but then it does processing on the text that changes its length.

- I hacked `train.txt` and `val.txt` by removing the curly braces.
- I augmented `symbols` with my own symbols/phones.
- I changed line #37 of `text/__init__.py` in `def text_to_sequence(text, cleaner_names):` to `sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names).split())`, i.e. I added the `.split()`.
> It turns out that my input is not ARPAbet, but when I call preprocess.py it marks my text in curly braces, signifying to later steps that it is ARPAbet, even if it isn't.

Does this mean you are training at the frame level, not the phoneme level?
> I augmented `symbols` with my own symbols/phones.

I don't define my own symbols/phones list; I used the pretrained MFA model from their website. What should I put in the `symbols.py` and `cmudict.py` files?
> I changed line #37 of `text/__init__.py` in `def text_to_sequence(text, cleaner_names):` to `sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names).split())`, i.e. I added the `.split()`.

What is the point of adding this `.split()`?
FYI, I would like to point out that I'm not an expert. I'm new to TTS and I'm still learning.

- I still want to train on phonemes.
- I trained my own `mfa` model. I also manually added my own phones to `symbols`. If you are already using `mfa`'s phones, you probably don't need to add any symbols.
- Why `.split()`? Because my input looks like `a n: k t on:`, so some of my phones are represented by more than one character. If I don't split on spaces, my input is handled as an array of characters, so instead of processing `n:` the function will handle the two characters separately: `n` followed by `:`. In my case, `len(text) != len(text.split())`. My pitch matrices have length `len(text.split())`, not `len(text)` (see the sketch after this list). I haven't tested this yet, but I could probably solve my problem by converting my current alphabet into one-letter phones.
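A minimal sketch of that length mismatch, using the example input above:

```python
# Multi-character phones such as "n:" collapse to single symbols only after
# splitting on whitespace; iterating over the raw string yields one entry
# per character instead.
text = "a n: k t on:"

print(len(text))          # 12 -> one entry per character (spaces included)
print(len(text.split()))  # 5  -> ['a', 'n:', 'k', 't', 'on:']
```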
My goal was to get FastSpeech2 to train even if it meant hacking the pipeline. I don't like my current solution; it is not generic enough for my liking. I'm still working on finding a good solution.
I ran into the same issue.
I found out why the shape is wrong: the original file text/symbols.py has too few symbols. I fixed it like this:
""" from https://github.com/keithito/tacotron """
"""
Defines the set of symbols used in text input to the model.
The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. """
from text import cmudict, pinyin
_pad = "_"
_punctuation = "!'(),.:;? "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_silences = ["@sp", "@spn", "@sil"]
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ["@" + s for s in cmudict.valid_symbols]
#==========加上自己字典的特殊pinyin符号.
with open('madarin_lexicon.txt') as f:
tmp=f.readlines()
tmp=[i.strip().split(' ')[1:] for i in tmp]
tmp2=[]
for i in tmp:
tmp2+=i
print(tmp2)
tmp2=list(set(tmp2))
print(len(tmp2))
_pinyin = ["@" + s for s in pinyin.valid_symbols]#===========这个地方要自己添加.
print('old',len(_pinyin))
print(_pinyin)
_pinyin += ["@" + s for s in tmp2]#===========这个地方要自己添加.
print(_pinyin)
print('new',len(_pinyin))
pass
print(1)
# Export all symbols:
symbols = (
[_pad]
+ list(_special)
+ list(_punctuation)
+ list(_letters)
+ _arpabet
+ _pinyin
+ _silences
)
# print("打印全的不symbols",symbols)
with open("当前使用的symbols是",'w')as f :
f.write(str(symbols))
#=============symbols要自己手动加入自己需要的汉语拼音才行!!!!!!!!
and download https://github.com/Jackiexiao/MTTS/blob/master/misc/mandarin-for-montreal-forced-aligner-pre-trained-model.lexicon
https://github.com/zhangbo2008/fastSpeeck2_chinese_train — here is my fixed version. It comes with some small sample data; just running main is enough.
@zhangbo2008 Hi, I applied your fix with our Vietnamese dataset, but it still shows `RuntimeError: The size of tensor a (92) must match the size of tensor b (95) at non-singleton dimension 1`. This is my code after the change:
```python
from text import cmudict, pinyin

_pad = "_"
_punctuation = "!'(),.:;? "
_special = "-"
_letters = "aàáảãạăằắẳẵặâầấẩẫậbcdđeèéẻẽẹêềếểễệghiìíỉĩịklmnoòóỏõọôồốổỗộơờớởỡợpqrstuùúủũụưừứửữựvxyỳýỷỹỵ"
_silences = ["@sp", "@spn", "@sil"]

# Added from the Chinese fix in https://github.com/ming024/FastSpeech2/issues/66
with open("/mnt/9365f469-af3c-437f-9a58-546628b1869a/fastspeech2/FastSpeech2/lexicon/vietnamese_lexicon.txt") as f:
    tmp = f.readlines()
tmp = [i.strip().split(" ")[1:] for i in tmp]
tmp2 = []
for i in tmp:
    tmp2 += i
tmp2 = list(set(tmp2))
print(len(tmp2))

_pinyin = ["@" + s for s in pinyin.valid_symbols]
print("old", len(_pinyin))
_pinyin += ["@" + s for s in tmp2]
print("new", len(_pinyin))

# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ["@" + s for s in cmudict.valid_symbols]

# Export all symbols:
symbols = (
    [_pad]
    + list(_special)
    + list(_punctuation)
    + list(_letters)
    + _silences
    + _pinyin
    + _arpabet
)
```
Besides editing symbols.py, are there any other files that you edited? Thank you so much.
I didn't edit any other files. You can debug your code and check whether the phoneme size == pitch size; that is how I solved this problem for the Mandarin training. You should check whether all the phonemes you get are in the `symbols` variable. You can see all your phonemes in the *.TextGrid files.
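For reference, a minimal sketch of such a check, assuming the pipe-separated `train.txt` ("basename|speaker|text|raw_text") and the per-utterance `{speaker}-pitch-{basename}.npy` files produced by preprocess.py; the dataset path and cleaner name below are placeholders to adjust to your setup:

```python
# Hypothetical consistency check: compare phoneme-sequence length against
# pitch length for every utterance in the preprocessed training list.
import os
import numpy as np
from text import text_to_sequence

preprocessed_path = "./preprocessed_data/MyDataset"  # placeholder path

with open(os.path.join(preprocessed_path, "train.txt"), encoding="utf-8") as f:
    for line in f:
        basename, speaker, text, raw_text = line.strip("\n").split("|")
        phonemes = text_to_sequence(text, ["english_cleaners"])  # use your cleaners
        pitch = np.load(os.path.join(
            preprocessed_path, "pitch", f"{speaker}-pitch-{basename}.npy"))
        if len(phonemes) != len(pitch):
            print(basename, len(phonemes), len(pitch))
```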

> I didn't edit any other files. You can debug your code and check whether the phoneme size == pitch size; that is how I solved this problem for the Mandarin training. You should check whether all the phonemes you get are in the `symbols` variable. You can see all your phonemes in the *.TextGrid files.
@zhangbo2008 How can I check the phoneme size and pitch size? I guess we can't just read the .TextGrid files one by one.
By the way, can you sum up all the steps you took to train this model on Mandarin? Maybe it will help one way or another.
It seems the pitch embedding is not working properly. I printed out x and pitch_embedding; they do not have the same shape, so they can't be added, which causes this error: https://github.com/ming024/FastSpeech2/blob/master/model/modules.py#L121
Please take a look at this issue @ming024
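For context, the shapes quoted in the original post point at the cause: when a ground-truth pitch target is passed in, the embedding is looked up per target value, so it inherits the pitch target's length rather than the phoneme length. A paraphrased sketch of that lookup (not the repo's exact code; the bin edges here are made up):

```python
import torch
import torch.nn as nn

n_bins, hidden = 256, 256
embedding_table = nn.Embedding(n_bins, hidden)
pitch_bins = torch.linspace(-1.0, 1.0, n_bins - 1)  # hypothetical bin edges

pitch_target = torch.zeros(16, 201)  # (batch, pitch_len), e.g. frame-level pitch
pitch_embedding = embedding_table(torch.bucketize(pitch_target, pitch_bins))
print(pitch_embedding.shape)  # torch.Size([16, 201, 256]) -- follows the target
                              # length, not the phoneme length (102), so
                              # `x + pitch_embedding` fails
```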
https://github.com/zhangbo2008/fastSpeeck2_chinese_train — you can check my project. That is the Mandarin version that can run.

> https://github.com/zhangbo2008/fastSpeeck2_chinese_train — you can check my project. That is the Mandarin version that can run.
I can't run your repo. Please write a guideline.
main.py

> main.py

I can't just run that file; there are no .lab, .TextGrid, pitch, or energy files...
I saved the data here, my friend: https://github.com/zhangbo2008/fastSpeeck2_chinese_train/tree/main/raw_path/AISHELL-3-Sample/SSB1711
@zhangbo2008 I trained the model successfully, but testing the model goes wrong; I find that the text does not match the text in the audio. Is there any file that needs to be modified?

> @zhangbo2008 I trained the model successfully, but testing the model goes wrong; I find that the text does not match the text in the audio. Is there any file that needs to be modified?
I haven't trained it; it is too time-consuming. You can use the provided pretrained weights instead. I just made my code run correctly.
> I found out why the shape is wrong: the original file text/symbols.py has too few symbols. I fixed it like this:
Thank you! This script solved my problem.