DeepPhonemizer
some character sets don't work
Hi. I'm working on this shared task:
https://github.com/sigmorphon/2022G2PST
Some of the character sets work fine, but others do not, specifically: Persian, Bengali, and Thai.
Persian and Bengali fail when training begins. Thai fails at inference.
Any ideas why this might be so?
I'm appending the error below. The problem seems to be in training/trainer.py.
thank you,
mike h.
(mhenv) mhammond@SBS-7337:~/Dropbox/fromlapper/sigmorphon2022/deep$ python doit.py
per
{'ن', 'و', 'ج', 'ل', 'ژ', 'س', 'ض', 'ذ', 'ت', 'ه', 'ر', '\u200c', 'ث', 'ظ', 'ش', 'ا', 'ع', 'ئ', 'م', 'غ', 'ە', 'ص', 'ح', 'آ', 'ء', 'پ', 'چ', 'گ', 'خ', 'ف', 'ی', 'ق', 'ز', 'د', 'ک', 'ب'}
2022-05-22 15:26:50,656.656 INFO preprocess: Preprocessing, train data: with 100 files.
2022-05-22 15:26:50,656.656 INFO preprocess: Processing train data...
100%|██████████████████████████████████████| 100/100 [00:00<00:00, 86178.43it/s]
2022-05-22 15:26:50,659.659 INFO preprocess:
Saving datasets to: /home/mhammond/Desktop/datasets
2022-05-22 15:26:50,660.660 INFO preprocess: Preprocessing.
Train counts (deduplicated): [('per', 100)]
Val counts (including duplicates): [('per', 56)]
2022-05-22 15:26:50,662.662 INFO train: Initializing new model from config...
2022-05-22 15:26:50,742.742 INFO train: Checkpoints will be stored at /home/mhammond/Desktop/checkpoints
Traceback (most recent call last):
  File "doit.py", line 79, in <module>
    train(config_file=lang+'.yaml')
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/train.py", line 57, in train
    trainer.train(model=model,
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/trainer.py", line 89, in train
    val_batches = sorted([b for b in val_loader], key=lambda x: -x['text_len'][0])
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/trainer.py", line 89, in <listcomp>
    val_batches = sorted([b for b in val_loader], key=lambda x: -x['text_len'][0])
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 569, in _next_data
    index = self._next_index() # may raise StopIteration
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in _next_index
    return next(self._sampler_iter) # may raise StopIteration
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
    for idx in self.sampler:
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/dataset.py", line 54, in __iter__
    binned_idx = np.stack(bins).reshape(-1)
  File "<__array_function__ internals>", line 180, in stack
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
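For context on the traceback: the exception comes from numpy itself, which raises this error whenever np.stack is given an empty list. In other words, the sampler in dp/training/dataset.py apparently ends up with zero bins of validation indices to stack. A minimal standalone sketch of that failure mode (not DeepPhonemizer code, just reproducing the numpy call from line 54 of dataset.py; the empty list is hypothetical):

import numpy as np

# If no validation batches can be formed, the list of index bins is empty
# and np.stack fails exactly as in the traceback above.
bins = []  # hypothetical: zero bins collected for the validation set
binned_idx = np.stack(bins).reshape(-1)  # ValueError: need at least one array to stack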
Hi, it seems that you only process 100 files? To me it looks like there is not enough data to build up the batches during training.
There are 10 language samples of 100 pairs each. The others all work fine; it's just the three above that don't. mike h
I see. That's quite little data, to be honest; the model would need more like 10,000-100,000 pairs to learn from. Anyway, there doesn't seem to be enough data to produce a validation batch. Maybe the batch size is too large?
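To make the batch-size suggestion concrete: the validation batch size is a separate entry (batch_size_val in the training section of the config shown further down this thread). A hedged standalone sketch that lowers it in the generated per-language yaml ('per.yaml' is the file the driver script below writes out):

import yaml

with open('per.yaml') as f:
    cfg = yaml.load(f, Loader=yaml.FullLoader)

# keep the validation batch size well below the 56 validation pairs reported above
cfg['training']['batch_size_val'] = 8
with open('per.yaml', 'w') as f:
    yaml.dump(cfg, stream=f, allow_unicode=True)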
Hi, I certainly agree that 100 pairs is crazy small. BUT your system works with that size for the other languages, just not for the three above. I'm assuming the system has a problem with the character encoding? mike h.
Doesn't seem so; after preprocessing there should be no filtering anymore. Could you share your config file? To me it looks like there are not enough validation samples to build up a batch (maybe the batch size is 64?).
As I said above, that can't be the cause. I'm running this on 10 different languages, each with 100 pairs, and only the three languages I cited above fail.
I'm appending the master Python file and the template YAML file below.
import yaml, re
from dp.preprocess import preprocess
from dp.train import train
from dp.phonemizer import Phonemizer

pfx = '/data/2022G2PST-main/'
infix = 'data/target_languages/'
langs = ['per']  # 'ben ger ita per swe tgl tha ukr'.split()
# not working: ben, per
# later not working: tha

# get master yaml data
with open('master.yaml') as f:
    yamldata = yaml.load(f, Loader=yaml.FullLoader)

# go through the languages one by one
for lang in langs:
    print(lang)
    # get character sets
    alpha = set()
    ipa = set()
    datasets = []
    # go through each data file
    for sfx in ['_dev.tsv', '_100_train.tsv', '_test.tsv']:
        f = open(pfx + infix + lang + sfx, 'r')
        t = f.read()
        f.close()
        # remove empty line
        t = t.split('\n')
        t = t[:-1]
        # save the data
        datasets.append(t)
        # test data has no transcription
        if sfx == '_test.tsv':
            for line in t:
                for letter in line:
                    alpha.add(letter)
        # get spelling and transcription for other files
        else:
            for line in t:
                letters, trans = line.split('\t')
                for letter in letters:
                    alpha.add(letter)
                for letter in trans:
                    ipa.add(letter)
    # get rid of the space in the transcription set
    ipa.remove(' ')
    # update yamldata
    yamldata['preprocessing']['languages'] = [lang]
    yamldata['preprocessing']['text_symbols'] = ''.join(alpha)
    yamldata['preprocessing']['phoneme_symbols'] = ''.join(ipa)
    # yamldata['preprocessing']['text_symbols'] = list(alpha)
    # yamldata['preprocessing']['phoneme_symbols'] = list(ipa)
    # make new yaml file
    with open(lang + '.yaml', 'w') as f:
        yaml.dump(yamldata, stream=f, allow_unicode=True)
    devdata = []
    for line in datasets[0]:
        word, trans = line.split('\t')
        trans = re.sub(' ', '', trans)
        devdata.append((lang, word, trans))
    traindata = []
    for line in datasets[1]:
        word, trans = line.split('\t')
        trans = re.sub(' ', '', trans)
        traindata.append((lang, word, trans))
    testdata = []
    for line in datasets[2]:
        testdata.append((lang, line))
    preprocess(
        config_file=lang + '.yaml',
        train_data=traindata,
        val_data=devdata,
        deduplicate_train_data=False
    )
    train(config_file=lang + '.yaml')
    phonemizer = Phonemizer.from_checkpoint(
        '/home/mhammond/Desktop/checkpoints/latest_model.pt'
    )
    errors = 0
    # for _, word, trans in devdata:
    for _, word in testdata:
        # trans = re.sub(' ', '', trans)
        phonemes = phonemizer(
            word,
            lang=lang
        )
        print(word, phonemes)
        # if phonemes != trans: errors += 1
    # print(f'{lang} errors: {errors}')
paths:
  checkpoint_dir: /home/mhammond/Desktop/checkpoints
  data_dir: /home/mhammond/Desktop/datasets

preprocessing:
  languages: ['de', 'en_us']
  text_symbols: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJ'
  phoneme_symbols: ['a', 'b', 'd', 'e', 'f', 'g', 'h']
  char_repeats: 3
  lowercase: true
  n_val: 5000

model:
  type: 'transformer'
  d_model: 512
  d_fft: 1024
  layers: 6
  dropout: 0.1
  heads: 4

training:
  learning_rate: 0.0001
  warmup_steps: 10000
  scheduler_plateau_factor: 0.5
  scheduler_plateau_patience: 10
  batch_size: 32
  batch_size_val: 32
  epochs: 10
  generate_steps: 10000
  validate_steps: 10000
  checkpoint_steps: 100000
  n_generate_samples: 10
  store_phoneme_dict_in_model: true
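A quick standalone way to probe the character-encoding hypothesis raised above is to compare the characters in the dev (validation) words against the text_symbols written into the generated per-language yaml. The sketch below assumes the same paths and TSV layout as the doit.py script above, and that the template's lowercase: true setting is carried over; adjust as needed:

import yaml

pfx = '/data/2022G2PST-main/'
infix = 'data/target_languages/'
lang = 'per'

# symbols the generated config declares for this language
with open(lang + '.yaml') as f:
    cfg = yaml.load(f, Loader=yaml.FullLoader)
symbols = set(cfg['preprocessing']['text_symbols'])

# characters actually occurring in the dev words
with open(pfx + infix + lang + '_dev.tsv') as f:
    lines = [l for l in f.read().split('\n') if l]

missing = set()
for line in lines:
    word = line.split('\t')[0]
    # the template config sets lowercase: true, so compare lowercased characters
    missing |= set(word.lower()) - symbols

print('dev entries:', len(lines))
print('dev characters not covered by text_symbols:', missing)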
Yeah, that's odd. You could look into the data dir; there is a combined_dataset.txt that stores all the processed tuples as text (after removing out-of-dict phonemes and chars). If that looks good, you could unpickle val_dataset.pkl and inspect it as well. It might be that too much is being filtered.
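For reference, combined_dataset.txt is plain text and can simply be opened in an editor. For the pickle, a minimal sketch of the suggested inspection, using the data_dir from the config above; the exact structure of the pickled object may differ between DeepPhonemizer versions, so treat this as a starting point rather than a guaranteed recipe:

import pickle

# data_dir from the config above
with open('/home/mhammond/Desktop/datasets/val_dataset.pkl', 'rb') as f:
    val_data = pickle.load(f)

print('validation items after preprocessing:', len(val_data))
# peek at a few items to see whether the texts/phonemes survived filtering
for item in list(val_data)[:5]:
    print(item)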