DeepPhonemizer
some character sets don't work
Hi. I'm working on this shared task:
https://github.com/sigmorphon/2022G2PST
Some of the character sets work fine, but others do not, specifically: Persian, Bengali, and Thai.
Persian and Bengali fail when training begins. Thai fails at inference.
Any ideas why this might be so?
I'm appending the error below. The problem seems to be in training/trainer.py.
thank you,
mike h.
(mhenv) mhammond@SBS-7337:~/Dropbox/fromlapper/sigmorphon2022/deep$ python doit.py
per
{'ن', 'و', 'ج', 'ل', 'ژ', 'س', 'ض', 'ذ', 'ت', 'ه', 'ر', '\u200c', 'ث', 'ظ', 'ش', 'ا', 'ع', 'ئ', 'م', 'غ', 'ە', 'ص', 'ح', 'آ', 'ء', 'پ', 'چ', 'گ', 'خ', 'ف', 'ی', 'ق', 'ز', 'د', 'ک', 'ب'}
2022-05-22 15:26:50,656.656 INFO preprocess: Preprocessing, train data: with 100 files.
2022-05-22 15:26:50,656.656 INFO preprocess: Processing train data...
100%|██████████████████████████████████████| 100/100 [00:00<00:00, 86178.43it/s]
2022-05-22 15:26:50,659.659 INFO preprocess:
Saving datasets to: /home/mhammond/Desktop/datasets
2022-05-22 15:26:50,660.660 INFO preprocess: Preprocessing.
Train counts (deduplicated): [('per', 100)]
Val counts (including duplicates): [('per', 56)]
2022-05-22 15:26:50,662.662 INFO train: Initializing new model from config...
2022-05-22 15:26:50,742.742 INFO train: Checkpoints will be stored at /home/mhammond/Desktop/checkpoints
Traceback (most recent call last):
  File "doit.py", line 79, in <module>
    train(config_file=lang+'.yaml')
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/train.py", line 57, in train
    trainer.train(model=model,
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/trainer.py", line 89, in train
    val_batches = sorted([b for b in val_loader], key=lambda x: -x['text_len'][0])
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/trainer.py", line 89, in <listcomp>
    val_batches = sorted([b for b in val_loader], key=lambda x: -x['text_len'][0])
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 569, in _next_data
    index = self._next_index() # may raise StopIteration
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in _next_index
    return next(self._sampler_iter) # may raise StopIteration
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
    for idx in self.sampler:
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/dp/training/dataset.py", line 54, in __iter__
    binned_idx = np.stack(bins).reshape(-1)
  File "<__array_function__ internals>", line 180, in stack
  File "/home/mhammond/Desktop/mhenv/lib/python3.8/site-packages/numpy/core/shape_base.py", line 422, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack
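For context on the traceback: the exception comes from numpy itself, which raises this error whenever np.stack is given an empty list. In other words, the sampler in dp/training/dataset.py apparently ends up with zero bins of validation indices to stack. A minimal standalone sketch of that failure mode (not DeepPhonemizer code, just reproducing the numpy call from line 54 of dataset.py; the empty list is hypothetical):

import numpy as np

# If no validation batches can be formed, the list of index bins is empty
# and np.stack fails exactly as in the traceback above.
bins = []  # hypothetical: zero bins collected for the validation set
binned_idx = np.stack(bins).reshape(-1)  # ValueError: need at least one array to stack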
Hi, it seems that you only process 100 files? To me it looks like there is not enough data to build up the batches during training.
There are 10 language samples of 100 pairs each. The others all work fine; it's just the three above that don't. mike h
I see. That's quite little data, to be honest; the model would need more like 10,000-100,000 pairs to learn from. Anyway, there doesn't seem to be enough data to produce a validation batch. Maybe the batch size is too large?
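To make the batch-size suggestion concrete: the validation batch size is a separate entry (batch_size_val in the training section of the config shown further down this thread). A hedged standalone sketch that lowers it in the generated per-language yaml ('per.yaml' is the file the driver script below writes out):

import yaml

with open('per.yaml') as f:
    cfg = yaml.load(f, Loader=yaml.FullLoader)

# keep the validation batch size well below the 56 validation pairs reported above
cfg['training']['batch_size_val'] = 8
with open('per.yaml', 'w') as f:
    yaml.dump(cfg, stream=f, allow_unicode=True)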
Hi, I certainly agree that 100 pairs is crazy small. BUT your system works with that size for the other languages, just not for the three above. I'm assuming the system has a problem with the character encoding? mike h.
Doesn't seem so; after preprocessing there should be no filtering anymore. Could you share your config file? To me it looks like there are not enough validation samples to build up a batch (maybe the batch size is 64?).
As I said above, that can't be the cause. I'm running this on 10 different languages, each with 100 pairs, and only the three languages I cited above fail.
I'm appending the master Python file and the template YAML file below.
import yaml, re
from dp.preprocess import preprocess
from dp.train import train
from dp.phonemizer import Phonemizer

pfx = '/data/2022G2PST-main/'
infix = 'data/target_languages/'
langs = ['per']  # 'ben ger ita per swe tgl tha ukr'.split()
# not working: ben, per
# later not working: tha

# get master yaml data
with open('master.yaml') as f:
    yamldata = yaml.load(f, Loader=yaml.FullLoader)

# go through the languages one by one
for lang in langs:
    print(lang)
    # get character sets
    alpha = set()
    ipa = set()
    datasets = []
    # go through each data file
    for sfx in ['_dev.tsv', '_100_train.tsv', '_test.tsv']:
        f = open(pfx + infix + lang + sfx, 'r')
        t = f.read()
        f.close()
        # remove empty line
        t = t.split('\n')
        t = t[:-1]
        # save the data
        datasets.append(t)
        # test data has no transcription
        if sfx == '_test.tsv':
            for line in t:
                for letter in line:
                    alpha.add(letter)
        # get spelling and transcription for other files
        else:
            for line in t:
                letters, trans = line.split('\t')
                for letter in letters:
                    alpha.add(letter)
                for letter in trans:
                    ipa.add(letter)
    # get rid of the space in the transcription set
    ipa.remove(' ')
    # update yamldata
    yamldata['preprocessing']['languages'] = [lang]
    yamldata['preprocessing']['text_symbols'] = ''.join(alpha)
    yamldata['preprocessing']['phoneme_symbols'] = ''.join(ipa)
    # yamldata['preprocessing']['text_symbols'] = list(alpha)
    # yamldata['preprocessing']['phoneme_symbols'] = list(ipa)
    # make new yaml file
    with open(lang + '.yaml', 'w') as f:
        yaml.dump(yamldata, stream=f, allow_unicode=True)
    devdata = []
    for line in datasets[0]:
        word, trans = line.split('\t')
        trans = re.sub(' ', '', trans)
        devdata.append((lang, word, trans))
    traindata = []
    for line in datasets[1]:
        word, trans = line.split('\t')
        trans = re.sub(' ', '', trans)
        traindata.append((lang, word, trans))
    testdata = []
    for line in datasets[2]:
        testdata.append((lang, line))
    preprocess(
        config_file=lang + '.yaml',
        train_data=traindata,
        val_data=devdata,
        deduplicate_train_data=False
    )
    train(config_file=lang + '.yaml')
    phonemizer = Phonemizer.from_checkpoint(
        '/home/mhammond/Desktop/checkpoints/latest_model.pt'
    )
    errors = 0
    # for _, word, trans in devdata:
    for _, word in testdata:
        # trans = re.sub(' ', '', trans)
        phonemes = phonemizer(
            word,
            lang=lang
        )
        print(word, phonemes)
        # if phonemes != trans: errors += 1
    # print(f'{lang} errors: {errors}')
paths:
  checkpoint_dir: /home/mhammond/Desktop/checkpoints
  data_dir: /home/mhammond/Desktop/datasets

preprocessing:
  languages: ['de', 'en_us']
  text_symbols: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJ'
  phoneme_symbols: ['a', 'b', 'd', 'e', 'f', 'g', 'h']
  char_repeats: 3
  lowercase: true
  n_val: 5000

model:
  type: 'transformer'
  d_model: 512
  d_fft: 1024
  layers: 6
  dropout: 0.1
  heads: 4

training:
  learning_rate: 0.0001
  warmup_steps: 10000
  scheduler_plateau_factor: 0.5
  scheduler_plateau_patience: 10
  batch_size: 32
  batch_size_val: 32
  epochs: 10
  generate_steps: 10000
  validate_steps: 10000
  checkpoint_steps: 100000
  n_generate_samples: 10
  store_phoneme_dict_in_model: true
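A quick standalone way to probe the character-encoding hypothesis raised above is to compare the characters in the dev (validation) words against the text_symbols written into the generated per-language yaml. The sketch below assumes the same paths and TSV layout as the doit.py script above, and that the template's lowercase: true setting is carried over; adjust as needed:

import yaml

pfx = '/data/2022G2PST-main/'
infix = 'data/target_languages/'
lang = 'per'

# symbols the generated config declares for this language
with open(lang + '.yaml') as f:
    cfg = yaml.load(f, Loader=yaml.FullLoader)
symbols = set(cfg['preprocessing']['text_symbols'])

# characters actually occurring in the dev words
with open(pfx + infix + lang + '_dev.tsv') as f:
    lines = [l for l in f.read().split('\n') if l]

missing = set()
for line in lines:
    word = line.split('\t')[0]
    # the template config sets lowercase: true, so compare lowercased characters
    missing |= set(word.lower()) - symbols

print('dev entries:', len(lines))
print('dev characters not covered by text_symbols:', missing)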
Yeah, that's odd. You could look into the data dir; there is a combined_dataset.txt that stores all the processed tuples as text (after removing out-of-dict phonemes and chars). If that looks good, you could unpickle val_dataset.pkl and inspect it as well. It might be that too much is being filtered.
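For reference, combined_dataset.txt is plain text and can simply be opened in an editor. For the pickle, a minimal sketch of the suggested inspection, using the data_dir from the config above; the exact structure of the pickled object may differ between DeepPhonemizer versions, so treat this as a starting point rather than a guaranteed recipe:

import pickle

# data_dir from the config above
with open('/home/mhammond/Desktop/datasets/val_dataset.pkl', 'rb') as f:
    val_data = pickle.load(f)

print('validation items after preprocessing:', len(val_data))
# peek at a few items to see whether the texts/phonemes survived filtering
for item in list(val_data)[:5]:
    print(item)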