nmt
Issue when training on Romanian-English data
Hi. I have tried to train an NMT model using files formatted exactly like the ones in the English-Vietnamese example. I made one pair of training files (one Romanian, one English) and two pairs of test files (two Romanian, two English). The data was downloaded from http://www.statmt.org/wmt16/translation-task.html, the link given in the tutorial.
I also built a vocabulary for each language by selecting the 50k most frequent words from the training data.
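For readers asking how such a vocabulary can be extracted, here is a minimal sketch, not the method actually used in this thread. It assumes whitespace-tokenized text, and it assumes the vocabulary file should start with the special tokens `<unk>`, `<s>`, `</s>`, as the tutorial's vocabulary files appear to. The names `build_vocab` and `write_vocab` are hypothetical helpers.

```python
from collections import Counter
from typing import Iterable, List

# Assumption: special tokens placed at the top of the vocabulary file.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_vocab(lines: Iterable[str], max_size: int = 50000) -> List[str]:
    """Return the special tokens followed by the max_size most frequent words."""
    counts = Counter()
    for line in lines:
        # Assumption: the corpus is already tokenized, so split on whitespace.
        counts.update(line.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [word for word, _ in ranked if word not in SPECIAL_TOKENS]
    return SPECIAL_TOKENS + words[:max_size]


def write_vocab(corpus_path: str, vocab_path: str, max_size: int = 50000) -> int:
    """Read a corpus file, write one vocabulary entry per line, return the entry count."""
    with open(corpus_path, encoding="utf-8") as corpus:
        vocab = build_vocab(corpus, max_size)
    with open(vocab_path, "w", encoding="utf-8") as out:
        out.write("\n".join(vocab) + "\n")
    return len(vocab)
```

For example, `write_vocab("/tmp/nmt_data/train.ro", "/tmp/nmt_data/vocab.ro")` would write at most 50,003 lines: the three special tokens plus up to 50k words. A corpus with fewer distinct words simply yields a shorter file.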
But when I try to train on this data, I get an error that I think comes from TensorFlow. I installed tf-nightly, as instructed in the tutorial. With the en-vi files training works, but with my en-ro files it does not. Can anyone help with this issue? What else should I change, besides the command:
python -m nmt.nmt \
    --src=ro --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012 \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu
I gave the files exactly the same names as in the example and put them in /tmp/nmt_data.
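One quick check before training is whether every file the command refers to actually exists. The sketch below assumes each file name is formed as `<prefix>.<language>` (for example `train.ro`), which matches the prefixes and `--src`/`--tgt` values in the command above but is an assumption about how the flags are resolved. `expected_files` and `missing_files` are hypothetical helper names.

```python
import os
from typing import List, Sequence

# Prefixes taken from the command above.
DEFAULT_PREFIXES = ("vocab", "train", "tst2012", "tst2013")


def expected_files(data_dir: str, src: str, tgt: str,
                   prefixes: Sequence[str] = DEFAULT_PREFIXES) -> List[str]:
    """List the paths assumed to be required: one per prefix and language."""
    return [os.path.join(data_dir, f"{prefix}.{lang}")
            for prefix in prefixes
            for lang in (src, tgt)]


def missing_files(data_dir: str, src: str, tgt: str,
                  prefixes: Sequence[str] = DEFAULT_PREFIXES) -> List[str]:
    """Return the expected paths that do not exist as regular files."""
    return [path for path in expected_files(data_dir, src, tgt, prefixes)
            if not os.path.isfile(path)]


if __name__ == "__main__":
    for path in missing_files("/tmp/nmt_data", "ro", "en"):
        print("missing:", path)
```

If this prints nothing, the naming is at least consistent with the command, and the cause of the error is more likely in the file contents, such as the vocabulary or the encoding.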
I also do not understand why the vocab.en and vocab.vi files (from https://nlp.stanford.edu/projects/nmt/) do not have 50k lines, given that they are supposed to contain the 50k most frequent words. The ones I made for Romanian and English have 50k lines each. Could this be the issue?
Thanks in advance!
Hi @Alegzandra, how is your progress on training this model from Romanian text to English text? I am preparing data for training an English-to-Chinese model. How did you extract the 50k words?