nmt issue when training on Romanian language

Open Alegzandra opened this issue 5 years ago • 1 comments

Hi. I have tried to train an NMT model using files made to look exactly like the ones in the English-Vietnamese example. I made one set of files for training (one Romanian, one English) and two sets of files for testing (two Romanian, two English). The data was downloaded from http://www.statmt.org/wmt16/translation-task.html, the link given in the tutorial. I also made a set of vocabularies by selecting the 50k most frequent words from the training data. But when I try to train on this data, I get what I assume is a TensorFlow error. I installed tf-nightly, as written in the tutorial. With the same sets of files for en-vi it works, but for en-ro it doesn't. Can anyone help with this issue? What else should I change, besides the command:

python -m nmt.nmt \
    --src=ro --tgt=en \
    --vocab_prefix=/tmp/nmt_data/vocab \
    --train_prefix=/tmp/nmt_data/train \
    --dev_prefix=/tmp/nmt_data/tst2012 \
    --test_prefix=/tmp/nmt_data/tst2013 \
    --out_dir=/tmp/nmt_model \
    --num_train_steps=12000 \
    --steps_per_stats=100 \
    --num_layers=2 \
    --num_units=128 \
    --dropout=0.2 \
    --metrics=bleu

I named the files exactly the same as in the example and put them in /tmp/nmt_data.

I also do not understand why the vocab.en and vocab.vi files (from https://nlp.stanford.edu/projects/nmt/) do not have 50k lines, given that they are supposed to contain the 50k most frequent words. The ones I made for Romanian have 50k lines. Maybe this is the issue?

Thanks in advance!

Apr 03 '19 09:04 Alegzandra
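
Since the post does not include the error text, one cheap first step is to rule out data problems. The sketch below is only an illustration, assuming each prefix in the command resolves to `<prefix>.<lang>` (as the en-vi example files such as `train.en` and `train.vi` suggest). The helper names `count_lines` and `check_parallel` are hypothetical and not part of the nmt package.

```python
import os


def count_lines(path):
    # Reading as UTF-8 also surfaces encoding problems: a file that is not
    # valid UTF-8 raises UnicodeDecodeError here.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(data_dir, src, tgt, prefixes=("train", "tst2012", "tst2013")):
    """Return a list of problems found; an empty list means the checks passed.

    Assumes files are named <prefix>.<lang>, e.g. train.ro and train.en.
    """
    problems = []
    for prefix in prefixes:
        src_path = os.path.join(data_dir, f"{prefix}.{src}")
        tgt_path = os.path.join(data_dir, f"{prefix}.{tgt}")
        missing = [p for p in (src_path, tgt_path) if not os.path.isfile(p)]
        if missing:
            problems.extend(f"missing file: {p}" for p in missing)
            continue
        n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
        # Parallel corpora must be aligned line by line.
        if n_src != n_tgt:
            problems.append(f"{prefix}: {n_src} {src} lines vs {n_tgt} {tgt} lines")
    return problems
```

Calling `check_parallel("/tmp/nmt_data", "ro", "en")` would report missing files or mismatched line counts for the paths used in the command above. It does not diagnose TensorFlow errors themselves.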

Hi @Alegzandra, how is your progress on training this model from Romanian text to English text? I am preparing data for training an English-to-Chinese model. How did you extract the 50k words?

Oct 21 '21 17:10 Anyixu
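
On the vocabulary questions above, here is a minimal sketch of one way to extract the most frequent words from a training file. It is not the method used to produce the Stanford vocab files. It assumes the text is already tokenized so that splitting on whitespace is enough, and that the vocab file should start with the special tokens `<unk>`, `<s>`, `</s>`, which is worth verifying against the en-vi vocab files. The names `build_vocab` and `write_vocab` are hypothetical.

```python
from collections import Counter

# Assumption: the special tokens expected at the top of the vocab file.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_vocab(lines, max_words=50000, min_count=1):
    """Return special tokens followed by up to max_words most frequent words."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # whitespace tokenization (assumption)
    for tok in SPECIAL_TOKENS:
        counts.pop(tok, None)  # avoid listing a special token twice
    # most_common sorts by count, highest first.
    words = [w for w, c in counts.most_common(max_words) if c >= min_count]
    return SPECIAL_TOKENS + words


def write_vocab(train_path, vocab_path, max_words=50000, min_count=1):
    """Write one token per line to vocab_path and return the vocab size."""
    with open(train_path, encoding="utf-8") as f:
        vocab = build_vocab(f, max_words, min_count)
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    return len(vocab)
```

For example, `write_vocab("/tmp/nmt_data/train.ro", "/tmp/nmt_data/vocab.ro")` would build the Romanian vocabulary from the training file named in the command. Note that a corpus with fewer than `max_words` distinct tokens, or a `min_count` above 1, produces a file shorter than 50k lines. That may be why the en-vi vocab files are shorter, though this is a guess.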
