problem while training

Open zoe9823 opened this issue 2 years ago • 9 comments

./train.sh
Traceback (most recent call last):
  File "config.py", line 49, in <module>
    external_validation_script='validate.sh')
TypeError: train() got an unexpected keyword argument 'saveto'

zoe9823 avatar Feb 14 '22 21:02 zoe9823

The argument saveto has been renamed to model in newer versions of Nematus; you can change this in config.py.
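
Editing config.py by hand is the direct fix. For scripts that still pass the old keyword, a minimal sketch of translating it before the call is shown here. It is an illustration only: rename_legacy_kwargs is a hypothetical helper, not part of Nematus, the only rename it knows is saveto to model, and the path in the example is a made-up value.

```python
# Hypothetical helper (not part of Nematus): rename old keyword arguments
# before passing them on to a newer train() function.
LEGACY_RENAMES = {"saveto": "model"}  # the one rename discussed in this thread


def rename_legacy_kwargs(kwargs, renames=LEGACY_RENAMES):
    """Return a copy of kwargs with legacy keys replaced by their new names."""
    updated = {}
    for key, value in kwargs.items():
        new_key = renames.get(key, key)
        if new_key in updated:
            # Both the old and the new spelling were supplied; refuse to guess.
            raise ValueError(f"both old and new name given for '{new_key}'")
        updated[new_key] = value
    return updated


# Example values are invented for illustration.
old_style = {"saveto": "model/model.npz", "external_validation_script": "validate.sh"}
new_style = rename_legacy_kwargs(old_style)
print(new_style)
```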

Unless you specifically want to reproduce our WMT16 systems, I recommend that you have a look at https://github.com/EdinburghNLP/wmt17-scripts/tree/master/training , which uses better hyperparameters, or even https://github.com/EdinburghNLP/wmt17-transformer-scripts , which shows how to train a Transformer with Nematus.

rsennrich avatar Feb 15 '22 15:02 rsennrich

Thank you very much for your reply. I want to train a factored neural machine translation model with source-side POS tags. Can I do that with either of the two newer setups you mentioned above?

zoe9823 avatar Feb 15 '22 15:02 zoe9823

yes, this is possible.

You can do preprocessing as instructed in this repository (or with your own modifications). To then train the actual model, I suggest you start with this config https://github.com/EdinburghNLP/wmt17-transformer-scripts/blob/master/training/scripts/train.sh and make some additions to define the number of factors and the embedding size per factor (for example --factors 5 --dim_per_factor 448 16 16 16 16) - the actual number of factors depends on your data, and the embedding size per factor should depend on the vocabulary size of each factor. You'll probably also want to remove the argument --tie_encoder_decoder_embeddings.
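
A quick sanity check of these two settings can be sketched as follows. It assumes that one embedding size is needed per factor and that the per-factor sizes are meant to sum to the total embedding size; this is consistent with the example values (448 + 4 × 16 = 512) but should be confirmed against Nematus's own argument validation. check_factor_dims is a hypothetical helper, not a Nematus function.

```python
def check_factor_dims(factors, dim_per_factor, embedding_size):
    """Return a list of problems with the factor settings (empty if none).

    Assumption: the per-factor embedding sizes must add up to the total
    embedding size, and there must be exactly one size per factor.
    """
    problems = []
    if len(dim_per_factor) != factors:
        problems.append(
            f"{len(dim_per_factor)} sizes given for {factors} factors"
        )
    if sum(dim_per_factor) != embedding_size:
        problems.append(
            f"sizes sum to {sum(dim_per_factor)}, expected {embedding_size}"
        )
    return problems


# The example values from the comment above, with --embedding_size 512.
example_problems = check_factor_dims(5, [448, 16, 16, 16, 16], 512)
print(example_problems or "settings look consistent")
```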

rsennrich avatar Feb 15 '22 16:02 rsennrich

I am trying https://github.com/EdinburghNLP/wmt17-transformer-scripts with the sample files, but running training/scripts/train.sh fails with:

INFO: Initializing model parameters from scratch...
INFO: Done
INFO: Reading data...
INFO: Done
INFO: Initial uidx=0
INFO: Starting epoch 0
Traceback (most recent call last):
  File "/home/zoe/nematus//nematus/train.py", line 522, in <module>
    train(config, sess)
  File "/home/zoe/nematus//nematus/train.py", line 212, in train
    if len(source_sents[0][0]) != config.factors:
IndexError: list index out of range

zoe9823 avatar Feb 16 '22 12:02 zoe9823

can you show the first line of your (source-side) training data, and your training arguments? It is likely that the number of actual factors your data contains is different from the number you specified as training argument.
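
One way to check this is to count the factors per token in the source file. The sketch below assumes that tokens are separated by whitespace and that factors within a token are separated by "|", as in the Nematus factored input format; factor_report is a hypothetical helper and the sample lines are invented. It also flags empty lines, although whether those cause the error above is not established here.

```python
def factor_report(lines, expected_factors, separator="|"):
    """Yield (line_number, problem) for lines that do not match expected_factors."""
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            yield number, "empty line"
            continue
        # Set of distinct factor counts seen among the tokens of this line.
        counts = {len(token.split(separator)) for token in tokens}
        if counts != {expected_factors}:
            yield number, f"tokens with {sorted(counts)} factor(s), expected {expected_factors}"


# Invented sample: line 1 has no extra factors, line 2 is word|POS, line 3 is empty.
lines = ["resumption of the session", "the|DT session|NN", ""]
problems = list(factor_report(lines, expected_factors=2))
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check a real file, pass an open file object instead of the list, for example the file given as --source_dataset or --valid_source_dataset.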

rsennrich avatar Feb 16 '22 13:02 rsennrich

corpus.bpe.en:

resumption of the session
I declare resumed the session of the European Parliament ad@@ jour@@ ned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant fes@@ tive period .
although , as you will have seen , the dre@@ aded ' millennium bug ' failed to materi@@ alise , still the people in a number of countries suffered a series of natural disasters that truly were dre@@ ad@@ ful .
you have requested a debate on this subject in the course of the next few days , during this part @-@ session .
in the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible stor@@ ms , in the various countries of the European Union .
please rise , then , for this minute ' s silence .
( the House rose and observed a minute ' s silence )
Madam President , on a point of order .

.....

training/scripts/train.sh:

#!/usr/bin/env sh
# Distributed under MIT license

script_dir=`dirname $0`
main_dir=$script_dir/../
data_dir=$main_dir/data
working_dir=$main_dir/model

# variables (toolkits; source and target language)
. $main_dir/vars

# TensorFlow devices; change this to control the GPUs used by Nematus.
# It should be a list of GPU identifiers. For example, '1' or '0,1,3'
devices=0

# Training command that closely follows the 'base' configuration from the
# paper
#
#  "Attention is All you Need" in Advances in Neural Information Processing
#  Systems 30, 2017. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
#  Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.
#
# Depending on the size and number of available GPUs, you may need to adjust
# the token_batch_size parameter. The command used here was tested on a
# machine with four 12 GB GPUS.
CUDA_VISIBLE_DEVICES=$devices python3 $nematus_home/nematus/train.py \
    --source_dataset $data_dir/corpus.bpe.$src \
    --target_dataset $data_dir/corpus.bpe.$trg \
    --dictionaries $data_dir/corpus.bpe.both.json \
                   $data_dir/corpus.bpe.both.json \
    --save_freq 30000 \
    --model $working_dir/model \
    --reload latest_checkpoint \
    --model_type transformer \
    --embedding_size 512 \
    --state_size 512 \
    --tie_encoder_decoder_embeddings \
    --tie_decoder_embeddings \
    --loss_function per-token-cross-entropy \
    --label_smoothing 0.1 \
    --exponential_smoothing 0.0001 \
    --optimizer adam \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-09 \
    --learning_schedule transformer \
    --warmup_steps 4000 \
    --maxlen 100 \
    --batch_size 64 \
    --token_batch_size 64 \
    --valid_source_dataset $data_dir/newstest2013.bpe.$src \
    --valid_target_dataset $data_dir/newstest2013.bpe.$trg \
    --valid_batch_size 64 \
    --valid_token_batch_size 64 \
    --valid_freq 10000 \
    --valid_script $script_dir/validate.sh \
    --disp_freq 1000 \
    --sample_freq 0 \
    --beam_freq 0 \
    --beam_size 4 \
    --translation_maxlen 100 \
    --normalization_alpha 0.6

Errors:

$ training/scripts/train.sh 
INFO: Namespace(adam_beta1=0.9, adam_beta2=0.98, adam_epsilon=1e-09, batch_size=64, beam_freq=0, beam_size=4, clip_c=1.0, datasets=None, decay_c=0.0, dictionaries=['training/scripts/..//data/corpus.bpe.both.json', 'training/scripts/..//data/corpus.bpe.both.json'], dim_per_factor=[512], disp_freq=1000, embedding_size=512, exponential_smoothing=0.0001, factors=1, finish_after=10000000, gradient_aggregation_steps=1, keep_train_set_in_memory=False, label_smoothing=0.1, layer_normalization_type='layernorm', learning_rate=0.0001, learning_schedule='transformer', loss_function='per-token-cross-entropy', map_decay_c=0.0, max_epochs=5000, max_len_a=1.5, max_len_b=5, max_sentences_of_sampling=0, max_sentences_per_device=0, max_tokens_per_device=0, maxibatch_size=20, maxlen=100, model_type='transformer', model_version=0.2, mrt_alpha=0.005, mrt_loss='SENTENCEBLEU n=4', mrt_ml_mix=0, mrt_reference=False, n_best=False, normalization_alpha=0.6, optimizer='adam', output_hidden_activation='tanh', patience=10, plateau_steps=0, preprocess_script=None, print_per_token_pro=False, prior_model=None, reload='latest_checkpoint', reload_training_progress=True, rnn_dec_base_transition_depth=2, rnn_dec_deep_context=False, rnn_dec_depth=1, rnn_dec_high_transition_depth=1, rnn_dropout_embedding=0.0, rnn_dropout_hidden=0.0, rnn_dropout_source=0.0, rnn_dropout_target=0.0, rnn_enc_depth=1, rnn_enc_transition_depth=1, rnn_layer_normalization=False, rnn_lexical_model=False, rnn_use_dropout=True, sample_freq=0, sample_way='beam_search', samplesN=100, sampling_temperature=1.0, save_freq=30000, saveto='training/scripts/..//model/model', shuffle_each_epoch=True, softmax_mixture_size=1, sort_by_length=True, source_dataset='training/scripts/..//data/corpus.bpe.en', source_dicts=['training/scripts/..//data/corpus.bpe.both.json'], source_vocab_sizes=[43939], state_size=512, summary_dir=None, summary_freq=0, target_dataset='training/scripts/..//data/corpus.bpe.de', 
target_dict='training/scripts/..//data/corpus.bpe.both.json', target_embedding_size=512, target_vocab_size=43939, theano_compat=False, tie_decoder_embeddings=True, tie_encoder_decoder_embeddings=True, token_batch_size=64, transformer_dec_depth=6, transformer_drophead=0.0, transformer_dropout_attn=0.1, transformer_dropout_embeddings=0.1, transformer_dropout_relu=0.1, transformer_dropout_residual=0.1, transformer_enc_depth=6, transformer_ffn_hidden_size=2048, transformer_num_heads=8, translation_maxlen=100, translation_strategy='beam_search', valid_batch_size=64, valid_bleu_source_dataset='training/scripts/..//data/newstest2013.bpe.en', valid_datasets=None, valid_freq=10000, valid_script='training/scripts/validate.sh', valid_source_dataset='training/scripts/..//data/newstest2013.bpe.en', valid_target_dataset='training/scripts/..//data/newstest2013.bpe.de', valid_token_batch_size=64, warmup_steps=4000)
2022-02-16 21:09:39.684281: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2022-02-16 21:09:39.711584: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1696075000 Hz
2022-02-16 21:09:39.712095: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d112175480 executing computations on platform Host. Devices:
2022-02-16 21:09:39.712136: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2022-02-16 21:09:39.713597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-02-16 21:09:39.736321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Quadro P1000 major: 6 minor: 1 memoryClockRate(GHz): 1.4805
pciBusID: 0000:04:00.0
2022-02-16 21:09:39.736699: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-02-16 21:09:39.739282: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-02-16 21:09:39.740964: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-02-16 21:09:39.741917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-02-16 21:09:39.744172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-02-16 21:09:39.745849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-02-16 21:09:39.751311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-02-16 21:09:39.752582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-02-16 21:09:39.752654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-02-16 21:09:39.886828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-16 21:09:39.886870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2022-02-16 21:09:39.886885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2022-02-16 21:09:39.888632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3304 MB memory) -> physical GPU (device: 0, name: Quadro P1000, pci bus id: 0000:04:00.0, compute capability: 6.1)
2022-02-16 21:09:39.890741: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d1147d2700 executing computations on platform CUDA. Devices:
2022-02-16 21:09:39.890772: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Quadro P1000, Compute Capability 6.1
2022-02-16 21:09:39.892111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Quadro P1000 major: 6 minor: 1 memoryClockRate(GHz): 1.4805
pciBusID: 0000:04:00.0
2022-02-16 21:09:39.892164: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-02-16 21:09:39.892187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-02-16 21:09:39.892206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-02-16 21:09:39.892225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-02-16 21:09:39.892243: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-02-16 21:09:39.892262: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-02-16 21:09:39.892281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-02-16 21:09:39.893312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-02-16 21:09:39.893345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-16 21:09:39.893358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2022-02-16 21:09:39.893369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2022-02-16 21:09:39.894434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 3304 MB memory) -> physical GPU (device: 0, name: Quadro P1000, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO: Building model...
WARNING: From /home/zoe/anaconda3/envs/tfgpu/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2022-02-16 21:09:41.612292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Quadro P1000 major: 6 minor: 1 memoryClockRate(GHz): 1.4805
pciBusID: 0000:04:00.0
2022-02-16 21:09:41.612358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2022-02-16 21:09:41.612385: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2022-02-16 21:09:41.612405: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2022-02-16 21:09:41.612424: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2022-02-16 21:09:41.612444: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2022-02-16 21:09:41.612463: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2022-02-16 21:09:41.612483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-02-16 21:09:41.613522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
INFO: Initializing model parameters from scratch...
INFO: Done
INFO: Reading data...
INFO: Done
INFO: Initial uidx=0
INFO: Starting epoch 0
Traceback (most recent call last):
  File "/home/zoe/nematus//nematus/train.py", line 522, in <module>
    train(config, sess)
  File "/home/zoe/nematus//nematus/train.py", line 212, in train
    if len(source_sents[0][0]) != config.factors:
IndexError: list index out of range

I only changed the batch_size to 64, as I have only one GPU.

zoe9823 avatar Feb 16 '22 13:02 zoe9823

Now I am using my own corpus with https://github.com/EdinburghNLP/wmt17-transformer-scripts , but I get:

$ training/scripts/train.sh
ERROR: factors are not yet supported for the 'transformer' model type

Really sorry for so many questions ...

zoe9823 avatar Feb 16 '22 16:02 zoe9823

ok, I neglected to check, but factors aren't currently implemented for the transformer architecture. You can still look at https://github.com/EdinburghNLP/wmt17-scripts/tree/master/training to see how to use the RNN-version of Nematus with the command-line interface, and use the hyperparameters from there as a starting point.

For your previous entry, I'm a bit confused because you say that you want to use factored models, but your input doesn't contain any extra factors. It is your own responsibility to preprocess the data and pass it to Nematus in the right format: https://github.com/EdinburghNLP/nematus/blob/master/doc/factored_neural_machine_translation.md
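
As a rough illustration of that format, the sketch below joins each word with its tag using "|". It assumes a two-factor word|POS layout with one tag per token; to_factored_line is a hypothetical helper, the tags are invented, and the handling of subword (BPE) units is left out because it depends on the preprocessing scripts used.

```python
def to_factored_line(words, tags, separator="|"):
    """Join parallel lists of words and tags into one 'word|tag ...' line."""
    if len(words) != len(tags):
        raise ValueError(f"{len(words)} words but {len(tags)} tags")
    for item in (*words, *tags):
        # Each factor must be non-empty and free of the separator and whitespace,
        # otherwise the line could not be split back unambiguously.
        if not item or separator in item or any(ch.isspace() for ch in item):
            raise ValueError(f"invalid factor value: {item!r}")
    return " ".join(f"{word}{separator}{tag}" for word, tag in zip(words, tags))


factored = to_factored_line(["the", "session"], ["DT", "NN"])
print(factored)
```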

As to your batch size, training with small batches will actually cost you quite a bit of translation quality. You can use --max_sentences_per_device or --max_tokens_per_device to reduce the number of sentences/tokens processed at once without affecting the final quality (because the gradients will then be accumulated until the batch size is reached). More information here: https://github.com/EdinburghNLP/nematus/blob/master/doc/multi_gpu_training.md
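
The reason accumulation leaves the result unchanged can be seen in a toy example. The sketch below is not Nematus code: it uses a one-parameter model with a squared-error loss and plain Python, and only shows that summing gradients over small chunks and normalizing once gives the same value as processing the whole batch in one step.

```python
def grad_sum(w, xs, ys):
    """Sum over examples of d/dw (w * x - y) ** 2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_gradient(w, xs, ys):
    """Mean gradient computed over the whole batch at once."""
    return grad_sum(w, xs, ys) / len(xs)


def accumulated_gradient(w, xs, ys, max_per_step):
    """Mean gradient computed by accumulating over chunks of at most max_per_step."""
    total = 0.0
    for start in range(0, len(xs), max_per_step):
        end = start + max_per_step
        total += grad_sum(w, xs[start:end], ys[start:end])
    return total / len(xs)  # normalize once, by the full batch size


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = full_batch_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, max_per_step=2)
print(full, accumulated)
```

The per-step limit only changes how much is held in memory at a time; the update still reflects the full batch.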

rsennrich avatar Feb 16 '22 16:02 rsennrich

> For your previous entry, I'm a bit confused because you say that you want to use factored models, but your input doesn't contain any extra factors. It is your own responsibility to preprocess the data and pass it to Nematus in the right format: https://github.com/EdinburghNLP/nematus/blob/master/doc/factored_neural_machine_translation.md

Sorry, I didn't make myself clear. At first I wanted to use the data from download_files.sh; once that works, I will switch to my own corpus. I had already preprocessed my data into the word|POS format, following wmt16-scripts.

> As to your batch size, training with small batches will actually cost you quite a bit of translation quality. You can use --max_sentences_per_device or --max_tokens_per_device to reduce the number of sentences/tokens processed at once without affecting the final quality (because the gradients will then be accumulated until the batch size is reached). More information here: https://github.com/EdinburghNLP/nematus/blob/master/doc/multi_gpu_training.md

Thanks, that really helps. I added --max_sentences_per_device 1024 \ to train.sh.

> ok, I neglected to check, but factors aren't currently implemented for the transformer architecture. You can still look at https://github.com/EdinburghNLP/wmt17-scripts/tree/master/training to see how to use the RNN-version of Nematus with the command-line interface, and use the hyperparameters from there as a starting point.

Thanks, I have started trying wmt17-scripts. Training now starts, but it stops with this message:

2022-02-17 18:56:38.611359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO: [2022-02-17 19:05:14] Epoch: 0 Update: 1000 Loss/word: 6.757666225401321 Words/sec: 986.8154383790913 Sents/sec: 119.29339154103243
INFO: [2022-02-17 19:13:48] Epoch: 0 Update: 2000 Loss/word: 5.9732927486658935 Words/sec: 1030.8588122635615 Sents/sec: 124.43058756328645
INFO: [2022-02-17 19:22:21] Epoch: 0 Update: 3000 Loss/word: 5.711696204857109 Words/sec: 1030.4795891040014 Sents/sec: 124.87232427203912
INFO: [2022-02-17 19:30:54] Epoch: 0 Update: 4000 Loss/word: 5.538549520137736 Words/sec: 1029.0010922128727 Sents/sec: 124.61954312578668
INFO: Starting epoch 1
INFO: [2022-02-17 19:39:32] Epoch: 1 Update: 5000 Loss/word: 5.355917909553584 Words/sec: 1025.37964119183 Sents/sec: 123.52149395706236
Traceback (most recent call last):
  File "/home/zoe/nematus-master//nematus/train.py", line 522, in <module>
    train(config, sess)
  File "/home/zoe/nematus-master//nematus/train.py", line 301, in train
    valid_text_iterator)
  File "/home/zoe/nematus-master//nematus/train.py", line 397, in validate
    session, model, config, text_iterator, normalization_alpha=0.0)
  File "/home/zoe/nematus-master//nematus/train.py", line 477, in calc_cross_entropy_per_sentence
    if len(xx[0][0]) != config.factors:
IndexError: list index out of range

my config:

CUDA_VISIBLE_DEVICES=$device python $nematus_home/nematus/train.py \
    --model $working_dir/model \
    --source_dataset $data_dir/corpus.factorsbpe.$src \
    --target_dataset $data_dir/corpus.bpe.$trg \
    --factors 2 \
    --dim_per_factor 240 16 \
    --dim_word 256 \
    --dictionaries $data_dir/corpus.factorsbpe.$src.json \
                   $data_dir/corpus.factors.1.$src.json \
                   $data_dir/corpus.bpe.$trg.json \
    --valid_script $script_dir/validate.sh \
    --valid_source_dataset $data_dir/valid.factorsbpe.$src \
    --valid_target_dataset $data_dir/valid.bpe.$trg \
    --reload latest_checkpoint \
    --dim 128 \
    --lrate 0.0001 \
    --optimizer adam \
    --maxlen 50 \
    --max_sentences_per_device 1024 \
    --batch_size 64 \
    --valid_batch_size 32 \
    --valid_token_batch_size 32 \
    --validFreq 5000 \
    --dispFreq 1000 \
    --saveFreq 5000 \
    --sampleFreq 10000 \
    --tie_decoder_embeddings \
    --layer_normalisation \
    --dec_base_recurrence_transition_depth 8 \
    --enc_recurrence_transition_depth 4

My src corpus:

(|lq 1998|mj ལོ|tt འི་|gz ཟླ་|tt 6|mj ཚེས་|tt 1|mj ཉིན|tt )|lq །|lz 
དེང་དུས་|tt འཛམ་གླིང་|nn ཐོག་|ff ཆ་འཕྲིན་|nn ལག་རྩལ་|nn མཚོན་རྟགས་|nn གཙོ་བོ|nn ར་|nn གྱུར་པ|vi འི་|gz ཚན་|nn རྩལ་|nn ཉིན་རེ་བཞིན་|dp ཡར་ཐོན་|nv བྱུང་བ|vi །|lz 

zoe9823 avatar Feb 16 '22 18:02 zoe9823