
What is the target BLEU score for machine translation

Open zhangtemplar opened this issue 8 years ago • 13 comments

Hi,

I am running the machine translation example, but the best BLEU score reported is 7.71 after more than 400,000 training iterations. The paper "Neural Machine Translation by Jointly Learning to Align and Translate" reports 17.82 or 26.75.

Any idea on that?

zhangtemplar avatar Mar 24 '16 00:03 zhangtemplar

The results reported in the mentioned paper are BLEU scores of the trained models computed on the test set for the English-French language pair. As you can see in prepare_data.py here, the language pair is Cs-En. Obviously, a different language pair leads to different results. If you want to replicate the results in the paper, you have to ensure all of your parameters and settings are the same.

papar22 avatar Mar 24 '16 15:03 papar22

@papar22 yes, but I also checked the website hosting the data (http://www.statmt.org/wmt15/translation-task.html), which has BLEU statistics for all language pairs over all datasets. Cs to En still scores over 25 there.

zhangtemplar avatar Mar 24 '16 17:03 zhangtemplar

@zhangtemplar, here are the reasons why you don't see state-of-the-art BLEU scores by running the example:

  1. This example does not use the entire cs-en corpus but only a small chunk of it. The entire corpus has 12M parallel sentences, but the provided prepare-data script only downloads a subset (news-commentary-v10) with only about 150K parallel sentences.
  2. State-of-the-art systems use a lot of additional methods/tricks, such as large vocabularies, ensembles, unk-replacement, language models, rescoring, etc., which are not implemented here in this example.

In short: download the entire dataset, play with the hyper-parameters, and give it some time :)
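A quick way to verify point 1 for yourself is to count the parallel sentence pairs in the prepared files. This is a minimal sketch (the paths are hypothetical placeholders, not the actual names prepare_data.py produces):

```python
def count_parallel_pairs(src_path, trg_path):
    """Count sentence pairs in a bitext, checking that the two sides align."""
    with open(src_path, encoding="utf-8") as src, \
         open(trg_path, encoding="utf-8") as trg:
        n_src = sum(1 for _ in src)
        n_trg = sum(1 for _ in trg)
    if n_src != n_trg:
        raise ValueError("misaligned bitext: %d vs %d lines" % (n_src, n_trg))
    return n_src

# e.g. count_parallel_pairs("train.cs.tok", "train.en.tok")
```

Run on the news-commentary subset this should report roughly 150K pairs, versus ~12M for the full corpus.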

orhanf avatar Mar 24 '16 17:03 orhanf

I tested with En-Fr using the same data as in the paper and got 26.57 (the paper reports 26.75 after 5 days) after 5 days 8 hours on a GPU (7-80k iterations/day). Just FYI.

tnq177 avatar Apr 03 '16 16:04 tnq177

@orhanf thanks for your explanation. But I still don't see why the difference is so huge, even accounting for less data and fewer tricks.

However, I realized they may use a different metric for BLEU. I found the code reports five values, namely BLEU-1, BLEU-2, BLEU-3, BLEU-4, and finally BLEU, which looks like a geometric mean of the previous four. For BLEU-1 I can get something close to 30.
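For context, standard BLEU combines the n-gram precisions as a geometric mean (equivalently, the exponential of their average log), scaled by a brevity penalty, which is why the combined score sits well below BLEU-1. A minimal sketch of that combination (the precision numbers below are made up for illustration):

```python
import math

def combine_bleu(precisions, brevity_penalty=1.0):
    """Combine n-gram precisions (fractions in [0, 1]) into one BLEU score.

    Standard BLEU with uniform weights: BP * exp(mean of log precisions),
    i.e. a geometric mean scaled by the brevity penalty.
    """
    if any(p <= 0 for p in precisions):
        return 0.0  # any zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

# Illustrative numbers only: BLEU-1..BLEU-4 precisions of 30%, 12%, 6%, 3%
score = combine_bleu([0.30, 0.12, 0.06, 0.03])
```

With numbers like these, a BLEU-1 near 30 is perfectly consistent with a much lower combined BLEU.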

@tnq177 for 26.57, are you referring to BLEU or BLEU-1?

zhangtemplar avatar Apr 03 '16 22:04 zhangtemplar

@zhangtemplar BLEU

tnq177 avatar Apr 03 '16 22:04 tnq177

@tnq177 interesting. I may need to try en-fr instead.

zhangtemplar avatar Apr 04 '16 06:04 zhangtemplar

@tnq177 Did you just use the code provided by TensorFlow? Can you share the details of the model (TensorFlow version, stack size, layer size, sampled_loss sample size, vocab_size, etc.)? Thanks

yanghoonkim avatar Jun 20 '16 00:06 yanghoonkim

@ad26kt no, I used this blocks-examples machine-translation code. I didn't make any changes to the configuration, except using the En-Fr data as detailed in Bahdanau's paper.

tnq177 avatar Jun 20 '16 01:06 tnq177

@tnq177 oh, I thought this was the TensorFlow GitHub repo. Thanks for replying.

yanghoonkim avatar Jun 20 '16 05:06 yanghoonkim

@tnq177 can you tell me which part of prepare_data.py I should change to use the En-Fr data as detailed in Bahdanau's paper? I tried several ways, but they caused an error like:

File "prepare_data.py", line 135, in create_vocabularies
    if n.endswith(args.source)][0]]) + '.tok'
IndexError: list index out of range
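For what it's worth, that IndexError means the list comprehension found no downloaded file whose name ends with the configured source-language suffix, so indexing `[0]` into the empty result fails. A stripped-down reproduction of the pattern (file names and suffix here are illustrative, not the actual downloads):

```python
# Files on disk are for the cs-en pair, but the config asks for "fr"
files = ["news-commentary-v10.cs-en.cs", "news-commentary-v10.cs-en.en"]
source = "fr"

matches = [n for n in files if n.endswith(source)]
# matches is [] here, so matches[0] raises IndexError: list index out of range
```

So the likely fix is making sure the source/target language suffixes in your configuration match the names of the data files actually on disk.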

This is a bit urgent for me, so your help would be really appreciated.

yanghoonkim avatar Jun 21 '16 04:06 yanghoonkim

@ad26kt I just used that script as a reference. The steps:

  1. Get the data from http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/ (noted in Bahdanau's paper).
  2. Preprocess the data as the authors describe in the paper (only tokenization, I believe).
  3. Create the vocabulary files with GroundHog's preprocess.py (PREPROCESS_URL in prepare_data.py); the command should be something like python preprocess.py -d vocab_file_name.pkl -v number_of_unique_tokens_used train_file.
  4. Shuffle the training data files with the shuffle_parallel function in prepare_data.py.
  5. Write the correct configuration function in configurations.py with the correct paths to the train/dev/test/vocab files.

Basically, just follow the steps in prepare_data.py.
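If you'd rather not depend on GroundHog's preprocess.py, the vocabulary step can be sketched roughly as below. Note the pickle layout (a plain token-to-id dict) and the reserved special-token ids are assumptions here; check what the blocks-examples configuration actually expects before using it.

```python
import pickle
from collections import Counter

def build_vocab(train_file, vocab_file, n_tokens):
    """Build a token->id vocabulary from the most frequent tokens.

    The reserved ids (0=<s>, 1=</s>, 2=<unk>) are an assumption --
    match them to whatever indices your configuration uses.
    """
    counts = Counter()
    with open(train_file, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    vocab = {"<s>": 0, "</s>": 1, "<unk>": 2}
    for token, _ in counts.most_common(n_tokens - len(vocab)):
        vocab[token] = len(vocab)
    with open(vocab_file, "wb") as f:
        pickle.dump(vocab, f)
    return vocab

# e.g. build_vocab("train.en.tok", "vocab.en.pkl", 30000)
```

Every token outside the n_tokens most frequent then maps to <unk> at training time, which is exactly the large-vocabulary limitation mentioned earlier in this thread.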

tnq177 avatar Jun 21 '16 04:06 tnq177

@tnq177 Thanks a lot!

yanghoonkim avatar Jun 21 '16 05:06 yanghoonkim