Transformer-PyTorch
A PyTorch implementation of Attention is all you need
Introduction
This project provides a PyTorch implementation of Attention Is All You Need, based on fairseq-py (an official toolkit from Facebook Research). You can also use the official implementation of Attention Is All You Need from tensor2tensor.
If you use the CNN-related code, please cite:
@inproceedings{gehring2017convs2s,
author = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
title = "{Convolutional Sequence to Sequence Learning}",
booktitle = {Proc. of ICML},
year = 2017,
}
And if you use the Transformer-related code, please cite:
@inproceedings{46201,
title = {Attention is All You Need},
author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
year = {2017},
booktitle = {Proc. of NIPS},
}
We are grateful for the contributions of Facebook Research and Google Research. If you find this repository useful, please give it a star.
Details
How to install Transformer-PyTorch
You first need to install PyTorch >= 0.4.0 and Python 3.6. Then run:
pip install -r requirements.txt
python setup.py build
python setup.py develop
To generate the binarized data, please follow the scripts under data/; a run script is provided for IWSLT14.
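As a rough sketch, assuming this repository keeps the fairseq-py style preprocess.py interface, binarizing IWSLT14 might look like the commands below; the file prefixes and destination directory are illustrative, so defer to the actual run script under data/.
# tokenized/BPE'd train/valid/test files are assumed to exist under data/iwslt14/
python preprocess.py --source-lang de --target-lang en \
  --trainpref data/iwslt14/train --validpref data/iwslt14/valid --testpref data/iwslt14/test \
  --destdir data-bin/iwslt14.de-en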
Results
IWSLT14 German-English
This dataset contains 160K training sentences. We recommend the transformer_small
setting. The beam size is set to 5. The results are as follows:
Vocabulary | BLEU |
---|---|
10K joint sub-words | 31.06 |
25K joint sub-words | 32.12 |
Please try more checkpoints, not only the last one.
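For reference, training and decoding with the fairseq-py style CLI might look like the sketch below. The architecture name transformer_small and beam size 5 come from the description above, but the data directory, token batch size, and checkpoint paths are illustrative assumptions rather than the exact recipe behind the table.
# train a small Transformer on the binarized IWSLT14 data
python train.py data-bin/iwslt14.de-en --arch transformer_small \
  --max-tokens 4000 --save-dir checkpoints/iwslt14_small
# decode with beam size 5 and strip BPE before scoring BLEU
python generate.py data-bin/iwslt14.de-en \
  --path checkpoints/iwslt14_small/checkpoint_best.pt --beam 5 --remove-bpe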
NIST Chinese-English
This dataset contains 1.25M training sentences. We learn a 25K subword vocabulary for the source and target languages respectively, and adopt the transformer_base
model setting. The results are as follows:
Setting | MT04 | MT05 | MT06 | MT08 | MT12 |
---|---|---|---|---|---|
Beam=10 | 40.67 | 40.57 | 38.77 | 32.26 | 31.04 |
WMT14 English-German
This dataset contains 4.5M sentence pairs.
Model Setting | BLEU |
---|---|
transformer_big | 28.48 |
WMT14 English-French
This dataset contains 36M sentence pairs. We learned a 40K BPE vocabulary for English and French. The beam size is 5, and 8 GPUs are used for this task. For the base model setting, the batch size is 4000 per GPU.
Steps | BLEU |
---|---|
20K | 34.42 |
50K | 37.14 |
120K | 38.72 |
170K | 39.06 |
210K | 39.30 |
For the big model, the batch size is 3072 per GPU. The results are as follows:
Steps | BLEU |
---|---|
55K | 38.00 |
110K | 39.44 |
160K | 40.21 |
270K | 40.46 |
300K | 40.76 |
Due to limited resources, we ran the big model setting only once and did not tune more hyperparameters such as the learning rate. You can likely achieve better performance if you have more GPUs.
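As an illustration of the multi-GPU setup described for this task, the launch might look like the sketch below, assuming the early fairseq-py behaviour where a single train.py process uses all visible GPUs and --max-tokens is the per-GPU batch size; the data directory and flag values are assumptions, not the exact recipe behind the tables above.
# 8 visible GPUs; --max-tokens approximates the per-GPU batch size of the base setting
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py data-bin/wmt14.en-fr \
  --arch transformer_base --max-tokens 4000 --save-dir checkpoints/wmt14_enfr_base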
Note
- This project is maintained by a single person, so there may still be some minor issues in code style.
- Instead of Adam, NAG is used as the default optimizer; we find this optimization method can also produce good performance (see the sketch after this list).
- If you have more suggestions for improving this project, please leave a message under issues.
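A minimal sketch of switching to NAG, assuming the fairseq-py style optimizer flags; the learning rate and momentum values are placeholders, not the settings used for the reported results.
# NAG (Nesterov accelerated gradient) instead of Adam; values are illustrative
python train.py data-bin/iwslt14.de-en --arch transformer_small \
  --optimizer nag --lr 0.25 --momentum 0.99 --clip-norm 0.1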
Many of our works are built upon this project, including:
- Double Path Networks for Sequence to Sequence Learning, (COLING 2018)
- Other submitted papers.
License
fairseq is BSD-licensed. The released code that modifies the original fairseq is BSD-licensed. The rest of the code is MIT-licensed.