
Cannot replicate T5 performance on WMT14

Open ekurtulus opened this issue 2 years ago • 6 comments

System Info

I am trying to replicate T5 finetuning on WMT with the following hyperparameters (as close as possible to the paper https://www.jmlr.org/papers/volume21/20-074/20-074.pdf):

--model_name_or_path t5-small --source_lang en --target_lang de --dataset_name stas/wmt14-en-de-pre-processed --max_source_length 512 --max_target_length 512 --val_max_target_length 512 --source_prefix="translate English to German: " --predict_with_generate --save_steps 5000 --eval_steps 5000 --learning_rate 0.001 --max_steps 262144 --optim adafactor --lr_scheduler_type constant --gradient_accumulation_steps 2 --per_device_train_batch_size 64

However, the best BLEU score I get is around 13, whereas the BLEU reported in the paper is around 27. Any suggestions on how to fix this?

Script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py

Environment:

  • transformers version: 4.20.1
  • Platform: Linux-4.18.0-348.el8.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.4
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes - A100
  • Using distributed or parallel set-up in script?: No

Who can help?

@patrickvonplaten, @sgugger

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Use the script with the hyperparameters above: https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py

Expected behavior

BLEU score should be around 27.

ekurtulus avatar Aug 02 '22 14:08 ekurtulus

Hey @ekurtulus, such a low BLEU score indeed looks suspicious! Do you have any training stats / logs / graphs to share?

patrickvonplaten avatar Aug 23 '22 17:08 patrickvonplaten

Just a tip: it might be a good idea to save the predictions (here, the translations) during evaluation, so we can look at them and see what might be going wrong.

When saving the translations, it's better to save the source text and the label (target text) too. I do this manually, though; it is not directly available in the official training scripts.
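For reference, a minimal sketch of what that could look like on top of run_translation.py (this is not part of the official script; it assumes a Seq2SeqTrainer `trainer`, the `tokenizer`, and the tokenized `eval_dataset` are already in scope, and the output file name is arbitrary):

import json

import numpy as np

predict_results = trainer.predict(eval_dataset, max_length=128, num_beams=4)

# Generated ids may contain -100 padding; replace it before decoding.
pred_ids = np.where(
    predict_results.predictions != -100, predict_results.predictions, tokenizer.pad_token_id
)
decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

with open("eval_translations.jsonl", "w") as f:
    for example, pred in zip(eval_dataset, decoded_preds):
        # Drop the -100 loss-masking value from the labels, if present.
        label_ids = [t for t in example["labels"] if t != -100]
        record = {
            "source": tokenizer.decode(example["input_ids"], skip_special_tokens=True),
            "target": tokenizer.decode(label_ids, skip_special_tokens=True),
            "prediction": pred,
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")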

ydshieh avatar Aug 23 '22 18:08 ydshieh

Sorry for being late. I will take a look.

ydshieh avatar Sep 19 '22 09:09 ydshieh

> Hey @ekurtulus, such a low BLEU score indeed looks suspicious! Do you have any training stats / logs / graphs to share?

My experiments were run on an HPC system, and since it's been a while, I unfortunately no longer have the logs or the graphs.

ekurtulus avatar Sep 20 '22 22:09 ekurtulus

@patrickvonplaten @patil-suraj Do you know if --dataset_name stas/wmt14-en-de-pre-processed (which is pre-processed using a script from fairseq) is the right dataset for T5 (En -> German)?

T5 is from Google, and in the paper, I can't find any mention of fairseq. I think T5 doesn't use this particular pre-processing, but I am not 100% sure.
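A quick way to start checking could be to look at the first example of both datasets side by side. A minimal sketch, assuming the pre-processed dataset exposes the same translation column as the raw wmt14 release (I haven't verified this):

from datasets import load_dataset

# Compare the first training pair of the raw WMT14 release with the
# fairseq-pre-processed variant to see what the pre-processing changes.
raw = load_dataset("wmt14", "de-en", split="train[:1]")
pre = load_dataset("stas/wmt14-en-de-pre-processed", split="train[:1]")

print(raw[0]["translation"])
print(pre[0]["translation"])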

ydshieh avatar Sep 21 '22 08:09 ydshieh

@ekurtulus I also think the checkpoints t5-small, t5-base etc. have already been trained on the WMT / CNN DailyMail datasets, as shown in the code snippet below. So using those checkpoints to replicate the results (by finetuning on those datasets) doesn't really make sense IMO.

Code snippet

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

inputs = tokenizer(
    "translate English to German: I am a good student.",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

inputs = tokenizer(
    "translate English to French: I am a good student.",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
   """WASHINGTON (CNN) -- Doctors removed five small polyps from President Bush's colon on Saturday, and "none appeared worrisome," a White House spokesman said. The polyps were removed and sent to the National Naval Medical Center in Bethesda, Maryland, for routine microscopic examination, spokesman Scott Stanzel said. Results are expected in two to three days. All were small, less than a centimeter [half an inch] in diameter, he said. Bush is in good humor, Stanzel said, and will resume his activities at Camp David. During the procedure Vice President Dick Cheney assumed presidential power. Bush reclaimed presidential power at 9:21 a.m. after about two hours. Doctors used "monitored anesthesia care," Stanzel said, so the president was asleep, but not as deeply unconscious as with a true general anesthetic. He spoke to first lady Laura Bush -- who is in Midland, Texas, celebrating her mother's birthday -- before and after the procedure, Stanzel said. Afterward, the president played with his Scottish terriers, Barney and Miss Beazley, Stanzel said. He planned to have lunch at Camp David and have briefings with National Security Adviser Stephen Hadley and White House Chief of Staff Josh Bolten, and planned to take a bicycle ride Saturday afternoon. Cheney, meanwhile, spent the morning at his home on Maryland's eastern shore, reading and playing with his dogs, Stanzel said. Nothing occurred that required him to take official action as president before Bush reclaimed presidential power. The procedure was supervised by Dr. Richard Tubb, Bush's physician, and conducted by a multidisciplinary team from the National Naval Medical Center in Bethesda, Maryland, the White House said. Bush's last colonoscopy was in June 2002, and no abnormalities were found, White House spokesman Tony Snow said. The president's doctor had recommended a repeat procedure in about five years. A colonoscopy is the most sensitive test for colon cancer, rectal cancer and polyps, small clumps of cells that can become cancerous, according to the Mayo Clinic. Small polyps may be removed during the procedure. Snow said on Friday that Bush had polyps removed during colonoscopies before becoming president. Snow himself is undergoing chemotherapy for cancer that began in his colon and spread to his liver. Watch Snow talk about Bush's procedure and his own colon cancer » . "The president wants to encourage everybody to use surveillance," Snow said. The American Cancer Society recommends that people without high risk factors or symptoms begin getting screened for signs of colorectal cancer at age 50. E-mail to a friend ."""
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Outputs

Ich bin ein guter Student.
Je suis un bon étudiant.
five small polyps were removed from president Bush's colon on Saturday. none of the polyps appeared worrisome, a white house spokesman said. During the procedure, vice president Dick Cheney assumed presidential power.

ydshieh avatar Sep 21 '22 11:09 ydshieh

> @patrickvonplaten @patil-suraj Do you know if --dataset_name stas/wmt14-en-de-pre-processed (which is pre-processed using a script from fairseq) is the right dataset for T5 (En -> German)?
>
> T5 is from Google, and in the paper, I can't find any mention of fairseq. I think T5 doesn't use this particular pre-processing, but I am not 100% sure.

The fairseq-preprocessed version is suggested in the official repository.

ekurtulus avatar Sep 26 '22 05:09 ekurtulus

> @ekurtulus I also think the checkpoints t5-small, t5-base etc. have already been trained on the WMT / CNN DailyMail datasets, as shown in the code snippet above. So using those checkpoints to replicate the results (by finetuning on those datasets) doesn't really make sense IMO.

Which checkpoint should I use, then?

ekurtulus avatar Sep 26 '22 05:09 ekurtulus

> @patrickvonplaten @patil-suraj Do you know if --dataset_name stas/wmt14-en-de-pre-processed (which is pre-processed using a script from fairseq) is the right dataset for T5 (En -> German)? T5 is from Google, and in the paper, I can't find any mention of fairseq. I think T5 doesn't use this particular pre-processing, but I am not 100% sure.
>
> The fairseq-preprocessed version is suggested in the official repository.

I think my colleagues @patil-suraj and @patrickvonplaten are the best people to answer this question. The training script works with several models (T5, BART, etc.). BART is from facebook/fairseq (so it probably used the pre-processed dataset), but T5 is from Google. I am not 100% sure whether the combination of stas/wmt14-en-de-pre-processed + T5 is the best choice for comparing against the performance of the original T5 checkpoint (which seems to have already been trained on the translation task).

If you would like to, one thing you could try is to measure the T5 checkpoint's performance on the original WMT14 dataset without any finetuning, and probably on the pre-processed dataset version too. From there, we might get a better idea.
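For what it's worth, a rough sketch of such a zero-shot measurement could look like the following (the slice size, generation settings, and the use of the evaluate library are my own illustrative choices, not something prescribed by the thread or the paper):

import evaluate
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
bleu = evaluate.load("sacrebleu")

# Score the pretrained checkpoint on a small WMT14 validation slice without any finetuning.
# Swap in "stas/wmt14-en-de-pre-processed" to compare against the pre-processed variant.
dataset = load_dataset("wmt14", "de-en", split="validation[:200]")

predictions, references = [], []
for example in dataset:
    inputs = tokenizer(
        "translate English to German: " + example["translation"]["en"],
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_length=128, num_beams=4)
    predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    references.append([example["translation"]["de"]])

print(bleu.compute(predictions=predictions, references=references))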

ydshieh avatar Sep 26 '22 09:09 ydshieh

Note that we cannot guarantee perfect replication of every result reported in each model's respective paper. Given the extremely low results of your training, though, there is probably a bug.

Here I'd suggest trying out different learning rates and learning rate schedulers (e.g. --lr_scheduler_type constant looks weird to me; I think a linear decrease makes more sense). Also note that the original model was trained on TPU with TensorFlow in bfloat16, whereas here we're training on GPU with PyTorch. Good that you have an A100 - could you try simply using:

  • AdamW (not Adafactor, as we don't have the official implementation)
  • linear warmup + linear decay for the learning rate scheduler

instead?
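For reference, those two changes could be expressed as run_translation.py flags roughly like this, with all other flags kept as in the original command (the concrete learning rate and warmup step count below are illustrative guesses, not values recommended in this thread):

--optim adamw_torch --lr_scheduler_type linear --warmup_steps 10000 --learning_rate 1e-4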

patrickvonplaten avatar Sep 27 '22 11:09 patrickvonplaten

Agree with @patrickvonplaten, especially for the AdamW optimizer.

I think the Hugging Face Forums would be a better place for this question, if you want to post there too. If a bug (say, in the model or in the training script) is found, don't hesitate to report it here :-)

ydshieh avatar Sep 27 '22 16:09 ydshieh

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 22 '22 15:10 github-actions[bot]

Hi, I am trying to reproduce the performance of Transformer-base (from "Attention Is All You Need") on WMT14. I am using FSMT because I cannot find an implementation of the original Transformer. I was wondering which dataset and tokenizer are the best choices:

  1. stas/wmt14-en-de-pre-processed with facebook/wmt19-en-de
  2. wmt14 with facebook/wmt19-en-de

In particular, I do not know which tokenizer should be used.

Thanks in advance if you could provide some suggestions!

shizhediao avatar Nov 29 '22 12:11 shizhediao

unstale

shizhediao avatar Nov 29 '22 12:11 shizhediao