jiant
Unable to reproduce XTREME numbers
Unable to Reproduce the XTREME numbers for xlm-roberta-large
We are unable to reproduce the XTREME benchmark numbers reported in the original paper. Examples for PAWSX and XNLI are provided here.
To Reproduce
- Branch: mainline
- Environment: 1 p4.8xlarge
- Hyperparams for
XNLI
--model_type $MODEL_TYPE \
--model_name_or_path $MODEL \
--train_language en \
--task_name xnli \
--do_train \
--do_eval \
--do_predict \
--gradient_accumulation_steps 4 \
--per_gpu_train_batch_size 64 \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--max_seq_length 128 \
--output_dir $SAVE_DIR/ \
--save_steps 500 \
--logging_steps 500 \
--eval_all_checkpoints \
--log_file 'train' \
--predict_languages "ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh" \
--save_only_best_checkpoint \
--overwrite_output_dir
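For reference when comparing against the paper's setup, here is a rough sketch of the effective batch size these flags imply. It assumes the common convention that the effective batch size is the per-GPU batch size times the accumulation steps times the number of GPUs. The GPU count is not stated in this issue, so `n_gpus` is a hypothetical parameter and the values looped over are only illustrative.

```python
def effective_batch_size(per_gpu_batch_size: int, grad_accum_steps: int, n_gpus: int) -> int:
    """Examples consumed per optimizer step, under the assumed convention."""
    # max(1, n_gpus) covers a CPU-only run, where n_gpus would be 0.
    return per_gpu_batch_size * grad_accum_steps * max(1, n_gpus)


# Values taken from the flags above: --per_gpu_train_batch_size 64,
# --gradient_accumulation_steps 4. The GPU counts are assumptions.
for n_gpus in (1, 4, 8):
    print(f"n_gpus={n_gpus}: effective batch size = {effective_batch_size(64, 4, n_gpus)}")
```

If the resulting number differs a lot from the one used for the published results, the learning rate of 2e-5 may not transfer directly.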
PAWSX
{
"jiant_task_container_config_path": "/home/ec2-user/jiant/xtreme-exp/runconfigs/pawsx.json",
"output_dir": "/home/ec2-user/jiant/xtreme-exp/runs/pawsx",
"hf_pretrained_model_name_or_path": "xlm-roberta-large",
"model_path": "/home/ec2-user/jiant/xtreme-exp/models/xlm-roberta-large/model/model.p",
"model_config_path": "/home/ec2-user/jiant/xtreme-exp/models/xlm-roberta-large/model/config.json",
"model_load_mode": "from_transformers",
"do_train": true,
"do_val": true,
"do_save": true,
"do_save_last": false,
"do_save_best": false,
"write_val_preds": false,
"write_test_preds": true,
"eval_every_steps": 1000,
"save_every_steps": 0,
"save_checkpoint_every_steps": 0,
"no_improvements_for_n_evals": 5,
"keep_checkpoint_when_done": false,
"force_overwrite": true,
"seed": 1146493838,
"learning_rate": 3e-05,
"adam_epsilon": 1e-08,
"max_grad_norm": 1.0,
"optimizer_type": "adam",
"no_cuda": false,
"fp16": false,
"fp16_opt_level": "O1",
"local_rank": -1,
"server_ip": "",
"server_port": ""
}
Results
"pawsx": {
"accuracy": {"de": 55.25,
"en": 54.65,
"es": 54.65,
"fr": 54.85,
"ja": 55.85,
"ko": 55.15,
"zh": 55.300000000000004},
"avg_accuracy": 55.1,
"avg_metric": 55.1},
This number is too low; we were expecting roughly 80%.
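A quick sanity check on the results above, as a minimal sketch: it recomputes the average and the spread across languages from the reported per-language accuracies. The point is that English, the training language, scores no better than the transfer languages, which suggests the problem lies in training or model loading rather than in cross-lingual transfer. That interpretation is a guess, not something confirmed here.

```python
from statistics import mean

# Per-language PAWSX accuracies copied from the results above
# (zh rounded from 55.300000000000004).
pawsx_accuracy = {
    "de": 55.25, "en": 54.65, "es": 54.65, "fr": 54.85,
    "ja": 55.85, "ko": 55.15, "zh": 55.30,
}

avg = mean(pawsx_accuracy.values())
spread = max(pawsx_accuracy.values()) - min(pawsx_accuracy.values())
# Positive means English (the training language) beats the average.
en_gap = pawsx_accuracy["en"] - avg

print(f"average accuracy: {avg:.2f}")
print(f"spread across languages: {spread:.2f}")
print(f"English minus average: {en_gap:.2f}")
```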
Similarly, for XNLI the numbers we are getting are far lower than those reported in the paper.
Is there something we are missing?
@zphang, mind taking a look?
Hi,
I believe the issue may be that the XLM-R weights were not being loaded correctly because of a recent update. I've made a PR that should address the issue (https://github.com/nyu-mll/jiant/pull/1329). Could you retry and let me know if it works?
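If it helps to confirm this independently of the PR, one generic way to spot weights that were silently left unloaded is to compare the parameter names the model expects with the names stored in the checkpoint. The sketch below is plain Python and is not jiant's actual loading code; the key names in the example are hypothetical and only illustrate a prefix mismatch.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, each sorted."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    # Parameters the model has but the checkpoint never fills;
    # these would stay at their random initialization.
    missing = sorted(model_keys - checkpoint_keys)
    # Checkpoint entries that match no model parameter and get ignored.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical key names, for illustration only.
model_keys = [
    "encoder.embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.self.query.weight",
]
checkpoint_keys = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.layer.0.attention.self.query.weight",
]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print("missing from checkpoint:", missing)
print("unexpected in checkpoint:", unexpected)
```

With PyTorch, the two key lists would typically come from `model.state_dict().keys()` and from the keys of the loaded checkpoint dict. `load_state_dict(..., strict=False)` also returns `missing_keys` and `unexpected_keys`, which carry the same information. If nearly every encoder parameter shows up as missing, the model is effectively training from scratch, which would be consistent with accuracy this low.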