
Can't reproduce the results for GLUE CoLA

fxmarty opened this issue on Nov 08, 2022 · 3 comments

My steps:

git clone https://github.com/microsoft/LoRA.git
cd LoRA
pip install -e .
cd examples/NLU
pip install -e .

Change export num_gpus=8 to export num_gpus=1 in roberta_large_cola.sh

Then CUDA_VISIBLE_DEVICES=0 bash roberta_large_cola.sh

Running on a single A100

Using:

  • datasets 2.6.1
  • python 3.9.13
  • PyTorch 1.13.0+cu117
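
Side note, not part of the steps above: one sanity check that helps when a LoRA run refuses to learn is to confirm that only the LoRA matrices (plus the classifier head) are actually trainable once the model is built. A minimal sketch, assuming model stands for the model object constructed inside run_glue.py:

def count_trainable(model):
    # Compare trainable vs. total parameters to verify the LoRA patching took effect.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} ({100 * trainable / total:.2f}%)")

With --apply_lora and lora_r 8 on roberta-large this should be a small fraction of the ~355M total parameters; if it prints close to 100%, LoRA was not applied.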

During training, eval_matthews_correlation is stuck at 0 for every epoch. I had the same issue with the current transformers version; lowering the learning rate and removing warmup got me back to reasonable numbers during training, but nothing close to the reported 0.68.
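
For reference, Matthews correlation is exactly 0 when the classifier collapses to predicting a single class (scikit-learn's matthews_corrcoef, which I believe the CoLA metric uses under the hood, returns 0.0 in that degenerate case), so a flat 0 usually means the model predicts only one label rather than being "slightly wrong". A small illustration:

from sklearn.metrics import matthews_corrcoef

labels = [1, 1, 0, 1, 0, 1, 1, 0]

# Degenerate classifier that always predicts the majority class: MCC = 0.0
print(matthews_corrcoef(labels, [1] * len(labels)))

# Classifier that gets some of both classes right: MCC ≈ 0.75
print(matthews_corrcoef(labels, [1, 1, 0, 1, 1, 1, 1, 0]))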

Do you have an idea of what I could be doing wrong?

Update: using

export num_gpus=1
export CUBLAS_WORKSPACE_CONFIG=":16:8" # https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
export PYTHONHASHSEED=0
export output_dir="./roberta_cola_custom_sh"
# Changed from the original roberta_large_cola.sh:
# per_device_train_batch_size 4 -> 8, learning_rate 3e-4 -> 2e-5,
# warmup_ratio 0.06 -> 0.0, weight_decay 0.1 -> 0.0
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
  examples/text-classification/run_glue.py \
  --model_name_or_path roberta-large \
  --task_name cola \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 20 \
  --output_dir $output_dir/model \
  --logging_steps 10 \
  --logging_dir $output_dir/log \
  --evaluation_strategy epoch \
  --save_strategy epoch \
  --warmup_ratio 0.0 \
  --apply_lora \
  --lora_r 8 \
  --lora_alpha 16 \
  --seed 0 \
  --weight_decay 0.0

trains just fine; I no longer see eval_matthews_correlation = 0 during training.
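
My guess (not verified) for why the original learning rate misbehaves on a single GPU: the reference recipe runs 8 GPUs at per_device_train_batch_size 4, i.e. an effective batch size of 32, while a single-GPU run sees a much smaller batch per optimizer step. Rough arithmetic, plus the Trainer flag that would restore it:

# Effective batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
original = 8 * 4 * 1  # reference script: 32 sequences per optimizer step
modified = 1 * 8 * 1  # my script above: 8 sequences per optimizer step
restored = 1 * 8 * 4  # adding --gradient_accumulation_steps 4 would give 32 again
print(original, modified, restored)

Keeping the effective batch size at 32 might make the original learning_rate 3e-4 and warmup_ratio 0.06 behave closer to the 8-GPU run, but I have not tested this.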

fxmarty · Nov 08 '22 21:11