LoRA
Can't reproduce the results for GLUE CoLA
My steps:
git clone https://github.com/microsoft/LoRA.git
cd LoRA
pip install -e .
cd examples/NLU
pip install -e .
In roberta_large_cola.sh, change export num_gpus=8 to export num_gpus=1, then run:
CUDA_VISIBLE_DEVICES=0 bash roberta_large_cola.sh
Running on a single A100
Using:
- datasets 2.6.1
- python 3.9.13
- PyTorch 1.13.0+cu117
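Changing num_gpus from 8 to 1 does more than remove parallelism. The per-device batch size in the script stays the same, so the effective batch size shrinks by 8x. The sketch below works through the arithmetic under three assumptions: a per-device batch size of 4 (the original value listed in the update below), gradient accumulation of 1, and the 8,551 examples in the CoLA training split. batch_stats is a hypothetical helper and is not part of the LoRA repo:

```python
# Hypothetical helper: effective (global) batch size and optimizer steps per
# epoch for a data-parallel run. Assumes gradient_accumulation_steps=1 and that
# each rank sees ceil(n_examples / num_gpus) examples (DistributedSampler pads
# the dataset so every rank gets the same count).
import math

COLA_TRAIN_EXAMPLES = 8551  # size of the GLUE CoLA training split


def batch_stats(num_gpus: int, per_device_batch: int,
                n_examples: int = COLA_TRAIN_EXAMPLES) -> tuple:
    """Return (effective_batch_size, optimizer_steps_per_epoch)."""
    effective_batch = num_gpus * per_device_batch
    examples_per_rank = math.ceil(n_examples / num_gpus)
    steps_per_epoch = math.ceil(examples_per_rank / per_device_batch)
    return effective_batch, steps_per_epoch


for gpus in (8, 1):
    eff, steps = batch_stats(gpus, per_device_batch=4)
    print(f"num_gpus={gpus}: effective batch {eff}, {steps} steps/epoch")
```

Under these assumptions, the single-GPU run takes 2138 optimizer steps per epoch instead of 268. It also keeps the 3e-4 learning rate from the 8-GPU setup while working with batches 8x smaller.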
During training, eval_matthews_correlation is stuck at 0 for every epoch. I had the same issue with the current transformers version. Lowering the learning rate and disabling warmup brought back OK-ish numbers during training, though not as good as the reported 0.68.
Do you have an idea of what I could be doing wrong?
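An eval_matthews_correlation of exactly 0 at every epoch most likely means the classifier has collapsed to predicting the same label for every validation sentence, rather than merely performing poorly. The hand-written MCC below shows why. It assumes the CoLA metric follows scikit-learn's matthews_corrcoef convention of returning 0 when the denominator is 0. The mcc function and the toy labels are illustrative only:

```python
# Matthews correlation coefficient for binary labels (illustrative helper).
# If any factor in the denominator is 0, which happens when every prediction
# is the same class, it returns 0.0, following scikit-learn's convention.
import math


def mcc(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


labels = [1, 1, 1, 0, 1, 0, 1, 1]              # toy CoLA-style labels
print(mcc(labels, [1] * len(labels)))          # always "acceptable": 0.0
print(mcc(labels, labels))                     # perfect predictions: 1.0
```

Because the denominator is zero for constant predictions, no amount of accuracy on the majority class moves the metric off 0.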
Update: using
export num_gpus=1
export CUBLAS_WORKSPACE_CONFIG=":16:8" # https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
export PYTHONHASHSEED=0
export output_dir="./roberta_cola_custom_sh"
# Changed from roberta_large_cola.sh (original values in parentheses):
#   --per_device_train_batch_size 8 (4), --learning_rate 2e-5 (3e-4),
#   --warmup_ratio 0.0 (0.06), --weight_decay 0.0 (0.1)
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
  examples/text-classification/run_glue.py \
  --model_name_or_path roberta-large \
  --task_name cola \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 20 \
  --output_dir $output_dir/model \
  --logging_steps 10 \
  --logging_dir $output_dir/log \
  --evaluation_strategy epoch \
  --save_strategy epoch \
  --warmup_ratio 0.0 \
  --apply_lora \
  --lora_r 8 \
  --lora_alpha 16 \
  --seed 0 \
  --weight_decay 0.0
trains just fine: eval_matthews_correlation is no longer 0 during training.
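The update changes four hyperparameters at once. The largest difference is probably the learning-rate schedule. The sketch below compares the two schedules and rests on several assumptions. It assumes the Trainer's default linear scheduler, with warmup steps computed as ceil(warmup_ratio × total steps). It assumes 20 epochs in both runs, since the update does not flag --num_train_epochs as changed. It also assumes single-GPU step counts derived from the 8,551 CoLA training examples. lr_at is a hypothetical helper that mirrors, rather than imports, transformers' get_linear_schedule_with_warmup:

```python
# Sketch of a linear-warmup, linear-decay learning-rate schedule. Assumes the
# Hugging Face Trainer default (lr_scheduler_type="linear") and that warmup
# steps are computed as ceil(warmup_ratio * total_steps).
import math

COLA_TRAIN_EXAMPLES = 8551  # GLUE CoLA training split
EPOCHS = 20                 # --num_train_epochs, assumed equal in both runs


def lr_at(step: int, peak_lr: float, total_steps: int, warmup_ratio: float) -> float:
    """Learning rate after `step` optimizer steps (hypothetical helper)."""
    warmup_steps = math.ceil(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Then decay linearly from peak_lr down to 0 at the final step.
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Single GPU: one optimizer step per per-device batch.
configs = {
    "original (batch 4)": dict(peak_lr=3e-4, warmup_ratio=0.06,
                               total_steps=EPOCHS * math.ceil(COLA_TRAIN_EXAMPLES / 4)),
    "update (batch 8)": dict(peak_lr=2e-5, warmup_ratio=0.0,
                             total_steps=EPOCHS * math.ceil(COLA_TRAIN_EXAMPLES / 8)),
}

for name, cfg in configs.items():
    warmup = math.ceil(cfg["warmup_ratio"] * cfg["total_steps"])
    print(f"{name}: {cfg['total_steps']} steps, {warmup} warmup steps, "
          f"lr at step 0 = {lr_at(0, **cfg):.1e}, peak lr = {cfg['peak_lr']:.1e}")
```

Under these assumptions, the original settings start at 0 and ramp over about 2,566 steps to a peak 15x higher than the update's 2e-5 before they begin to decay. The updated settings start at their peak and only decay.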