alpaca-lora
finetuning with gradient_checkpointing=True on 30B model
Convergence is slow when gradient_checkpointing=True is set.
I have fixed LLaMA's gradient checkpointing bug (see: https://github.com/huggingface/transformers/pull/22270/commits/fe32d793729276f6786eda4619281a328803713e).
My settings:
- A100 with 80 GB VRAM
- micro_batch_size = 4 (gradient_checkpointing=False)
- micro_batch_size = 32 (gradient_checkpointing=True)
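For reference, a minimal sketch of how these settings map onto a standard transformers `Trainer` setup; the checkpoint path, output directory, and precision flag below are placeholders/assumptions, not the exact arguments of the finetuning script used here:

```python
# Minimal sketch, assuming the usual transformers Trainer path.
# Model path and output_dir are placeholders.
import torch
from transformers import LlamaForCausalLM, TrainingArguments

model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",          # placeholder 30B checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)

# Numbers reported above: on one 80 GB A100, the micro batch size goes from 4
# (checkpointing off) to 32 (checkpointing on).
training_args = TrainingArguments(
    output_dir="./lora-alpaca-30b",   # placeholder
    per_device_train_batch_size=32,   # micro_batch_size
    gradient_checkpointing=True,      # recompute activations to save VRAM
    fp16=True,
)
```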
How does this work? Do you use DeepSpeed? Can you share the DeepSpeed config?
What is the max_seq_len in your finetuning stage? Did you ever try 2048?
How can this error be solved? Thank you. Finetuning with gradient_checkpointing=True produces the following error:
│ /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:2739 in training_step │
│ │
│ 2736 │ │ │ loss = loss / self.args.gradient_accumulation_steps │
│ 2737 │ │ │
│ 2738 │ │ if self.do_grad_scaling: │
│ ❱ 2739 │ │ │ self.scaler.scale(loss).backward() │
│ 2740 │ │ elif self.use_apex: │
│ 2741 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2742 │ │ │ │ scaled_loss.backward() │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py:204 in backward │
│ │
│ 201 │ # The reason we repeat same the comment below is that │
│ 202 │ # some Python versions print out the first line of a multi-line function │
│ 203 │ # calls in the traceback and some print out the last line │
│ ❱ 204 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 205 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 206 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 207 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
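This RuntimeError usually means that nothing flowing into the checkpointed blocks requires grad, which is common when the base weights are frozen for LoRA and only the adapters are trainable. A commonly suggested workaround, sketched below under the assumption of a transformers/peft setup (the model path is a placeholder, and this is not confirmed to be the fix for this exact script), is to force the embedding outputs to require grad before enabling checkpointing:

```python
# Sketch of a commonly suggested workaround, not a confirmed fix for this
# specific finetuning script; the checkpoint path is a placeholder.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/llama-30b-hf")  # placeholder

# Give the (frozen) embedding output a grad_fn so each checkpointed segment
# has at least one input that requires grad during backward.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```

If the script goes through peft's prepare_model_for_int8_training, that helper already performs this step, so the workaround matters mostly when gradient checkpointing is enabled manually.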