alpaca-lora
finetuning with gradient_checkpointing=True on 30B model
Convergence is slow when gradient_checkpointing=True is set.
I have fixed LLaMA's gradient checkpointing bug (see: https://github.com/huggingface/transformers/pull/22270/commits/fe32d793729276f6786eda4619281a328803713e).
My settings:
- A100 with 80 GB VRAM
- micro_batch_size = 4 (gradient_checkpointing=False)
- micro_batch_size = 32 (gradient_checkpointing=True)
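For reference, a minimal sketch of how these settings map onto a standard transformers `Trainer` setup; the checkpoint path, output directory, and precision flag below are placeholders/assumptions, not the exact arguments of the finetuning script used here:

```python
# Minimal sketch, assuming the usual transformers Trainer path.
# Model path and output_dir are placeholders.
import torch
from transformers import LlamaForCausalLM, TrainingArguments

model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",          # placeholder 30B checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
)

# Numbers reported above: on one 80 GB A100, the micro batch size goes from 4
# (checkpointing off) to 32 (checkpointing on).
training_args = TrainingArguments(
    output_dir="./lora-alpaca-30b",   # placeholder
    per_device_train_batch_size=32,   # micro_batch_size
    gradient_checkpointing=True,      # recompute activations to save VRAM
    fp16=True,
)
```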
How does this work? Do you use DeepSpeed? Can you share the DeepSpeed config?
What is the max_seq_len in your finetuning stage? Did you ever try 2048?
How can this error be solved? Thank you. Finetuning with gradient_checkpointing=True produces the following error:
│ /usr/local/lib/python3.8/dist-packages/transformers/trainer.py:2739 in training_step │
│ │
│ 2736 │ │ │ loss = loss / self.args.gradient_accumulation_steps │
│ 2737 │ │ │
│ 2738 │ │ if self.do_grad_scaling: │
│ ❱ 2739 │ │ │ self.scaler.scale(loss).backward() │
│ 2740 │ │ elif self.use_apex: │
│ 2741 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2742 │ │ │ │ scaled_loss.backward() │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py:204 in backward │
│ │
│ 201 │ # The reason we repeat same the comment below is that │
│ 202 │ # some Python versions print out the first line of a multi-line function │
│ 203 │ # calls in the traceback and some print out the last line │
│ ❱ 204 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 205 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 206 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 207 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
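This RuntimeError usually means that nothing flowing into the checkpointed blocks requires grad, which is common when the base weights are frozen for LoRA and only the adapters are trainable. A commonly suggested workaround, sketched below under the assumption of a transformers/peft setup (the model path is a placeholder, and this is not confirmed to be the fix for this exact script), is to force the embedding outputs to require grad before enabling checkpointing:

```python
# Sketch of a commonly suggested workaround, not a confirmed fix for this
# specific finetuning script; the checkpoint path is a placeholder.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/llama-30b-hf")  # placeholder

# Give the (frozen) embedding output a grad_fn so each checkpointed segment
# has at least one input that requires grad during backward.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
```

If the script goes through peft's prepare_model_for_int8_training, that helper already performs this step, so the workaround matters mostly when gradient checkpointing is enabled manually.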