DeepSpeedExamples
It seems that lora does not work
When I run step1_supervised_finetuning, I found that the opt-1.3B model occupies about 15 GB of GPU memory. It seems that LoRA does not work, but when I print the trainable parameters, the numbers look correct. Why is this?
trainable params: 7077888 || all params: 1322835968 || trainable%: 0.535054093721165
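For context, a printout like the one above is typically produced by a small helper that counts parameters with requires_grad set; a minimal sketch (hypothetical helper name, plain PyTorch):

def print_trainable_parameters(model):
    # Count parameters that will actually receive gradients vs. all parameters.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")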
@blldd Can you share your Python environment configuration? My training always errors out. I would like to know your Python, torch, deepspeed, and transformers versions, for example: Python 3.8, DeepSpeed 0.9.0. Thanks.
Env: Ubuntu 18.04.5 LTS, Python 3.9.16, deepspeed 0.9.0+c8fc9c5f, transformers 4.28.0.dev0
@blldd Is 15 GB only for DeepSpeed initialization, or is it the peak memory consumption during training? During training, a lot of memory is consumed by activations, compute buffers, and other overhead.
15GB is the memory consumption during training.
When I set --lora_dim 2, the trainable percentage is 0.13% and the memory consumption during training is about 10 GB:
trainable params: 1769472 || all params: 1317527552 || trainable%: 0.13430246656428177
nvidia-smi during training: | N/A 54C P0 41W / 70W | 10008MiB / 15109MiB | 100% Default |
In my understanding, with LoRA the parameters of the base model are frozen and only the LoRA layers are optimized during training, so I expected the memory usage to be roughly 1-2x the size of the base model parameters, yet here it is more than ten times that.
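A rough back-of-the-envelope estimate (my own arithmetic, assuming fp16 base weights and fp32 Adam states kept only for the LoRA parameters) shows why the frozen weights alone cannot explain 10-15 GB:

base_params = 1_317_527_552   # "all params" from the printout above
lora_params = 1_769_472       # "trainable params" with --lora_dim 2

frozen_weights_gb = base_params * 2 / 1e9                      # fp16 weights only: no grads, no optimizer states
lora_train_state_gb = lora_params * (2 + 2 + 4 + 4 + 4) / 1e9  # fp16 weight + grad, fp32 master copy, Adam m and v
print(f"frozen base weights:        {frozen_weights_gb:.2f} GB")    # ~2.6 GB
print(f"LoRA weights + optim state: {lora_train_state_gb:.3f} GB")  # ~0.03 GB
# The remaining several GB observed on the GPU is dominated by forward activations,
# temporary buffers, and CUDA allocator/framework overhead, which LoRA does not shrink.

In other words, LoRA mainly removes gradient and optimizer memory for the base model; it does not reduce the fp16 weight copy or the activation memory, which grow with batch size and sequence length.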
Training script:
deepspeed main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --model_name_or_path /data1/opt-iml-max-1.3b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 2 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 2 \
   --lora_module_name decoder.layers. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
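For reference, --only_optimize_lora conceptually boils down to freezing everything except the LoRA weights before building the optimizer; a simplified sketch (not the repo's exact code, helper name and keyword assumed):

import torch

def only_optimize_lora_parameters(model, lora_keyword="lora_"):
    # Freeze every parameter whose name does not contain the LoRA keyword.
    for name, param in model.named_parameters():
        param.requires_grad = lora_keyword in name
    return model

# The optimizer then only sees the small set of trainable LoRA parameters:
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)

Freezing removes the base model's gradients and optimizer states, but the fp16 weights themselves and the forward activations still live on the GPU.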
Same problem here: the LoRA module does not reduce GPU memory usage on my 8x V100 machine.
During training, you will need a lot of memory for intermediate activations. Also, please take a look at https://github.com/microsoft/DeepSpeedExamples/issues/299
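If activation memory is the main cost, one standard mitigation (orthogonal to LoRA) is gradient checkpointing, which recomputes activations during the backward pass instead of storing them; with a Hugging Face model it is a one-line call, for example:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory

Reducing --max_seq_len or keeping --per_device_train_batch_size small also directly shrinks activation memory.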
Closing the issue since there is no follow-up. Please reopen it if you still need more clarification.