DeepSpeedExamples
It seems that lora does not work
When I run step1_supervised_finetuning, I found that the opt-1.3B model occupies about 15 GB of GPU memory. It seems that LoRA does not work, but when I print the trainable parameters, the numbers look correct. Why is this?
trainable params: 7077888 || all params: 1322835968 || trainable%: 0.535054093721165
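For context, a printout like the one above is typically produced by a small helper that counts parameters with requires_grad set; a minimal sketch (hypothetical helper name, plain PyTorch):

def print_trainable_parameters(model):
    # Count parameters that will actually receive gradients vs. all parameters.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")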
@blldd Can you share your Python environment configuration? My training always errors out. I would like to know your Python, torch, deepspeed, and transformers versions, for example: Python 3.8, DeepSpeed 0.9.0. Thanks.
Env: Ubuntu 18.04.5 LTS, Python 3.9.16, deepspeed 0.9.0+c8fc9c5f, transformers 4.28.0.dev0
@blldd Is 15 GB only for DeepSpeed initialization, or is it the peak memory consumption during training? During training, a lot of memory is consumed by activations, compute buffers, and other overhead.
15GB is the memory consumption during training.
When I set --lora_dim 2, the trainable percentage is 0.13% and the memory consumption during training is about 10 GB:
trainable params: 1769472 || all params: 1317527552 || trainable%: 0.13430246656428177
nvidia-smi during training: | N/A 54C P0 41W / 70W | 10008MiB / 15109MiB | 100% Default |
In my understanding, with LoRA the parameters of the base model are frozen and only the LoRA layers are optimized during training, so I expected the memory usage to be roughly 1-2x the size of the base model parameters, yet here it is more than ten times that.
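A rough back-of-the-envelope estimate (my own arithmetic, assuming fp16 base weights and fp32 Adam states kept only for the LoRA parameters) shows why the frozen weights alone cannot explain 10-15 GB:

base_params = 1_317_527_552   # "all params" from the printout above
lora_params = 1_769_472       # "trainable params" with --lora_dim 2

frozen_weights_gb = base_params * 2 / 1e9                      # fp16 weights only: no grads, no optimizer states
lora_train_state_gb = lora_params * (2 + 2 + 4 + 4 + 4) / 1e9  # fp16 weight + grad, fp32 master copy, Adam m and v
print(f"frozen base weights:        {frozen_weights_gb:.2f} GB")    # ~2.6 GB
print(f"LoRA weights + optim state: {lora_train_state_gb:.3f} GB")  # ~0.03 GB
# The remaining several GB observed on the GPU is dominated by forward activations,
# temporary buffers, and CUDA allocator/framework overhead, which LoRA does not shrink.

In other words, LoRA mainly removes gradient and optimizer memory for the base model; it does not reduce the fp16 weight copy or the activation memory, which grow with batch size and sequence length.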
Training script:
deepspeed main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --model_name_or_path /data1/opt-iml-max-1.3b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 1e-3 \
   --weight_decay 0.1 \
   --num_train_epochs 2 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 0 \
   --lora_dim 2 \
   --lora_module_name decoder.layers. \
   --only_optimize_lora \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log
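For reference, --only_optimize_lora conceptually boils down to freezing everything except the LoRA weights before building the optimizer; a simplified sketch (not the repo's exact code, helper name and keyword assumed):

import torch

def only_optimize_lora_parameters(model, lora_keyword="lora_"):
    # Freeze every parameter whose name does not contain the LoRA keyword.
    for name, param in model.named_parameters():
        param.requires_grad = lora_keyword in name
    return model

# The optimizer then only sees the small set of trainable LoRA parameters:
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)

Freezing removes the base model's gradients and optimizer states, but the fp16 weights themselves and the forward activations still live on the GPU.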
Same problem here: the LoRA module does not reduce GPU memory usage on my 8x V100 machine.
During training, you will need a lot of memory for intermediate activations. Also, please take a look at https://github.com/microsoft/DeepSpeedExamples/issues/299
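If activation memory is the main cost, one standard mitigation (orthogonal to LoRA) is gradient checkpointing, which recomputes activations during the backward pass instead of storing them; with a Hugging Face model it is a one-line call, for example:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory

Reducing --max_seq_len or keeping --per_device_train_batch_size small also directly shrinks activation memory.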
Closing the issue since there is no follow-up. Please reopen it if you still need more clarification.