ChatGLM2-6B

What is the minimum hardware required to run bash ds_train_finetune.sh?

Open songmzhang opened this issue 1 year ago • 10 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

I am currently trying to run the default ds_train_finetune.sh script on a single A100 80G, or on 4x A100 40G, with the AdvertiseGen dataset, and it still runs out of GPU memory even with per_device_train_batch_size=1. What is the minimum hardware requirement for full-parameter fine-tuning?

Expected Behavior

No response

Steps To Reproduce

N/A

Environment

- Python: 3.10
- Transformers: 4.30.2
- PyTorch: 1.13
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True

Anything else?

No response

songmzhang avatar Jul 06 '23 09:07 songmzhang

  1. GPU training: my RTX 2060 6G card cannot run it, so I train on CPU. Yesterday CPU training quickly blew past my 64G of RAM, but after a bit of optimization it now trains normally. Without any quantization, full-parameter training uses about 35G of RAM. Other settings: torch_dtype=float16, effective batch size 1*16=16, maximum length of the concatenated source and target sentences 500 (--per_device_train_batch_size 1 --gradient_accumulation_steps 16 --max_source_length 300 --max_target_length 200).

Optimization method:

    model = ChatGLMForConditionalGeneration.from_pretrained(
        pretrained_model_name_or_path=model_and_tokenizer_config_dir,
        config=glm_config,
        device_map="auto",
        torch_dtype="auto",
        low_cpu_mem_usage=True,
        load_in_8bit=False,
    )

For CPU training, skip the model-quantization code:

    if model_args.quantization_bit is not None:
        if device != torch.device('cpu'):  # quantize only on GPU
            print("model_args.quantization_bit: ", model_args.quantization_bit)
            model = model.quantize(model_args.quantization_bit)
            print("model.quantize done!")

Also, on CPU, turn off the half-precision cast used for efficient fine-tuning:

    if model_args.pre_seq_len is not None:  # efficient fine-tuning with P-tuning v2
        if device == torch.device('cpu'):
            model = model.float()
        else:
            model = model.half()
        model.transformer.prefix_encoder.float()

The above works well on CPU. You are training on GPU; if large GPUs still fill up that quickly even with 4-bit quantization and half precision, you may also want to review your model loading and parameter configuration along these lines.
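
For context, here is a rough sketch of a single-process training command using the batch-size and length flags quoted above. The main.py entry point, the AdvertiseGen paths, and the remaining flags are assumptions modeled on the repo's ptuning scripts, not a verified command.

    # Hypothetical CPU run with the settings quoted above; main.py and the
    # data/model paths are placeholders modeled on the repo's ptuning scripts.
    python3 main.py \
        --do_train \
        --train_file AdvertiseGen/train.json \
        --prompt_column content \
        --response_column summary \
        --model_name_or_path /path/to/chatglm2-6b \
        --output_dir ./output/adgen-full-ft-cpu \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --max_source_length 300 \
        --max_target_length 200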

lilongxian avatar Jul 06 '23 09:07 lilongxian

At least 16x A100 80G, I'd guess.

zhangfan-algo avatar Jul 07 '23 02:07 zhangfan-algo

Two A100 80G here. Watching utilization, the first card was only about half used when the second card started filling up, and then it went OOM, with batch size set to 2. By the way, why does it start using the second card before the first one is full?

LittleXu1998 avatar Jul 07 '23 03:07 LittleXu1998

I tried it: adding an optimizer section to the DeepSpeed config at least gets it running, on 8x A10.

traveler-vee avatar Jul 07 '23 09:07 traveler-vee

I tried it: adding an optimizer section to the DeepSpeed config at least gets it running, on 8x A10.

Could you share exactly how you added the optimizer config?

songmzhang avatar Jul 07 '23 09:07 songmzhang

"optimizer": { "type": "AdamW", "params": { "lr": 1e-5, "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true },

traveler-vee avatar Jul 07 '23 09:07 traveler-vee

Give this a try; I've only just started the run, so I don't know yet whether there are pitfalls further along.

traveler-vee avatar Jul 07 '23 09:07 traveler-vee

The optimizer that transformers uses by default doesn't perform well.

traveler-vee avatar Jul 07 '23 09:07 traveler-vee

--gradient_accumulation_steps 16 \
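
For context, a rough sketch of where this flag and the --deepspeed config would sit in the launch command. The GPU count, the other flag values, and the paths below are illustrative assumptions patterned on the repo's ds_train_finetune.sh, not its exact defaults.

    # Illustrative launch command; the GPU count, flag values, and paths are
    # assumptions, not the script's exact defaults.
    deepspeed --num_gpus=8 main.py \
        --deepspeed deepspeed.json \
        --do_train \
        --train_file AdvertiseGen/train.json \
        --prompt_column content \
        --response_column summary \
        --model_name_or_path THUDM/chatglm2-6b \
        --output_dir ./output/adgen-chatglm2-6b-ft \
        --max_source_length 64 \
        --max_target_length 64 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --learning_rate 1e-5 \
        --fp16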

traveler-vee avatar Jul 07 '23 09:07 traveler-vee

"optimizer": { "type": "AdamW", "params": { "lr": 1e-5, "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true },

Got it, thanks.

songmzhang avatar Jul 07 '23 09:07 songmzhang

Give this a try; I've only just started the run, so I don't know yet whether there are pitfalls further along.

Did you eventually get full fine-tuning to run? If so, which versions of Python, torch, CUDA, and deepspeed did you use?

bigbigwatermalon avatar Jul 12 '23 07:07 bigbigwatermalon

Where do you all get your GPUs from……

wangnan229 avatar Jul 24 '23 02:07 wangnan229