ChatGLM2-6B: What are the minimum resources needed to run bash ds_train_finetune.sh?

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

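I have tried running the default ds_train_finetune.sh script on a single A100 80G and on 4x A100 40G. With the AdvertiseGen dataset and per_device_train_batch_size=1, it still runs out of GPU memory. What are the minimum resource requirements for full-parameter fine-tuning?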

Expected Behavior

No response

Steps To Reproduce

N/A

Environment

- Python: 3.10
- Transformers: 4.30.2
- PyTorch: 1.13
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True

Anything else?

No response

Jul 06 '23 09:07 songmzhang

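GPU training: my card is an RTX 2060 with 6 GB, which cannot run this, so I trained on CPU instead. Yesterday CPU training quickly exhausted my 64 GB of RAM; after a few optimizations it now trains normally. With no quantization at all, full-parameter training uses 35 GB of RAM. Other settings: torch_type=float16, batch size 1*16=16, and a maximum length of 500 after concatenating the a and b sentences (--per_device_train_batch_size 1 --gradient_accumulation_steps 16 --max_source_length 300 --max_target_length 200)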
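The batch-size and length arithmetic in the settings above can be sketched as follows. This is only an illustration of how the quoted flags combine; `num_devices = 1` is an assumption (a single CPU process), and the actual script may add a few special tokens on top of the summed length.

```python
# Values taken from the command-line flags quoted above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
max_source_length = 300
max_target_length = 200

# Assumption: one training process (CPU-only run).
num_devices = 1

# Samples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Upper bound on tokens per concatenated sample, ignoring special tokens.
max_total_length = max_source_length + max_target_length

print("effective batch size:", effective_batch_size)
print("max concatenated length:", max_total_length)
```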

Optimizations used:

    model = ChatGLMForConditionalGeneration.from_pretrained(
        pretrained_model_name_or_path=model_and_tokenizer_config_dir,
        config=glm_config,
        device_map="auto",
        torch_dtype="auto",
        low_cpu_mem_usage=True,
        load_in_8bit=False,
    )

For CPU training, skip the model quantization code:

    if model_args.quantization_bit is not None:
        if device != torch.device('cpu'):  # gpu
            print("model_args.quantization_bit: ", model_args.quantization_bit)
            model = model.quantize(model_args.quantization_bit)
            print("model.quantize done!")

Also turn off half precision in the efficient-training (P-tuning) path when on CPU:

    if model_args.pre_seq_len is not None:
        # efficient fine-tuning with P-tuning v2
        if device == torch.device('cpu'):
            model.float()
        else:
            model = model.half()
            model.transformer.prefix_encoder.float()

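The approach above works well on CPU. You are training on GPU; if memory still fills up quickly with 4-bit quantization, half precision, and large GPUs, you may also need to look at how the model is loaded and how the parameters are configured.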

Jul 06 '23 09:07 lilongxian

At least 16 A100 80G cards, I would guess.

Jul 07 '23 02:07 zhangfan-algo

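Two A100 80G cards. From what I observed, the first card got about half full, then the second card started being used, and then it hit OOM. Batch size was set to 2. By the way, why does it start using the second card before the first one is full?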
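For context on the OOM reports above, here is a rough back-of-the-envelope sketch of per-GPU memory for model states alone under mixed-precision Adam, using the usual ZeRO accounting: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The parameter count of 6.2e9 is an approximation, the function name is made up for illustration, and activations, communication buffers, and fragmentation are not included, so real usage will be higher.

```python
def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Estimate per-GPU memory (GB, 1 GB = 1e9 bytes) for model states only."""
    weights = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params     # fp16 gradients
    optim = 12.0 * num_params    # fp32 master weights + Adam momentum + variance

    # Each ZeRO stage partitions one more component across the GPUs.
    if zero_stage >= 1:
        optim /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus

    return (weights + grads + optim) / 1e9


# Assumption: approximate parameter count for a "6B" model.
NUM_PARAMS = 6.2e9

for n in (1, 2, 4, 8):
    for stage in (0, 2, 3):
        gb = model_state_gb(NUM_PARAMS, n, stage)
        print(f"gpus={n} zero_stage={stage}: ~{gb:.1f} GB per GPU")
```

Under these assumptions a single GPU needs about 99 GB for model states, which exceeds 80 GB before any activations are counted. If the script uses ZeRO stage 2, the estimate on 4 GPUs is about 34 GB per GPU, which leaves little headroom on a 40 GB card.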

Jul 07 '23 03:07 LittleXu1998

I tried it: adding an optimizer section to the DeepSpeed config at least gets it running, on 8x A10.

Jul 07 '23 09:07 traveler-vee

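I tried it: adding an optimizer section to the DeepSpeed config at least gets it running, on 8x A10.

Could you share exactly how you added the optimizer configuration?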

Jul 07 '23 09:07 songmzhang

"optimizer": { "type": "AdamW", "params": { "lr": 1e-5, "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true },

Jul 07 '23 09:07 traveler-vee

Try this. I have only just started the run, so I don't know yet whether there are pitfalls further along.
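As a minimal sketch, the optimizer and zero_optimization fragment shared in this thread could be placed into a complete DeepSpeed config like this. The fragment values are copied from the comment; the surrounding keys (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `fp16`), the function name, and the output file name `deepspeed.json` are assumptions for illustration, with `"auto"` values relying on the Hugging Face Trainer integration to fill them in.

```python
import json


def build_ds_config() -> dict:
    """Return a DeepSpeed config dict containing the fragment from the thread."""
    return {
        # Assumed surrounding keys; "auto" is resolved by the HF Trainer.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "fp16": {"enabled": "auto"},
        # Fragment shared in the thread: DeepSpeed-managed AdamW.
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": 1e-5,
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto",
            },
        },
        # Fragment shared in the thread: ZeRO stage 2 with optimizer
        # states offloaded to CPU memory.
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "allgather_partitions": True,
            "allgather_bucket_size": 2e8,
            "overlap_comm": True,
            "reduce_scatter": True,
            "reduce_bucket_size": 2e8,
            "contiguous_gradients": True,
        },
    }


def write_ds_config(path: str = "deepspeed.json") -> None:
    """Write the config as JSON; the file name is a placeholder."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_ds_config(), f, indent=2)


if __name__ == "__main__":
    print(json.dumps(build_ds_config(), indent=2))
```

Offloading optimizer states moves the largest share of model-state memory from GPU to host RAM, at the cost of extra CPU-GPU traffic per step.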

Jul 07 '23 09:07 traveler-vee

The optimizer built into transformers doesn't perform well.

Jul 07 '23 09:07 traveler-vee

--gradient_accumulation_steps 16 \

Jul 07 '23 09:07 traveler-vee

"optimizer": { "type": "AdamW", "params": { "lr": 1e-5, "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true },

OK, thanks.

Jul 07 '23 09:07 songmzhang

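Try this. I have only just started the run, so I don't know yet whether there are pitfalls further along.

Did you get full-parameter fine-tuning running in the end? If so, which versions of Python, torch, CUDA, and deepspeed are you using?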

Jul 12 '23 07:07 bigbigwatermalon

Where do you all get your GPUs from...

Jul 24 '23 02:07 wangnan229
