ChatGLM2-6B
What are the minimum resources needed to run bash ds_train_finetune.sh?
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
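I have tried running the default ds_train_finetune.sh script on either a single A100 80G or four A100 40G cards. With the AdvertiseGen dataset and per_device_train_batch_size=1 it still runs out of GPU memory. What are the minimum resource requirements for full-parameter fine-tuning?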
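As a rough back-of-the-envelope check on the question above, the sketch below estimates the GPU memory taken by model states alone under mixed-precision Adam. It uses the commonly cited 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moments) and the way ZeRO stages 1 to 3 partition those states across GPUs. The parameter count of 6.2e9 is an assumption for ChatGLM2-6B, the function name is illustrative, and activations, communication buffers, and fragmentation are not counted, so real usage will be higher.

```python
def model_state_gb(n_params: float, n_gpus: int, zero_stage: int) -> float:
    """Estimate per-GPU memory (in GB, 1e9 bytes) for model states only."""
    weights = 2.0 * n_params   # fp16 parameters
    grads = 2.0 * n_params     # fp16 gradients
    optim = 12.0 * n_params    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:        # stage 1 shards optimizer states
        optim /= n_gpus
    if zero_stage >= 2:        # stage 2 also shards gradients
        grads /= n_gpus
    if zero_stage >= 3:        # stage 3 also shards the parameters
        weights /= n_gpus
    return (weights + grads + optim) / 1e9


N_PARAMS = 6.2e9  # assumed parameter count for ChatGLM2-6B

if __name__ == "__main__":
    for n_gpus, stage in [(1, 0), (4, 2), (8, 2), (8, 3)]:
        gb = model_state_gb(N_PARAMS, n_gpus, stage)
        print(f"{n_gpus} GPU(s), ZeRO stage {stage}: ~{gb:.1f} GB per GPU")
```

Under these assumptions a single 80 GB card cannot hold the roughly 99 GB of model states without offloading, and four 40 GB cards at stage 2 already sit at roughly 34 GB before any activations are counted, which would be consistent with the OOM described above.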
Expected Behavior
No response
Steps To Reproduce
N/A
Environment
- Python: 3.10
- Transformers: 4.30.2
- PyTorch: 1.13
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
Anything else?
No response
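- GPU training: my card is an RTX 2060 with 6 GB, which cannot run this, so I trained on CPU instead. Yesterday the CPU run quickly exhausted my 64 GB of RAM; after some optimization it now trains normally. Without any quantization, full-parameter training uses about 35 GB of RAM. Other settings: torch_type=float16, effective batch size 16 (1*16), and a maximum length of 500 after concatenating the source and target sequences (--per_device_train_batch_size 1 --gradient_accumulation_steps 16 --max_source_length 300 --max_target_length 200).
Optimizations: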
model = ChatGLMForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=model_and_tokenizer_config_dir,
    config=glm_config,
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,  # avoid holding a second full copy of the weights while loading
    load_in_8bit=False,      # no 8-bit quantization
)
For CPU training, the model quantization code has to be skipped:
if model_args.quantization_bit is not None:
    if device != torch.device('cpu'):  # quantize only when running on GPU
        print("model_args.quantization_bit: ", model_args.quantization_bit)
        model = model.quantize(model_args.quantization_bit)
        print("model.quantize done!")
Also turn off half precision for the model in the parameter-efficient training path:
if model_args.pre_seq_len is not None:
    # Parameter-efficient fine-tuning with P-tuning v2
    if device == torch.device('cpu'):
        model.float()  # keep everything in fp32 on CPU
    else:
        model = model.half()
        model.transformer.prefix_encoder.float()  # prefix encoder stays in fp32
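The approach above works well on CPU. If you are training on GPU with 4-bit quantization, half precision, and large cards, and memory still fills up almost immediately, it may be worth reviewing the model loading and parameter settings as well.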
Probably at least 16 A100 80G cards.
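I ran on two A100 80G cards and watched the usage: the first card filled to about half, then the second card started being used, and then it hit OOM, with the batch size set to 2. By the way, why does it start using the second card before the first one is full?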
I found that adding an optimizer section to the DeepSpeed config at least gets it running, on 8 A10 cards.
Could you share exactly how you added the optimizer config?
"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": 1e-5,
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
    }
},
"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
},
Try this. I have only just started the run, so I don't know yet whether there are pitfalls later on.
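For anyone who wants to apply the fragment above, here is a minimal sketch that merges it into an existing DeepSpeed config using only Python's standard library. The helper name add_cpu_offload and the small base config are assumptions for illustration; the optimizer and zero_optimization keys and values are the ones posted above. As far as I know, entries set to "auto" are filled in from the training arguments by the Hugging Face Trainer's DeepSpeed integration, so a hard-coded lr should be kept consistent with --learning_rate.

```python
import json

# Sections as posted in this thread (JSON true becomes Python True).
OPTIMIZER = {
    "type": "AdamW",
    "params": {"lr": 1e-5, "betas": "auto", "eps": "auto", "weight_decay": "auto"},
}
ZERO_OPTIMIZATION = {
    "stage": 2,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "allgather_partitions": True,
    "allgather_bucket_size": 2e8,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": True,
}


def add_cpu_offload(config: dict) -> dict:
    """Return a copy of config with the optimizer and ZeRO sections replaced."""
    merged = dict(config)
    merged["optimizer"] = OPTIMIZER
    merged["zero_optimization"] = ZERO_OPTIMIZATION
    return merged


if __name__ == "__main__":
    # Assumed base config; in practice, load your existing JSON file instead.
    base = {"train_micro_batch_size_per_gpu": "auto", "fp16": {"enabled": "auto"}}
    print(json.dumps(add_cpu_offload(base), indent=2))
```

The printed JSON can be saved to a file and passed to the training script through its DeepSpeed config argument. Offloading optimizer states to CPU trades GPU memory for host RAM and extra transfer time, so host memory needs to be large enough to hold them.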
The optimizer that ships with transformers does not perform well.
--gradient_accumulation_steps 16 \
OK, thanks.
Did you eventually get full-parameter fine-tuning running? If so, which versions of Python, torch, CUDA, and deepspeed did you use?
Where do you all get these GPUs from...