ChatGLM-6B
[Help] How to fine-tune on multiple GPUs
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Problem: fine-tuning OOMs on a single 80 GB GPU, and running on multiple GPUs raises an error. How can I fine-tune across multiple GPUs?
Training method: Hugging Face Trainer
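For reference, the Hugging Face Trainer handles multi-GPU fine-tuning either with one process per GPU (DistributedDataParallel, picked up automatically when the script is started by a distributed launcher) or by falling back to nn.DataParallel when several GPUs are visible in a single process. A minimal sketch of the launcher route, with the script name taken from the traceback below and the `run_training` entry point assumed:

```python
# Hypothetical sketch: start one process per GPU so the Trainer uses
# DistributedDataParallel instead of falling back to nn.DataParallel.
#
#   torchrun --nproc_per_node=4 train.py
#
# Inside train.py nothing special is needed beyond the usual Trainer setup;
# the Trainer reads LOCAL_RANK / WORLD_SIZE from the launcher environment.
from transformers import Trainer, TrainingArguments

def run_training(model, train_dataset, data_collator):
    args = TrainingArguments(
        output_dir="output",              # assumed output path
        per_device_train_batch_size=1,    # batch size per GPU
        gradient_accumulation_steps=8,    # trade compute for memory
        fp16=True,                        # mixed precision to reduce memory
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=data_collator,
    )
    trainer.train()
```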
Expected Behavior
No response
Steps To Reproduce
Step 1. Tried 1 * A100-80G: failed with OOM.
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 79.35 GiB total capacity; 76.09 GiB already allocated; 336.19 MiB free; 76.10 GiB reserved in total by PyTorch)
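For context on why a single 80 GB card OOMs here: a full-parameter fine-tune keeps weights, gradients, and Adam optimizer states resident at the same time, which for a ~6B-parameter model in fp32 already exceeds 80 GB before counting activations. A back-of-the-envelope sketch (the 6B figure is approximate):

```python
# Rough memory estimate for full fine-tuning a ~6B-parameter model
# with fp32 weights and the Adam optimizer (activations excluded).
params = 6e9
bytes_per_param = 4                       # fp32

weights  = params * bytes_per_param       # ~24 GB
grads    = params * bytes_per_param       # ~24 GB
adam_m_v = 2 * params * bytes_per_param   # ~48 GB (first + second moments)

total_gb = (weights + grads + adam_m_v) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~96 GB > 80 GB
```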
Step 2. Tried 4 * A30-32G: failed; it looks like training directly on multiple GPUs is not supported. See the traceback and the reconstruction sketch below.
Traceback (most recent call last):
File "train.py", line 35, in <module>
trainer.train()
File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1547, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1791, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2539, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2571, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 434, in reraise
raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 1038, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 881, in forward
output_attentions=output_attentions
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 580, in forward
output_attentions=output_attentions
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 413, in forward
cos, sin = self.rotary_emb(q1, seq_len=position_ids.max() + 1)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 184, in forward
return self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...]
TypeError: 'NoneType' object is not subscriptable
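From the frames above, the TypeError is raised inside an nn.DataParallel replica: with four GPUs visible in one process the Trainer wraps the model in DataParallel, and in the replica the rotary-embedding cache (self.cos_cached) is still None when it gets subscripted. The accelerate hook frames in the trace suggest the model was loaded with a device_map, which does not compose with DataParallel replication. A hedged reconstruction of the kind of train.py that hits this path (the device_map="auto" load and the dataset wiring are assumptions, not the actual script):

```python
# Assumed reconstruction of the failing setup, not the actual train.py.
from transformers import AutoModel, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# The accelerate hooks in the traceback suggest the model was loaded with a
# device_map; assumed here as device_map="auto".
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b", trust_remote_code=True, device_map="auto"
).half()

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="output", per_device_train_batch_size=1),
    train_dataset=...,   # dataset not shown in the report; placeholder only
)
trainer.train()          # with 4 visible GPUs the Trainer wraps the model in
                         # nn.DataParallel, which leads to the TypeError above
```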
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support: true
Anything else?
No response