ChatGLM-6B

[Help] How to fine-tune on multiple GPUs

Open MrToy opened this issue 2 years ago • 0 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Problem: fine-tuning on a single 80 GB GPU runs out of memory (OOM), and using multiple GPUs raises an error. How can I fine-tune on multiple GPUs? Training setup: Hugging Face Trainer.
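For context, the fine-tuning script presumably looks roughly like the sketch below (editor's sketch; only the "THUDM/chatglm-6b" Hub identifier is taken from the issue, the toy dataset and all hyperparameters are placeholders). The two steps under "Steps To Reproduce" refer to such a train.py.

```python
# Hypothetical baseline train.py, for illustration only.
from transformers import AutoModel, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

# Toy two-example dataset, only to keep the sketch self-contained; replace with real data.
texts = ["你好,请介绍一下你自己。", "今天天气怎么样?"]
train_dataset = [
    {"input_ids": ids, "labels": list(ids)}
    for ids in (tokenizer(t)["input_ids"] for t in texts)
]

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    fp16=True,            # mixed precision; the optimizer state is still kept in fp32
    logging_steps=10,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```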

Expected Behavior

No response

Steps To Reproduce

Step 1. Tried 1 * A100-80G; failed with OOM.

RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 79.35 GiB total capacity; 76.09 GiB already allocated; 336.19 MiB free; 76.10 GiB reserved in total by PyTorch) 
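Editor's note on why a single 80 GB card fills up: full fine-tuning keeps the weights, their gradients, and two Adam moments resident at once, and for a roughly 6.2 B-parameter model that alone exceeds the 79.35 GiB reported above. A rough estimate, under the stated assumptions:

```python
# Back-of-the-envelope estimate of the persistent training state for full fine-tuning
# of ChatGLM-6B with Adam. Assumptions (not from this issue): ~6.2e9 parameters,
# PyTorch-native mixed precision (fp32 weights, fp32 gradients, fp32 Adam moments).
params = 6.2e9
GiB = 1024 ** 3

weights      = params * 4 / GiB   # ~23 GiB
gradients    = params * 4 / GiB   # ~23 GiB
adam_moments = params * 8 / GiB   # ~46 GiB (first + second moment)

print(f"{weights + gradients + adam_moments:.0f} GiB")  # ~92 GiB, before any activations
```

So the OOM is expected regardless of batch size; fitting on one card generally requires sharding or offloading the optimizer state (e.g. DeepSpeed ZeRO, see the sketch after Step 2) or parameter-efficient tuning (P-Tuning v2, LoRA) instead of full fine-tuning.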

Step 2. Tried 4 * A30-32G; failed. It looks like training directly on multiple GPUs is not supported.

Traceback (most recent call last):
  File "train.py", line 35, in <module>
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1547, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 2571, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 434, in reraise
    raise exception
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 1038, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 881, in forward
    output_attentions=output_attentions
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 580, in forward
    output_attentions=output_attentions
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 413, in forward
    cos, sin = self.rotary_emb(q1, seq_len=position_ids.max() + 1)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1118, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/220f772e9a2d7a55701d9b49bf2efc618acc3b56/modeling_chatglm.py", line 184, in forward
    return self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...]
TypeError: 'NoneType' object is not subscriptable
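Editor's note on this traceback: the stack passes through torch/nn/parallel/data_parallel.py, i.e. with several visible GPUs and no distributed launcher the Trainer falls back to nn.DataParallel, which replicates the whole model onto every GPU each step; the replicas then index the rotary-embedding cache (cos_cached / sin_cached) while it is still None, hence the 'NoneType' object is not subscriptable. A common way around this, sketched below under assumptions (launcher commands and config values are illustrative, not verified against this exact model revision), is to launch one process per GPU and shard the training state with DeepSpeed ZeRO, since per the estimate above the full training state (~92 GiB) will not fit on a single A30 even under DistributedDataParallel.

```python
# Hypothetical multi-GPU setup: one process per GPU plus DeepSpeed ZeRO stage 3,
# which partitions parameters, gradients and optimizer states across the 4 GPUs
# instead of replicating them. Launch the existing script with, e.g.:
#   deepspeed --num_gpus=4 train.py
# (or torchrun --nproc_per_node=4 train.py for plain DistributedDataParallel).
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",   # "auto" values are filled in from TrainingArguments
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params, grads and optimizer states
        "offload_optimizer": {"device": "cpu"},  # optional: spill optimizer states to host RAM
    },
}

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
    deepspeed=ds_config,    # TrainingArguments accepts a dict or a path to a JSON config
)
```

Alternatively, parameter-efficient fine-tuning (P-Tuning v2 or LoRA) keeps the trainable state small enough that a single GPU, or plain DDP across the A30s, may be sufficient.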

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support: true

Anything else?

No response

MrToy · Mar 17 '23 12:03