
[BUG] CUDA out of memory when evaluating the model

Open Crystalxd opened this issue 1 year ago • 5 comments

Required prerequisites

System information

conda environment torch=2.0.1 transformers=4.29.2 ...

Problem description

I used an A100 (80G) to run the evaluate_zh.py script to evaluate the baichuan model, but it consumed so much GPU memory that it eventually overflowed. I then found that the model is loaded without being put into eval mode, and inference runs without no_grad.

Reproducible example code

The Python snippets:

https://github.com/baichuan-inc/Baichuan-7B/blob/6f3ef4633a90c2d8a3e0763d0dec1b8dc11588f5/evaluation/evaluate_zh.py#L97C13-L97C13
Change this line to:
self.model = model.eval()

https://github.com/baichuan-inc/Baichuan-7B/blob/6f3ef4633a90c2d8a3e0763d0dec1b8dc11588f5/evaluation/evaluate_zh.py#L103
Add on this line:
@torch.inference_mode()
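A minimal sketch of the two fixes together, using a stand-in `torch.nn.Linear` rather than the actual Baichuan loader: `.eval()` switches off training-time behavior (dropout, batchnorm updates), and `torch.inference_mode()` stops autograd from keeping activations alive on the GPU, which is what drives the memory overflow.

```python
import torch

# Stand-in for the LLM; the real script loads Baichuan-7B here.
model = torch.nn.Linear(8, 2)
model = model.eval()  # mirrors the proposed `self.model = model.eval()`

@torch.inference_mode()  # no autograd graph -> activations freed immediately
def run_eval(batch):
    return model(batch)

out = run_eval(torch.randn(4, 8))
# Outputs produced under inference_mode carry no gradient history.
assert not out.requires_grad
```

`torch.no_grad()` would also work as the decorator; `torch.inference_mode()` is the stricter variant and saves slightly more memory.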

Command lines:

No response

Extra dependencies:

No response

Steps to reproduce:

No response

Traceback

No response

Expected behavior

No response

Additional context

No response

Checklist

  • [X] I have provided all relevant and necessary information above.
  • [X] I have chosen a suitable title for this issue.

Crystalxd avatar Sep 12 '23 02:09 Crystalxd

Thank you. It works!!!

Guanze-Chen avatar Oct 19 '23 12:10 Guanze-Chen

Thanks!

ICanFlyGFC avatar Dec 07 '23 14:12 ICanFlyGFC

(auto-reply) Your email has been received; I will reply as soon as possible.

Guanze-Chen avatar Dec 07 '23 14:12 Guanze-Chen

While training the model, the script uses GPU 0 by default. How do I switch it to GPU 1?

Young-X avatar Jul 24 '24 10:07 Young-X
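One common way to do this (a sketch, not part of the Baichuan scripts themselves) is to mask the visible devices before PyTorch initializes CUDA, so the process only ever sees the GPU you want:

```python
import os

# Must run before torch touches CUDA (e.g. at the very top of the
# training script, or as `CUDA_VISIBLE_DEVICES=1 python train.py`).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# With the mask above, "cuda:0" inside this process maps to physical GPU 1.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

Setting the environment variable on the command line is usually safer than in code, since it is guaranteed to precede any CUDA initialization.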

您的邮件已经收到,会尽快回复您

Guanze-Chen avatar Jul 24 '24 10:07 Guanze-Chen