
[BUG] CUDA out of memory when evaluating the model

Open Crystalxd opened this issue 1 year ago • 5 comments

Required prerequisites

System information

conda environment torch=2.0.1 transformers=4.29.2 ...

Problem description

I used an A100 (80G) to run the evaluate_zh.py script to evaluate the baichuan model, but it consumed so much GPU memory that it eventually overflowed. I then found that the model is loaded without being put into eval mode, and inference runs without no_grad.

Reproducible example code

The Python snippets:

https://github.com/baichuan-inc/Baichuan-7B/blob/6f3ef4633a90c2d8a3e0763d0dec1b8dc11588f5/evaluation/evaluate_zh.py#L97C13-L97C13
Change this line to:
self.model = model.eval()

https://github.com/baichuan-inc/Baichuan-7B/blob/6f3ef4633a90c2d8a3e0763d0dec1b8dc11588f5/evaluation/evaluate_zh.py#L103
Add on this line:
@torch.inference_mode()
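A minimal sketch of the two fixes together, using a stand-in `torch.nn.Linear` rather than the actual Baichuan loader: `.eval()` switches off training-time behavior (dropout, batchnorm updates), and `torch.inference_mode()` stops autograd from keeping activations alive on the GPU, which is what drives the memory overflow.

```python
import torch

# Stand-in for the LLM; the real script loads Baichuan-7B here.
model = torch.nn.Linear(8, 2)
model = model.eval()  # mirrors the proposed `self.model = model.eval()`

@torch.inference_mode()  # no autograd graph -> activations freed immediately
def run_eval(batch):
    return model(batch)

out = run_eval(torch.randn(4, 8))
# Outputs produced under inference_mode carry no gradient history.
assert not out.requires_grad
```

`torch.no_grad()` would also work as the decorator; `torch.inference_mode()` is the stricter variant and saves slightly more memory.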

Command lines:

No response

Extra dependencies:

No response

Steps to reproduce:

No response

Traceback

No response

Expected behavior

No response

Additional context

No response

Checklist

  • [X] I have provided all relevant and necessary information above.
  • [X] I have chosen a suitable title for this issue.

Crystalxd avatar Sep 12 '23 02:09 Crystalxd

Thank you. It works!!!

Guanze-Chen avatar Oct 19 '23 12:10 Guanze-Chen

Thanks!

ICanFlyGFC avatar Dec 07 '23 14:12 ICanFlyGFC

(auto-reply) Your email has been received; I will reply as soon as possible.

Guanze-Chen avatar Dec 07 '23 14:12 Guanze-Chen

While training the model, the script uses GPU 0 by default. How do I switch it to GPU 1?

Young-X avatar Jul 24 '24 10:07 Young-X
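One common way to do this (a sketch, not part of the Baichuan scripts themselves) is to mask the visible devices before PyTorch initializes CUDA, so the process only ever sees the GPU you want:

```python
import os

# Must run before torch touches CUDA (e.g. at the very top of the
# training script, or as `CUDA_VISIBLE_DEVICES=1 python train.py`).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# With the mask above, "cuda:0" inside this process maps to physical GPU 1.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

Setting the environment variable on the command line is usually safer than in code, since it is guaranteed to precede any CUDA initialization.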

您的邮件已经收到,会尽快回复您

Guanze-Chen avatar Jul 24 '24 10:07 Guanze-Chen