Baichuan-7B
[Question] Still running out of memory when training on a single A100 80G
Required prerequisites
- [X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Questions
[Question] Still running out of memory when training on a single A100 80G
Training on two 40G cards with ZeRO stage 2 works fine. Is this a DeepSpeed configuration problem, or a version problem?
Has the team run any tests on this setup?
Checklist
- [X] I have provided all relevant and necessary information above.
- [X] I have chosen a suitable title for this issue.
```
[2023-06-26 17:04:13,047] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-06-26 17:04:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-06-26 17:04:13,057] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-06-26 17:04:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:133:init] Reduce bucket size 500,000,000
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:134:init] Allgather bucket size 500,000,000
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:135:init] CPU Offload: True
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:136:init] Round robin gradient partitioning: False
Rank: 0 partition count [1] and sizes[(7000559616, False)]
[2023-06-26 17:04:50,638] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-06-26 17:04:50,639] [INFO] [utils.py:786:see_memory_usage] MA 13.59 GB  Max_MA 13.59 GB  CA 13.59 GB  Max_CA 14 GB
[2023-06-26 17:04:50,639] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 70.03 GB, percent = 7.0%
[2023-06-26 17:05:34,900] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 762262
[2023-06-26 17:05:34,901] [ERROR] [launch.py:321:sigkill_handler] ['/usr/local/miniconda3/envs/bc/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed', '--deepspeed_config', 'config/deepspeed.json.bak'] exits with return code = -9
```
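For reference, the settings visible in this log (bf16, ZeRO stage 2, CPU offload, 500M buckets) correspond to a DeepSpeed config roughly along these lines. This is only a sketch reconstructed from the log output, not the actual `config/deepspeed.json.bak`; the `lr` value is a placeholder assumption:

```json
{
  "bf16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-5 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}
```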
How large is your batch size?
> How large is your batch size?
1
Even after reducing max_len to 512 it still runs out of memory. Could this be a problem in the code itself, or a DeepSpeed version issue?
+1
Could you share how much training data you have under data_dir, and in what format? A screenshot would help, thanks!
Plain .txt files; the data is from Wikipedia.
```
root@I131672d9da00f017a4:/hy-tmp/baichuan-7B/data_dir# ll
total 10004
drwxr-xr-x  2 root root     180 Jun 25 17:07 ./
drwxr-xr-x 12 root root    4096 Jun 27 18:37 ../
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.0
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.1
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.2
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.3
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.4
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.5
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.6
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.7
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.8
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.9
root@I131672d9da00f017a4:/hy-tmp/baichuan-7B/data_dir# tail log.3
<text xml:space="preserve">: ''The Amazing Spider-Man is a comics series. For other uses see [[The Amazing Spider-Man (disambiguation)]].''
[[Image:Firstissue.jpg|thumb|Cover to ''The Amazing Spider-Man'' #1 (Volume 1), March 1963, by [[Steve Ditko]].]]
'''''The Amazing Spider-Man''''' is the title of a [[comic book]] published by [[Marvel Comics]], a [[television program]] and a daily [[newspaper]] [[comic strip]]. All three feature the adventures of the [[superhero]] [[Spider-Man]].
==Comic book ==
Spider-Man originally appeared in issue #15 of the comic book ''[[Amazing Fantasy]]'', its final issue. The series was cancelled with that issue, but response to the character was so positive that the new title, ''The Amazing Spider-Man'' was launched, issue #1 appearing in March 1963.
The character was created by writer/editor [[Stan Lee]] and artist/cowriter [[Steve Ditko]]
```
This looks like it ran out of host memory while loading the model parameters; consider adding more RAM. (Exit code -9 means the process received SIGKILL, which on Linux is typically the kernel OOM killer.)
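To see why host memory is the bottleneck here: with `DeepSpeedCPUAdam` and ZeRO-2 CPU offload, the fp32 master weights plus the two Adam moment buffers are kept in host RAM, roughly 12 bytes per parameter. A back-of-the-envelope estimate using the partition size from the log:

```python
# Rough host-memory estimate for ZeRO stage 2 with CPU offload of a ~7B model.
# fp32 master copy (4 B) + Adam momentum (4 B) + Adam variance (4 B) per param.
params = 7_000_559_616          # partition size reported in the log above
bytes_per_param = 12
optimizer_gb = params * bytes_per_param / 1024**3
print(f"optimizer states: ~{optimizer_gb:.0f} GB of host RAM")  # ~78 GB
```

On top of that come pinned communication buffers and fp32 gradient copies during the offloaded step, so peak host usage can be noticeably higher than this estimate; the actual numbers depend on the DeepSpeed version and config.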
The machine has 200 GB of RAM.