
[Question] Out of memory even when training on a single A100 80G

Open wac81 opened this issue 1 year ago • 9 comments

Required prerequisites

Questions

[Question] Out of memory even when training on a single A100 80G

Two 40G cards with stage 2 work fine. Is this a DeepSpeed (dp) configuration issue, or a version issue?

Has the team done any related testing?

Checklist

  • [X] I have provided all relevant and necessary information above.
  • [X] I have chosen a suitable title for this issue.

wac81 avatar Jun 26 '23 08:06 wac81

[2023-06-26 17:04:13,047] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-06-26 17:04:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-06-26 17:04:13,057] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-06-26 17:04:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:133:init] Reduce bucket size 500,000,000
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:134:init] Allgather bucket size 500,000,000
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:135:init] CPU Offload: True
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:136:init] Round robin gradient partitioning: False
Rank: 0 partition count [1] and sizes[(7000559616, False)]
[2023-06-26 17:04:50,638] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-06-26 17:04:50,639] [INFO] [utils.py:786:see_memory_usage] MA 13.59 GB Max_MA 13.59 GB CA 13.59 GB Max_CA 14 GB
[2023-06-26 17:04:50,639] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 70.03 GB, percent = 7.0%
[2023-06-26 17:05:34,900] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 762262
[2023-06-26 17:05:34,901] [ERROR] [launch.py:321:sigkill_handler] ['/usr/local/miniconda3/envs/bc/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed', '--deepspeed_config', 'config/deepspeed.json.bak'] exits with return code = -9

wac81 avatar Jun 26 '23 09:06 wac81
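For what it's worth, return code = -9 means the process was SIGKILLed, typically by the Linux OOM killer when host RAM runs out (a different failure than a CUDA out-of-memory error). The log also shows 500,000,000-element reduce/allgather buckets; a hypothetical config sketch that shrinks those pinned buffers (values are illustrative only, not the repo's actual config/deepspeed.json.bak):

```python
import json

# Hypothetical ZeRO stage-2 + CPU-offload config sketch. The bucket sizes are
# cut from the 500,000,000 defaults seen in the log to reduce pinned host
# memory; these numbers are guesses to illustrate the knobs, not a fix.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "reduce_bucket_size": 100_000_000,
        "allgather_bucket_size": 100_000_000,
    },
}

print(json.dumps(ds_config, indent=2))
```

Smaller buckets trade some communication efficiency for a lower peak host-memory footprint during gradient reduction.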

How large is the batch size?

formath avatar Jun 29 '23 06:06 formath

How large is the batch size?

1

wac81 avatar Jul 02 '23 03:07 wac81

Even after lowering max_len to 512 it still goes out of memory. Could this be a problem in the code itself, or a DeepSpeed version issue?

wac81 avatar Jul 02 '23 03:07 wac81

+1

jaweii avatar Jul 03 '23 03:07 jaweii

Could you share how much training corpus you have under data_dir, and in what format? A screenshot would be great, thanks!

2132660698 avatar Jul 15 '23 20:07 2132660698

Plain txt files; the data is from Wikipedia.

root@I131672d9da00f017a4:/hy-tmp/baichuan-7B/data_dir# ll
total 10004
drwxr-xr-x  2 root root     180 Jun 25 17:07 ./
drwxr-xr-x 12 root root    4096 Jun 27 18:37 ../
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.0
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.1
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.2
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.3
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.4
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.5
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.6
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.7
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.8
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.9

root@I131672d9da00f017a4:/hy-tmp/baichuan-7B/data_dir# tail log.3 
      <text xml:space="preserve">: ''The Amazing Spider-Man is a comics series. For other uses see [[The Amazing Spider-Man (disambiguation)]].''

[[Image:Firstissue.jpg|thumb|Cover to ''The Amazing Spider-Man'' #1 (Volume 1), March 1963, by [[Steve Ditko]].]]
'''''The Amazing Spider-Man''''' is the title of a [[comic book]] published by [[Marvel Comics]], a [[television program]] and a daily [[newspaper]] [[comic strip]]. All three feature the adventures of the [[superhero]] [[Spider-Man]].

==Comic book ==

Spider-Man originally appeared in issue #15 of the comic book ''[[Amazing Fantasy]]'', its final issue.  The series was cancelled with that issue, but response to the character was so positive that the new title, ''The Amazing Spider-Man'' was launched, issue #1 appearing in March 1963.

The character was created by writer/editor [[Stan Lee]] and artist/cowriter [[Steve Ditko]]

wac81 avatar Jul 17 '23 07:07 wac81

(quoting the DeepSpeed log above, ending in "exits with return code = -9")

This looks like the host ran out of memory while loading the model parameters; consider adding more RAM.

hingkan avatar Aug 03 '23 02:08 hingkan
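As a rough cross-check of that suggestion: with ZeRO-2 CPU offload, DeepSpeedCPUAdam keeps an fp32 master copy of the parameters plus two Adam moment tensors in host RAM. A back-of-the-envelope sketch (assuming 4-byte fp32 states and ignoring pinned gradient buffers and framework overhead):

```python
# Rough host-RAM estimate for DeepSpeedCPUAdam optimizer states under
# ZeRO-2 CPU offload. Assumption: three fp32 tensors per parameter
# (master copy, exp_avg, exp_avg_sq), 4 bytes each.
PARAMS = 7_000_559_616  # partition size reported in the log above

def cpu_adam_state_gb(num_params: int,
                      bytes_per_state: int = 4,
                      num_states: int = 3) -> float:
    """GiB of optimizer state: fp32 params + exp_avg + exp_avg_sq."""
    return num_params * bytes_per_state * num_states / 1024**3

print(f"~{cpu_adam_state_gb(PARAMS):.1f} GiB")  # roughly 78 GiB
```

That is ~78 GiB for the optimizer states alone; transient copies made while the states are being initialized can push peak usage well above that, which would be consistent with the process being killed right at that step.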

(quoting the DeepSpeed log above, ending in "exits with return code = -9")

This looks like the host ran out of memory while loading the model parameters; consider adding more RAM.

The machine has 200 GB of RAM.

wac81 avatar Aug 24 '23 10:08 wac81