Baichuan-7B
[Question] Still running out of memory when training on a single A100 80G
Required prerequisites
- [X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [ ] Consider asking first in a Discussion.
Questions
[Question] Still running out of memory when training on a single A100 80G
Training on two 40G cards with ZeRO stage 2 works fine. Is this a DeepSpeed configuration problem, or a version problem?
Has the team run any tests on this setup?
Checklist
- [X] I have provided all relevant and necessary information above.
- [X] I have chosen a suitable title for this issue.
```
[2023-06-26 17:04:13,047] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-06-26 17:04:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-06-26 17:04:13,057] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-06-26 17:04:13,057] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:133:init] Reduce bucket size 500,000,000
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:134:init] Allgather bucket size 500,000,000
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:135:init] CPU Offload: True
[2023-06-26 17:04:13,057] [INFO] [stage_1_and_2.py:136:init] Round robin gradient partitioning: False
Rank: 0 partition count [1] and sizes[(7000559616, False)]
[2023-06-26 17:04:50,638] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-06-26 17:04:50,639] [INFO] [utils.py:786:see_memory_usage] MA 13.59 GB  Max_MA 13.59 GB  CA 13.59 GB  Max_CA 14 GB
[2023-06-26 17:04:50,639] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 70.03 GB, percent = 7.0%
[2023-06-26 17:05:34,900] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 762262
[2023-06-26 17:05:34,901] [ERROR] [launch.py:321:sigkill_handler] ['/usr/local/miniconda3/envs/bc/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed', '--deepspeed_config', 'config/deepspeed.json.bak'] exits with return code = -9
```
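For reference, the settings visible in this log (bf16, ZeRO stage 2, CPU offload, 500M buckets) correspond to a DeepSpeed config roughly along these lines. This is only a sketch reconstructed from the log output, not the actual `config/deepspeed.json.bak`; the `lr` value is a placeholder assumption:

```json
{
  "bf16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-5 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}
```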
How large is your batch size?
> How large is your batch size?
1
Even after reducing max_len to 512 it still runs out of memory. Could this be a problem in the code itself, or a DeepSpeed version issue?
+1
Could you share how much training data you have under data_dir, and in what format? A screenshot would help, thanks!
Plain .txt files; the data is from Wikipedia.
```
root@I131672d9da00f017a4:/hy-tmp/baichuan-7B/data_dir# ll
total 10004
drwxr-xr-x  2 root root     180 Jun 25 17:07 ./
drwxr-xr-x 12 root root    4096 Jun 27 18:37 ../
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.0
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.1
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.2
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.3
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.4
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.5
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.6
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.7
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.8
-rw-r--r--  1 root root 1024000 Jun 25 17:07 log.9
root@I131672d9da00f017a4:/hy-tmp/baichuan-7B/data_dir# tail log.3
<text xml:space="preserve">: ''The Amazing Spider-Man is a comics series. For other uses see [[The Amazing Spider-Man (disambiguation)]].''
[[Image:Firstissue.jpg|thumb|Cover to ''The Amazing Spider-Man'' #1 (Volume 1), March 1963, by [[Steve Ditko]].]]
'''''The Amazing Spider-Man''''' is the title of a [[comic book]] published by [[Marvel Comics]], a [[television program]] and a daily [[newspaper]] [[comic strip]]. All three feature the adventures of the [[superhero]] [[Spider-Man]].
==Comic book ==
Spider-Man originally appeared in issue #15 of the comic book ''[[Amazing Fantasy]]'', its final issue. The series was cancelled with that issue, but response to the character was so positive that the new title, ''The Amazing Spider-Man'' was launched, issue #1 appearing in March 1963.
The character was created by writer/editor [[Stan Lee]] and artist/cowriter [[Steve Ditko]]
```
This looks like it ran out of host memory while loading the model parameters; consider adding more RAM. (Exit code -9 means the process received SIGKILL, which on Linux is typically the kernel OOM killer.)
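To see why host memory is the bottleneck here: with `DeepSpeedCPUAdam` and ZeRO-2 CPU offload, the fp32 master weights plus the two Adam moment buffers are kept in host RAM, roughly 12 bytes per parameter. A back-of-the-envelope estimate using the partition size from the log:

```python
# Rough host-memory estimate for ZeRO stage 2 with CPU offload of a ~7B model.
# fp32 master copy (4 B) + Adam momentum (4 B) + Adam variance (4 B) per param.
params = 7_000_559_616          # partition size reported in the log above
bytes_per_param = 12
optimizer_gb = params * bytes_per_param / 1024**3
print(f"optimizer states: ~{optimizer_gb:.0f} GB of host RAM")  # ~78 GB
```

On top of that come pinned communication buffers and fp32 gradient copies during the offloaded step, so peak host usage can be noticeably higher than this estimate; the actual numbers depend on the DeepSpeed version and config.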
The machine has 200 GB of RAM.