
Step2: memory allocation of 2097152 bytes failed

Open YukinoshitaKaren opened this issue 1 year ago • 3 comments

When I run step 2 using 'bash training_scripts/single_node/run_350m.sh', I get the following error:

[2023-04-16 21:36:09,031] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-16 21:36:09,031] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-16 21:36:09,031] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-16 21:36:09,031] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-16 21:36:16,042] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
memory allocation of memory allocation of 20971522097152 bytes failed
 bytes failed
memory allocation of 2097152 bytes failed
[2023-04-16 21:54:13,235] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62215
[2023-04-16 21:54:16,017] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62216
[2023-04-16 21:54:16,029] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62217
[2023-04-16 21:54:18,477] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62218
[2023-04-16 21:54:21,046] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62219
[2023-04-16 21:54:21,057] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62220
[2023-04-16 21:54:21,060] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62221
[2023-04-16 21:54:23,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 62222

I have already allocated 50 GB of memory, but it still fails.

YukinoshitaKaren • Apr 17 '23 02:04

@YukinoshitaKaren, can you please try with single gpu?

tjruwase • Apr 18 '23 19:04

I have already allocated 50 GB of memory, but it still fails.

Can you explain what this means?

tjruwase • Apr 18 '23 19:04

I use the Slurm directive '##sbatch mem=50G' to request memory.

YukinoshitaKaren • Apr 24 '23 02:04
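
One way to check whether the 50G request actually reached the job is to print the limits from inside it before launching training. The sketch below is only a diagnostic aid, not part of DeepSpeed or DeepSpeedExamples. It assumes Slurm exports SLURM_MEM_PER_NODE (in MB) when a memory request was accepted, and the helper names are made up for illustration. The even split across 8 processes is a rough figure only, since the job's memory is shared between subprocesses rather than partitioned.

```python
import os
import resource


def slurm_mem_per_node_mb(environ=None):
    """Return the job's per-node memory request in MB, or None if not set.

    Assumption: Slurm exports SLURM_MEM_PER_NODE (in MB) for jobs whose
    memory request was accepted. A missing variable suggests the request
    never took effect.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("SLURM_MEM_PER_NODE", "")
    return int(value) if value.isdigit() else None


def address_space_limit_bytes():
    """Return the soft RLIMIT_AS of this process, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


def per_process_share_mb(total_mb, num_procs):
    """Rough even split of a node-level budget across subprocesses."""
    return total_mb // num_procs


if __name__ == "__main__":
    total = slurm_mem_per_node_mb()
    print("SLURM_MEM_PER_NODE (MB):", total)
    print("RLIMIT_AS soft limit (bytes):", address_space_limit_bytes())
    if total is not None:
        # 8 matches num_local_procs in the launcher log above.
        print("Even share per process (MB):", per_process_share_mb(total, 8))
```

If the first line prints None, the memory request probably did not apply to the job, and the cluster default would be in effect instead.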

[2023-04-25 18:15:23,050] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 74406
[2023-04-25 18:15:44,760] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 74407
[2023-04-25 18:16:15,378] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 74408
[2023-04-25 18:16:15,447] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 74409
[2023-04-25 18:16:30,320] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 74410
[2023-04-25 18:17:02,671] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 74411

Step 2: same problem. The script aborts with no error message.

Aurora-6 • Apr 25 '10:04
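
When a run aborts with no error message, the exit status of the failing process can still say something. The sketch below is a generic Python helper, not anything from the DeepSpeed launcher, and describe_exit is a hypothetical name. It relies on the documented behaviour of Python's subprocess module, where a negative returncode means the child was terminated by that signal number. A child killed by SIGKILL, for example, may have been stopped by the system's out-of-memory handling, though the exit status alone cannot confirm that.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess returncode into a short human-readable string."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # Negative values mean "terminated by signal -returncode".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Illustrative child process that simply exits with status 3.
    child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(describe_exit(child.returncode))
```

Wrapping the training command this way, or checking the job's exit state in the scheduler's accounting output, may help tell a crash apart from an external kill.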