
Can't this model be trained using multiple nodes and multiple cards?

Open · Tengfei9228 opened this issue 2 years ago · 2 comments

My machine is a single node with 4 cards of 16 GB memory each. Running the 16B model across multiple nodes results in OOM regardless of how many nodes I set. Can this model be trained with multiple machines and multiple cards? Here are some of the training logs:

```
[2023-05-04 16:21:02,954] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: c05r2n11,c05r2n16,c05r2n19,c05r3n02,c06r4n06,c06r4n09,c06r4n12,c06r4n17
[2023-05-04 16:21:02,955] [INFO] [runner.py:453:main] cmd = pdsh -f 1024 -w c05r2n11,c05r2n16,c05r2n19,c05r3n02,c06r4n06,c06r4n09,c06r4n12,c06r4n17 export NCCL_SOCKET_IFNAME=ib0; export UCX_MAX_EAGER_LANES=4; export UCX_MAX_RNDV_LANES=4; export UCX_ZCOPY_THRESH=auto; export UCX_RNDV_THRESH=auto; export UCX_DC_MLX5_NUM_DCI=16; export NCCL_IB_HCA=mlx5_0; export NCCL_DEBUG=info; export PYTHONPATH=/work/home/actvg1ue59/CodeGen-main; cd /work/home/actvg1ue59/CodeGen-main; /work/home/actvg1ue59/miniconda3/envs/torch/bin/python -u -m deepspeed.launcher.launch --world_info=eyJjMDVyMm4xMSI6IFswLCAxLCAyLCAzXSwgImMwNXIybjE2IjogWzAsIDEsIDIsIDNdLCAiYzA1cjJuMTkiOiBbMCwgMSwgMiwgM10sICJjMDVyM24wMiI6IFswLCAxLCAyLCAzXSwgImMwNnI0bjA2IjogWzAsIDEsIDIsIDNdLCAiYzA2cjRuMDkiOiBbMCwgMSwgMiwgM10sICJjMDZyNG4xMiI6IFswLCAxLCAyLCAzXSwgImMwNnI0bjE3IjogWzAsIDEsIDIsIDNdfQ== --node_rank=%n --master_addr=c05r2n11 --master_port=9901 train.py
c05r3n02: Warning: Permanently added 'c05r3n02,10.3.5.43' (ECDSA) to the list of known hosts.
c05r2n16: Warning: Permanently added 'c05r2n16,10.3.5.37' (ECDSA) to the list of known hosts.
c05r2n19: Warning: Permanently added 'c05r2n19,10.3.5.40' (ECDSA) to the list of known hosts.
c06r4n06: Warning: Permanently added 'c06r4n06,10.3.6.67' (ECDSA) to the list of known hosts.
c06r4n09: Warning: Permanently added 'c06r4n09,10.3.6.70' (ECDSA) to the list of known hosts.
c06r4n12: Warning: Permanently added 'c06r4n12,10.3.6.73' (ECDSA) to the list of known hosts.
c06r4n17: Warning: Permanently added 'c06r4n17,10.3.6.78' (ECDSA) to the list of known hosts.
c05r2n11: Currently Loaded Modulefiles:
c05r2n11:  1) compiler/devtoolset/7.3.1   3) compiler/dtk/22.10.1
c05r2n11:  2) mpi/hpcx/gcc-7.3.1
c05r2n11: [2023-05-04 16:21:07,883] [INFO] [launch.py:96:main] 0 NCCL_SOCKET_IFNAME=ib0
c05r2n11: [2023-05-04 16:21:07,883] [INFO] [launch.py:96:main] 0 NCCL_IB_HCA=mlx5_0
c05r2n11: [2023-05-04 16:21:07,883] [INFO] [launch.py:96:main] 0 NCCL_DEBUG=info
c05r2n11: [2023-05-04 16:21:07,883] [INFO] [launch.py:103:main] WORLD INFO DICT: {'c05r2n11': [0, 1, 2, 3], 'c05r2n16': [0, 1, 2, 3], 'c05r2n19': [0, 1, 2, 3], 'c05r3n02': [0, 1, 2, 3], 'c06r4n06': [0, 1, 2, 3], 'c06r4n09': [0, 1, 2, 3], 'c06r4n12': [0, 1, 2, 3], 'c06r4n17': [0, 1, 2, 3]}
c05r2n11: [2023-05-04 16:21:07,883] [INFO] [launch.py:109:main] nnodes=8, num_local_procs=4, node_rank=0
c05r2n11: [2023-05-04 16:21:07,884] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'c05r2n11': [0, 1, 2, 3], 'c05r2n16': [4, 5, 6, 7], 'c05r2n19': [8, 9, 10, 11], 'c05r3n02': [12, 13, 14, 15], 'c06r4n06': [16, 17, 18, 19], 'c06r4n09': [20, 21, 22, 23], 'c06r4n12': [24, 25, 26, 27], 'c06r4n17': [28, 29, 30, 31]})
c05r2n11: [2023-05-04 16:21:07,884] [INFO] [launch.py:123:main] dist_world_size=32
c05r2n11: [2023-05-04 16:21:07,884] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[equivalent launch.py output from node_rank 1-7 omitted; every node reports nnodes=8, num_local_procs=4, dist_world_size=32, CUDA_VISIBLE_DEVICES=0,1,2,3]
c05r2n11:   File "/work/home/actvg1ue59/miniconda3/envs/torch/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 946, in load_checkpoint_in_model
c05r2n11:     set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
c05r2n11:   File "/work/home/actvg1ue59/miniconda3/envs/torch/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 149, in set_module_tensor_to_device
c05r2n11:     new_value = value.to(device)
c05r2n11: RuntimeError: HIP out of memory. Tried to allocate 576.00 MiB (GPU 0; 15.98 GiB total capacity; 15.83 GiB already allocated; 0 bytes free; 15.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
[the same traceback and RuntimeError are raised on GPU 1, GPU 2, and GPU 3]
```
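The traceback suggests why adding nodes does not help here: each local rank appears to materialize the full checkpoint on its own GPU inside accelerate's `load_checkpoint_in_model`, before any sharding can take place. A back-of-envelope check (my arithmetic, assuming the 16B model's weights are stored in fp16):

```python
# Rough estimate of memory for a full weight copy (assumption: 16e9 params, fp16).
params = 16e9
bytes_per_param = 2  # fp16
weights_gib = params * bytes_per_param / 2**30
print(f"~{weights_gib:.1f} GiB for the weights alone")
```

That is roughly 29.8 GiB per GPU for the weights alone, which overflows a 15.98 GiB card no matter how many nodes join, and matches the "15.83 GiB already allocated" in the error above.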

— Tengfei9228, May 04 '23

A 16 GB graphics card may be somewhat limiting. Perhaps you could try offloading to CPU or to NVMe to save more memory, in addition to multi-node training.
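A minimal sketch of what that suggestion could look like as a DeepSpeed config: ZeRO stage 3 with parameter and optimizer offload to CPU (for NVMe offload, switch `device` to `"nvme"` and add an `nvme_path`). The values below are illustrative, not tuned:

```python
# Illustrative DeepSpeed config enabling ZeRO-3 with CPU offload.
# Pass it as a dict to deepspeed.initialize(...) or save it as a JSON file.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition params, gradients, and optimizer state across ranks
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
```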

— rooa, May 04 '23

What I mean is that I am already using multiple nodes on my end, but I still get OOM. Is it because the code itself does not support it? My multi-card launch command is:

```
deepspeed --num_nodes 16 --hostfile=./hostfiles/hostfile-dl-$SLURM_JOB_ID --num_gpus 4 --master_addr $dist_url --master_port=9901 train.py
```
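If ZeRO-3 partitioning actually took effect at load time, the per-GPU share of the weights across a 16-node × 4-GPU launch would be tiny; a rough estimate (my arithmetic, fp16 assumed):

```python
# Per-rank weight shard under ZeRO-3, assuming fp16 and 16 nodes x 4 GPUs each.
params = 16e9
full_copy_gib = params * 2 / 2**30   # what each GPU tries to hold today: ~29.8 GiB
world_size = 16 * 4                  # 64 ranks
shard_gib = full_copy_gib / world_size
print(f"full copy ~{full_copy_gib:.1f} GiB vs ZeRO-3 shard ~{shard_gib:.2f} GiB per GPU")
```

The OOM in spite of that headroom suggests the checkpoint-loading path (accelerate's `load_checkpoint_in_model` in the traceback) materializes full weights on each rank before DeepSpeed can partition them, rather than the node count being at fault.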

— Tengfei9228, May 05 '23