CUDA error: OS call failed or operation not supported on this OS when initializing pipeline
Recently I was working on a pipeline version of CodeGen, but I got the following error when running my program:
[2023-04-21 16:15:19,068] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 16:15:19,159] [INFO] [runner.py:540:main] cmd = /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=27643 --enable_each_rank_log=None train.py
[2023-04-21 16:15:26,697] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=eth0
[2023-04-21 16:15:26,697] [INFO] [launch.py:222:main] 0 NCCL_IB_HCA=mlx5_0,mlx5_2
[2023-04-21 16:15:26,697] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-04-21 16:15:26,698] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-04-21 16:15:26,698] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-04-21 16:15:26,698] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-21 16:15:26,698] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-04-21 16:15:36,322] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoord(pipe=1, data=0): 1, ProcessCoord(pipe=2, data=0): 2, ProcessCoord(pipe=3, data=0): 3, ProcessCoord(pipe=4, data=0): 4, ProcessCoord(pipe=5, data=0): 5, ProcessCoord(pipe=6, data=0): 6, ProcessCoord(pipe=7, data=0): 7}
[2023-04-21 16:15:44,150] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=5
0: EmbeddingPipe
1: pre_forward
2: CodeGenBlockPipe
3: CodeGenBlockPipe
4: CodeGenBlockPipe
stage=1 layers=5
5: CodeGenBlockPipe
6: CodeGenBlockPipe
7: CodeGenBlockPipe
8: CodeGenBlockPipe
9: CodeGenBlockPipe
stage=2 layers=5
10: CodeGenBlockPipe
11: CodeGenBlockPipe
12: CodeGenBlockPipe
13: CodeGenBlockPipe
14: CodeGenBlockPipe
stage=3 layers=5
15: CodeGenBlockPipe
16: CodeGenBlockPipe
17: CodeGenBlockPipe
18: CodeGenBlockPipe
19: CodeGenBlockPipe
stage=4 layers=5
20: CodeGenBlockPipe
21: CodeGenBlockPipe
22: CodeGenBlockPipe
23: CodeGenBlockPipe
24: CodeGenBlockPipe
stage=5 layers=5
25: CodeGenBlockPipe
26: CodeGenBlockPipe
27: CodeGenBlockPipe
28: CodeGenBlockPipe
29: CodeGenBlockPipe
stage=6 layers=5
30: CodeGenBlockPipe
31: CodeGenBlockPipe
32: CodeGenBlockPipe
33: CodeGenBlockPipe
34: CodeGenBlockPipe
stage=7 layers=3
35: CodeGenBlockPipe
36: LayerNormPipe
37: CodeGenLMHead
loss: <lambda>
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
[2023-04-21 16:16:20,822] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.0, git-hash=unknown, git-branch=unknown
[2023-04-21 16:16:20,996] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.531783103942871 seconds
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.6049892902374268 seconds
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.6580934524536133 seconds
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-04-21 16:16:25,860] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-04-21 16:16:25,861] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-04-21 16:16:25,861] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-04-21 16:16:25,861] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 1 optimizer
[2023-04-21 16:16:25,861] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500,000,000
[2023-04-21 16:16:25,861] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500,000,000
[2023-04-21 16:16:25,862] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: True
[2023-04-21 16:16:25,862] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6496522426605225 seconds
Loading extension module utils...
Time to load utils op: 0.10721945762634277 seconds
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.6540791988372803 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.3316712379455566 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
ninja: no work to do.
Loading extension module utils...
Loading extension module cpu_adam...
Time to load utils op: 0.6327915191650391 seconds
Time to load cpu_adam op: 1.6265079975128174 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.3759047985076904 seconds
Rank: 7 partition count [1] and sizes[(270611456, False)]
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Rank: 0 partition count [1] and sizes[(497089536, False)]
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6700611114501953 seconds
Loading extension module utils...
Time to load utils op: 0.6102221012115479 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Rank: 4 partition count [1] and sizes[(566338560, False)]
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6231212615966797 seconds
Loading extension module utils...
Time to load utils op: 0.40987205505371094 seconds
Rank: 5 partition count [1] and sizes[(566338560, False)]
Rank: 2 partition count [1] and sizes[(566338560, False)]
Found cached dataset parquet (/mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Loading cached split indices for dataset at /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-4de47df4d6403202.arrow and /mnt/petrelfs/xingshuhao.dispatch/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-715f206eec35a791/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f27892fb794db1ff.arrow
Rank: 1 partition count [1] and sizes[(566338560, False)]
Rank: 3 partition count [1] and sizes[(566338560, False)]
Installed CUDA version 11.2 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 1.52874755859375 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.010000, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.6605522632598877 seconds
Rank: 6 partition count [1] and sizes[(566338560, False)]
[2023-04-21 16:16:46,131] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-04-21 16:16:46,132] [INFO] [utils.py:786:see_memory_usage] MA 2.45 GB Max_MA 2.45 GB CA 2.46 GB Max_CA 2 GB
[2023-04-21 16:16:46,133] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 176.94 GB, percent = 17.6%
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.006943464279174805 seconds
[2023-04-21 16:16:47,713] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-04-21 16:16:47,714] [INFO] [utils.py:786:see_memory_usage] MA 2.45 GB Max_MA 2.45 GB CA 2.46 GB Max_CA 2 GB
[2023-04-21 16:16:47,715] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 206.52 GB, percent = 20.5%
[2023-04-21 16:16:47,715] [INFO] [stage_1_and_2.py:489:__init__] optimizer state initialized
[2023-04-21 16:16:47,847] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-04-21 16:16:47,848] [INFO] [utils.py:786:see_memory_usage] MA 2.45 GB Max_MA 2.45 GB CA 2.46 GB Max_CA 2 GB
[2023-04-21 16:16:47,849] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 211.42 GB, percent = 21.0%
[2023-04-21 16:16:47,850] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-04-21 16:16:47,850] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-21 16:16:47,850] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-21 16:16:47,850] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.01], mom=[(0.9, 0.999)]
[2023-04-21 16:16:47,851] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-21 16:16:47,851] [INFO] [config.py:957:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-04-21 16:16:47,851] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-21 16:16:47,851] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-04-21 16:16:47,851] [INFO] [config.py:957:print] amp_params ................... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f1d854214c0>
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] dataloader_drop_last ......... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] disable_allgather ............ False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] dump_state ................... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... None
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_enabled ........... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] eigenvalue_verbose ........... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] fp16_auto_cast ............... None
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] fp16_enabled ................. False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] global_rank .................. 0
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] grad_accum_dtype ............. None
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] gradient_accumulation_steps .. 16
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] gradient_clipping ............ 0.0
[2023-04-21 16:16:47,852] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] initial_dynamic_scale ........ 65536
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] load_universal_checkpoint .... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] loss_scale ................... 0
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] memory_breakdown ............. False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] optimizer_name ............... adam
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] optimizer_params ............. {'lr': 0.01}
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] pld_params ................... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] prescale_gradients ........... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] scheduler_name ............... None
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] scheduler_params ............. None
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] sparse_attention ............. None
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] steps_per_print .............. 2000
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] train_batch_size ............. 64
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 4
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] use_node_local_storage ....... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] world_size ................... 1
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] zero_allow_untested_optimizer True
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=True
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. False
[2023-04-21 16:16:47,853] [INFO] [config.py:957:print] zero_optimization_stage ...... 1
[2023-04-21 16:16:47,854] [INFO] [config.py:943:print_user_config] json = {
"fp16": {
"enabled": false
},
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.01
}
},
"zero_optimization": {
"stage": 1,
"offload_optimizer": {
"device": "cpu"
}
},
"gradient_accumulation_steps": 16,
"steps_per_print": 2.000000e+03,
"train_micro_batch_size_per_gpu": 4
}
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.01209115982055664 seconds
[2023-04-21 16:16:47,866] [INFO] [engine.py:81:__init__] CONFIG: micro_batches=16 micro_batch_size=4
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.006413698196411133 seconds
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0018720626831054688 seconds
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0019464492797851562 seconds
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0015425682067871094 seconds
Using /mnt/petrelfs/xingshuhao.dispatch/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0014889240264892578 seconds
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/petrelfs/xingshuhao.dispatch/collie/examples/codegen_pipeline/train.py: │
│ 110 in <module> │
│ │
│ 107 │ # optimizer = torch.optim.AdamW(model_parameters, lr=collie_args.l │
│ 108 │ optimizer = None │
│ 109 │ lr_scheduler = None │
│ ❱ 110 │ trainer = PipelineTrainer( │
│ 111 │ │ model=codegen_pipeline, │
│ 112 │ │ collie_args=collie_args, │
│ 113 │ │ data_collator={ │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/collie/examples/codegen_pipeline/../../col │
│ lie/trainer/deepspeed_pipeline_trainer.py:52 in __init__ │
│ │
│ 49 │ │ │ raise ModuleNotFoundError( │
│ 50 │ │ │ │ "Detected DeepSpeed is not installed. See https://gith │
│ 51 │ │ │
│ ❱ 52 │ │ self.engine, self.optimizer, _, self.lr_scheduler = deepspeed. │
│ 53 │ │ │ config=collie_args.deepspeed, │
│ 54 │ │ │ model=model, │
│ 55 │ │ │ lr_scheduler=lr_scheduler, │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/__init__.py:171 in initialize │
│ │
│ 168 │ │ assert mpu is None, "mpu must be None with pipeline parallelis │
│ 169 │ │ mpu = model.mpu() │
│ 170 │ │ config_class = DeepSpeedConfig(config, mpu) │
│ ❱ 171 │ │ engine = PipelineEngine(args=args, │
│ 172 │ │ │ │ │ │ │ │ model=model, │
│ 173 │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 174 │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/runtime/pipe/engine.py:53 in __init__ │
│ │
│ 50 │ DTYPE_TO_ID = {dtype: id_ for id_, dtype in enumerate(ID_TO_DTYPE │
│ 51 │ │
│ 52 │ def __init__(self, has_bool_tensors=False, *super_args, **super_k │
│ ❱ 53 │ │ super().__init__(*super_args, **super_kwargs) │
│ 54 │ │ assert isinstance(self.module, PipelineModule), "model must b │
│ 55 │ │ │
│ 56 │ │ assert self.zero_optimization_stage() < 2, "ZeRO-2 and ZeRO-3 │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/runtime/engine.py:328 in __init__ │
│ │
│ 325 │ │ │ model_parameters = list(model_parameters) │
│ 326 │ │ │
│ 327 │ │ if has_optimizer: │
│ ❱ 328 │ │ │ self._configure_optimizer(optimizer, model_parameters) │
│ 329 │ │ │ self._configure_lr_scheduler(lr_scheduler) │
│ 330 │ │ │ self._report_progress(0) │
│ 331 │ │ elif self.zero_optimization(): │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/runtime/engine.py:1187 in _configure_optimizer │
│ │
│ 1184 │ │ optimizer_wrapper = self._do_optimizer_sanity_check(basic_opt │
│ 1185 │ │ │
│ 1186 │ │ if optimizer_wrapper == ZERO_OPTIMIZATION: │
│ ❱ 1187 │ │ │ self.optimizer = self._configure_zero_optimizer(basic_opt │
│ 1188 │ │ elif optimizer_wrapper == AMP: │
│ 1189 │ │ │ amp_params = self.amp_params() │
│ 1190 │ │ │ log_dist(f"Initializing AMP with these params: {amp_param │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/runtime/engine.py:1418 in _configure_zero_optimizer │
│ │
│ 1415 │ │ │ │ if overlap_comm: │
│ 1416 │ │ │ │ │ logger.warning("Pipeline parallelism does not sup │
│ 1417 │ │ │ │ │ overlap_comm = False │
│ ❱ 1418 │ │ │ optimizer = DeepSpeedZeroOptimizer( │
│ 1419 │ │ │ │ optimizer, │
│ 1420 │ │ │ │ self.param_names, │
│ 1421 │ │ │ │ timers=timers, │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/runtime/zero/stage_1_and_2.py:485 in __init__ │
│ │
│ 482 │ │ self.dynamic_loss_scale = self.loss_scaler.dynamic │
│ 483 │ │ │
│ 484 │ │ see_memory_usage("Before initializing optimizer states", forc │
│ ❱ 485 │ │ self.initialize_optimizer_states() │
│ 486 │ │ see_memory_usage("After initializing optimizer states", force │
│ 487 │ │ │
│ 488 │ │ if dist.get_rank() == 0: │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/runtime/zero/stage_1_and_2.py:611 in │
│ initialize_optimizer_states │
│ │
│ 608 │ │ │ single_grad_partition = torch.zeros(int(self.partition_si │
│ 609 │ │ │ │ │ │ │ │ │ │ │ │ dtype=self.single_par │
│ 610 │ │ │ │ │ │ │ │ │ │ │ │ device=self.device) │
│ ❱ 611 │ │ │ self.single_partition_of_fp32_groups[i].grad = get_accele │
│ 612 │ │ │ │ single_grad_partition) if self.cpu_offload else singl │
│ 613 │ │ │
│ 614 │ │ self.optimizer.step() │
│ │
│ /mnt/petrelfs/xingshuhao.dispatch/anaconda3/envs/deepspeed/lib/python3.8/sit │
│ e-packages/deepspeed/accelerator/cuda_accelerator.py:217 in pin_memory │
│ │
│ 214 │ │ return torch.cuda.LongTensor │
│ 215 │ │
│ 216 │ def pin_memory(self, tensor): │
│ ❱ 217 │ │ return tensor.pin_memory() │
│ 218 │ │
│ 219 │ def on_accelerator(self, tensor): │
│ 220 │ │ device_str = str(tensor.device) │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: OS call failed or operation not supported on this OS
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The smallest code that can reproduce the error:
import sys
sys.path.append("../../")
import os
import torch
import deepspeed
from deepspeed.pipe import PipelineModule
from transformers import CodeGenConfig
from codegen_pipeline import CodeGenForCausalLMPipe # my pipeline model
if __name__ == "__main__":
    # CUDA_LAUNCH_BLOCKING=1 bash test.sh
    deepspeed.init_distributed("nccl")
    load_path = "Salesforce/codegen-16B-mono"
    config = CodeGenConfig.from_pretrained(load_path)
    config.gradient_checkpointing = True
    # config.n_layer = 4
    # config.n_ctx = 512
    config.n_embd = 3072
    # config.n_positions = 512
    config.n_head = 8
    codegen_pipeline = PipelineModule(
        CodeGenForCausalLMPipe(config=config).to_layers(),
        num_stages=int(os.environ["WORLD_SIZE"])
    )
    deepspeed.initialize(
        config="ds_config.json",
        model=codegen_pipeline,
    )
My DeepSpeed config (ds_config.json):
{
  "fp16": {
    "enabled": false
  },
  "zero_allow_untested_optimizer": true,
  "zero_force_ds_cpu_optimizer": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.01
    }
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "gradient_accumulation_steps": 16,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 4
}
This happens when I try to run a demo with codegen-16B-mono. What is weird is that when the parameters are smaller (n_embd=2048, n_head=8), this error disappears. My machine has enough memory (about 800 GB available according to free), and GPU memory is sufficient too.
If ZeRO optimization is set to stage 0, the error also disappears.
What could cause this error? Could it be caused by something improper in the model structure? I really don't know how to debug this.
- torch 2.0.0+cu117
- deepspeed 0.9.0
By the way, the error only occurs when I use the ZeRO optimizer at stage 1. If ZeRO offload is set to false, I get another error instead: Illegal memory access.
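Since the traceback ends in tensor.pin_memory() inside DeepSpeedZeroOptimizer.initialize_optimizer_states (a path that only runs when optimizer CPU offload is enabled), a minimal check like the sketch below could show whether pinned-memory allocation alone already fails on this machine, independent of DeepSpeed. The tensor size is just an assumption taken from the partition sizes printed in the log (566,338,560 fp32 elements, roughly 2.1 GB); it is not part of the original repro.
import torch

# Sketch: try to pin a host buffer roughly the size of the largest ZeRO
# partition reported above. If this raises the same "OS call failed or
# operation not supported on this OS" error, the problem is in pinned
# (page-locked) memory allocation itself, not in the pipeline model.
numel = 566_338_560  # assumed from "Rank: 6 partition count [1] and sizes[(566338560, False)]"
buf = torch.zeros(numel, dtype=torch.float32)
pinned = buf.pin_memory()
print("pinned OK:", pinned.is_pinned())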
Hi @x54-729, I met exactly the same problem, and I noticed that it works fine when I disable the optimizer offload feature. Not sure why.
See #3481
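For reference, a minimal sketch of the config change mentioned above: keep ZeRO stage 1 but remove the offload_optimizer block, so the failing pin_memory path is never taken. This only mirrors what the comment describes and is not a verified fix; whether the optimizer states of codegen-16B-mono then fit in GPU memory is untested.
import deepspeed

# Same settings as ds_config.json above, but without "offload_optimizer".
# deepspeed.initialize also accepts the config as a dict instead of a file path.
ds_config = {
    "fp16": {"enabled": False},
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
    "optimizer": {"type": "Adam", "params": {"lr": 0.01}},
    "zero_optimization": {"stage": 1},  # no CPU offload of optimizer states
    "gradient_accumulation_steps": 16,
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 4,
}

# engine, optimizer, _, lr_scheduler = deepspeed.initialize(
#     config=ds_config,
#     model=codegen_pipeline,
# )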