DeepSpeedExamples
Errors in CIFAR training and compression examples
I am trying to run the CIFAR example. The one in the training folder gives me the first error below, and the one in the compression folder gives a different error.
Python 3.9.16, PyTorch 1.13.0, DeepSpeed 0.9.5, CUDA 11.7
Singularity> bash run_ds.sh
[2023-08-16 16:30:21,720] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:22,197] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2023-08-16 16:30:22,214] [INFO] [runner.py:555:main] cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2023-08-16 16:30:23,632] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-08-16 16:30:24,125] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-08-16 16:30:24,125] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-08-16 16:30:24,125] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-08-16 16:30:24,125] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-08-16 16:30:24,125] [INFO] [launch.py:163:main] dist_world_size=2
[2023-08-16 16:30:24,125] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-08-16 16:30:25,791] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:25,791] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-16 16:30:26,245] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-16 16:30:26,245] [INFO] [comm.py:596:init_distributed] cdb=None
[2023-08-16 16:30:26,245] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-16 16:30:26,245] [INFO] [comm.py:627:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-16 16:30:26,245] [INFO] [comm.py:596:init_distributed] cdb=None
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
truck dog deer cat
[2023-08-16 16:30:30,481] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
Traceback (most recent call last):
File "/ocean/projects/cis230018p/ssilva/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 313, in
Singularity> bash run_compress.sh
/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-08-16 16:52:34,170] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
usage: train.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N] [--local_rank LOCAL_RANK] [--lr LR] [--lr-decay LR_DECAY]
[--lr-decay-epoch LR_DECAY_EPOCH [LR_DECAY_EPOCH ...]] [--seed S] [--weight-decay W] [--batch-norm] [--residual] [--cuda]
[--saving-folder SAVING_FOLDER] [--compression] [--path-to-model PATH_TO_MODEL] [--deepspeed] [--deepspeed_config DEEPSPEED_CONFIG]
[--deepscale] [--deepscale_config DEEPSCALE_CONFIG] [--deepspeed_mpi]
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 16794) of binary: /opt/miniconda/envs/env2/bin/python
Traceback (most recent call last):
File "/opt/miniconda/envs/env2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/miniconda/envs/env2/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/jet/home/ssilva/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in
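The "unrecognized arguments: --local-rank=0" failure happens because newer torch.distributed launchers pass the hyphenated --local-rank flag, while train.py only declares --local_rank (as shown in its usage message above). A minimal sketch of the fix the deprecation warning suggests, assuming a standard argparse setup (the exact wiring into train.py is an assumption):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Registering both spellings maps either --local_rank or --local-rank to
# the same args.local_rank attribute, so launchers that pass the
# hyphenated form no longer trigger "unrecognized arguments".
parser.add_argument("--local_rank", "--local-rank", type=int, default=-1,
                    help="local rank passed by the distributed launcher")

# Simulate what the launcher passes on the command line:
args = parser.parse_args(["--local-rank=0"])

# torchrun sets --use-env by default, so the LOCAL_RANK environment
# variable is the most robust source; fall back to the parsed flag.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
```

Reading LOCAL_RANK from the environment is the forward-compatible path, since the flag itself is slated for removal per the warning above.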
For the one in the training folder: it looks like the script is being given a path to a DeepSpeed config JSON file that it doesn't need. This is also visible in the command being executed: cmd = /opt/miniconda/envs/env2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
The correct command does not need the "--deepspeed_config ds_config.json" flag.
I also see you are not using the latest version of DeepSpeed. Please upgrade to 0.10 and try again; I don't see this issue when running 0.10: pip install --upgrade deepspeed
For the compression one: I see the same issue. Let me dig deeper and report back to you.
I have just started over in a new environment and upgraded DeepSpeed, but I keep getting this issue:
[2023-10-07 01:37:30,894] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-10-07 01:37:30,894] [INFO] [launch.py:163:main] dist_world_size=2
[2023-10-07 01:37:30,894] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
libnuma: Warning: cpu argument 0-19 is out of range
<0-19> is invalid
usage: numactl [--all | -a] [--interleave= | -i
memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<20-39> is invalid
usage: numactl [--all | -a] [--interleave= | -i
memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
Just remove the "--deepspeed_config ds_config.json \" line in run_ds.sh.