
When I execute the command "python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu" I get the following output. I have tried many solutions, but all of them failed.

LanShanPi opened this issue 2 years ago • 8 comments

[2023-04-19 10:49:12,862] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-19 10:49:12,872] [INFO] [runner.py:540:main] cmd = /home/ubuntu/anaconda3/envs/DeepSpeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir /home/ubuntu/Project/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-19 10:49:15,093] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-19 10:49:15,093] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-19 10:49:15,093] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-19 10:49:15,093] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-19 10:49:15,093] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-19 10:49:18,024] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)

0%| | 0/2 [00:00<?, ?it/s] 100%|██████████| 2/2 [00:00<00:00, 580.08it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(the tokenizers warning above is printed several more times; repeats omitted)
Using /home/ubuntu/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/include -isystem /home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/include/TH -isystem /home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ubuntu/anaconda3/envs/DeepSpeed/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -std=c++17 -c /home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
(ninja re-echoes the same nvcc command here)
nvcc fatal : Unsupported gpu architecture 'compute_89'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 339, in <module>
    main()
  File "main.py", line 271, in main
    optimizer = AdamOptimizer(optimizer_grouped_parameters,
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 449, in load
    return self.jit_load(verbose)
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/ubuntu/anaconda3/envs/DeepSpeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
[2023-04-19 10:50:23,176] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 41342
[2023-04-19 10:50:23,178] [ERROR] [launch.py:434:sigkill_handler] ['/home/ubuntu/anaconda3/envs/DeepSpeed/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/ubuntu/Project/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
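Editorial note on the root cause: compute_89 is the compute capability of Ada Lovelace GPUs (e.g. the RTX 4090), and nvcc only added support for it in CUDA 11.8, so a CUDA 11.7 toolkit cannot handle the -gencode=arch=compute_89 flags that the op builder derives from the visible GPU. A minimal sketch for confirming the mismatch, assuming the default /usr/local/cuda toolkit path:

/usr/local/cuda/bin/nvcc --version   # toolkit version that the JIT extension build will use
python -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability())"   # torch's CUDA build and the GPU's compute capability

If the capability printed is (8, 9) but the toolkit is older than 11.8, either upgrade the toolkit (or point CUDA_HOME at an 11.8+ install), or set TORCH_CUDA_ARCH_LIST (e.g. "8.6") before launching; PyTorch's extension builder honors that variable, and the compute_86 PTX can still JIT-compile for an 8.9 card, though upgrading the toolkit is the cleaner fix.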

LanShanPi avatar Apr 19 '23 03:04 LanShanPi

I executed the following commands to set up the anaconda environment:
pip install "deepspeed>=0.9.0"
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

LanShanPi avatar Apr 19 '23 03:04 LanShanPi

Running with CUDA 11.7 on Ubuntu 20.04.

LanShanPi avatar Apr 19 '23 03:04 LanShanPi

I can run this script successfully with CUDA 11.3 and PyTorch 1.12.0.
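For reference, a hedged sketch of installing that combination from the PyTorch wheel index (the 1.12.0+cu113 build appears in the version list quoted later in this thread; adjust to your own environment):

pip install torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html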

Flywolfs avatar Apr 19 '23 03:04 Flywolfs

I executed the following commands to set up the anaconda environment:
pip install "deepspeed>=0.9.0"
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

Have you tried installing from source? I suspect the deepspeed pip package is not in sync with the latest dev version. You might want to install from source and see if that works.
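A hedged sketch of what a source install could look like (the repository URL is the main DeepSpeed repo; the DS_BUILD_* flags pre-compile the CUDA ops at install time instead of JIT-building them on first use, which surfaces toolkit problems like this one immediately):

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_FUSED_ADAM=1 pip install .   # pre-builds the fused_adam op; fails fast if nvcc cannot target your GPU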

alibabadoufu avatar Apr 19 '23 04:04 alibabadoufu

I'm hitting the same issue.

boyuzz avatar Apr 19 '23 08:04 boyuzz

@Flywolfs I am trying to match your version.

LanShanPi avatar Apr 19 '23 10:04 LanShanPi

Running with CUDA 11.7 on Ubuntu 20.04.


I got the same problem and solved it by making the torch version match the CUDA version. It worked for me with CUDA 11.3 + torch 1.12. From your log you have CUDA 11.7, so maybe you can try:

pip install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html

from versions: 0.4.1, 0.4.1.post2, 1.0.0, 1.0.1, 1.0.1.post2, 1.1.0, 1.2.0, 1.2.0+cpu, 1.2.0+cu92, 1.3.0, 1.3.0+cpu, 1.3.0+cu100, 1.3.0+cu92, 1.3.1, 1.3.1+cpu, 1.3.1+cu100, 1.3.1+cu92, 1.4.0, 1.4.0+cpu, 1.4.0+cu100, 1.4.0+cu92, 1.5.0, 1.5.0+cpu, 1.5.0+cu101, 1.5.0+cu92, 1.5.1, 1.5.1+cpu, 1.5.1+cu101, 1.5.1+cu92, 1.6.0, 1.6.0+cpu, 1.6.0+cu101, 1.6.0+cu92, 1.7.0, 1.7.0+cpu, 1.7.0+cu101, 1.7.0+cu110, 1.7.0+cu92, 1.7.1, 1.7.1+cpu, 1.7.1+cu101, 1.7.1+cu110, 1.7.1+cu92, 1.7.1+rocm3.7, 1.7.1+rocm3.8, 1.8.0, 1.8.0+cpu, 1.8.0+cu101, 1.8.0+cu111, 1.8.0+rocm3.10, 1.8.0+rocm4.0.1, 1.8.1, 1.8.1+cpu, 1.8.1+cu101, 1.8.1+cu102, 1.8.1+cu111, 1.8.1+rocm3.10, 1.8.1+rocm4.0.1, 1.9.0, 1.9.0+cpu, 1.9.0+cu102, 1.9.0+cu111, 1.9.0+rocm4.0.1, 1.9.0+rocm4.1, 1.9.0+rocm4.2, 1.9.1, 1.9.1+cpu, 1.9.1+cu102, 1.9.1+cu111, 1.9.1+rocm4.0.1, 1.9.1+rocm4.1, 1.9.1+rocm4.2, 1.10.0, 1.10.0+cpu, 1.10.0+cu102, 1.10.0+cu111, 1.10.0+cu113, 1.10.0+rocm4.0.1, 1.10.0+rocm4.1, 1.10.0+rocm4.2, 1.10.1, 1.10.1+cpu, 1.10.1+cu102, 1.10.1+cu111, 1.10.1+cu113, 1.10.1+rocm4.0.1, 1.10.1+rocm4.1, 1.10.1+rocm4.2, 1.10.2, 1.10.2+cpu, 1.10.2+cu102, 1.10.2+cu111, 1.10.2+cu113, 1.10.2+rocm4.0.1, 1.10.2+rocm4.1, 1.10.2+rocm4.2, 1.11.0, 1.11.0+cpu, 1.11.0+cu102, 1.11.0+cu113, 1.11.0+cu115, 1.11.0+rocm4.3.1, 1.11.0+rocm4.5.2, 1.12.0, 1.12.0+cpu, 1.12.0+cu102, 1.12.0+cu113, 1.12.0+cu116, 1.12.0+rocm5.0, 1.12.0+rocm5.1.1, 1.12.1, 1.12.1+cpu, 1.12.1+cu102, 1.12.1+cu113, 1.12.1+cu116, 1.12.1+rocm5.0, 1.12.1+rocm5.1.1, 1.13.0, 1.13.0+cpu, 1.13.0+cu116, 1.13.0+cu117, 1.13.0+cu117.with.pypi.cudnn, 1.13.0+rocm5.1.1, 1.13.0+rocm5.2, 1.13.1, 1.13.1+cpu, 1.13.1+cu116, 1.13.1+cu117, 1.13.1+cu117.with.pypi.cudnn, 1.13.1+rocm5.1.1, 1.13.1+rocm5.2)

More details can be found in this blog post: deepspeed chat
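Once the reinstall finishes, a quick sanity check (a sketch; the versions shown are the combination suggested above, and yours may differ):

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # expect something like 1.13.0+cu117 / 11.7
nvcc --version   # the system toolkit that DeepSpeed's JIT ops will be compiled with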

chenyangMl avatar Apr 24 '23 14:04 chenyangMl

+1, same question. @Flywolfs @chenyangMl Could you please post your pip list output? I only changed to CUDA 11.3 + torch 1.12 and still got errors. Thanks.

Doraemon20190612 avatar Apr 30 '23 19:04 Doraemon20190612

Same issue here.

houwenxin avatar May 16 '23 15:05 houwenxin

Try export CUDA_HOME=/usr/local/cuda-xxxx
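A hedged expansion of that suggestion (the 11.8 path below is only an example; point CUDA_HOME at whichever installed toolkit actually supports your GPU's architecture):

export CUDA_HOME=/usr/local/cuda-11.8   # example path, not necessarily present on your machine
export PATH=$CUDA_HOME/bin:$PATH
$CUDA_HOME/bin/nvcc --version   # confirm this is the nvcc the extension build will pick up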

LeeChongKeat avatar Jun 01 '23 05:06 LeeChongKeat

Closing this issue for now - if there are new installation issues, please feel free to open new tickets.

loadams avatar Jun 12 '23 21:06 loadams