DeepSpeed
[BUG] "Deepspeed: command not found" when I run shell to train my model
Describe the bug
I want to use DeepSpeed from a script, and I installed it with pip:
(base) forestbat@vm-jupyterhub-server:~/BELLE/train$ pip install deepspeed
Defaulting to user installation because normal site-packages is not writeable
Collecting deepspeed
Using cached deepspeed-0.9.0-py3-none-any.whl
……
Successfully installed deepspeed-0.9.0
But when I try to run my shell script, it gives me this error:
(base) forestbat@vm-jupyterhub-server:~/BELLE/train$ bash training_scripts/single_node/run_FT.sh
training_scripts/single_node/run_FT.sh: line 17: deepspeed: command not found
and there is no deepspeed in my conda list.
This is my training script:
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=3
fi
mkdir -p $OUTPUT
#bigscience/bloomz-1b7
deepspeed main.py \
    --sft_only_data_path BELLE/train_2M_CN.json \
    --model_name_or_path dalai/alpaca/models/7B/ggml-model-q4_0.bin \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 2 \
    --max_seq_len 1024 \
    --learning_rate 5e-6 \
    --weight_decay 0.0001 \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 100 \
    --seed 1234 \
    --gradient_checkpointing \
    --zero_stage $ZERO_STAGE \
    --deepspeed \
    --output_dir $OUTPUT \
    # &> $OUTPUT/training.log
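For reference, the two positional arguments in the script above are OUTPUT and ZERO_STAGE, so a typical invocation (matching the script's own defaults) looks like:

# write checkpoints to ./output and use ZeRO stage 3 (both are the defaults anyway)
bash training_scripts/single_node/run_FT.sh ./output 3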
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/forestbat/.local/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/forestbat/.local/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
System info (please complete the following information):
- OS: Ubuntu 20.04
- Python version: 3.9
@forestbat thanks for reporting this. Could you please run which deepspeed to determine the location of the DeepSpeed executable and share that? It would appear that the executable has not been added to your PATH; however, you were able to run ds_report, which is another executable script that DeepSpeed installs. Do you use bash, zsh, csh, or some alternative?
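Since the pip output above shows a fallback to a user installation, the deepspeed console script most likely ended up in the per-user scripts directory rather than in the conda environment. A minimal check, assuming a user-site install on Linux, could be:

# find the per-user scripts directory that pip --user installs into (typically ~/.local/bin)
USER_BIN="$(python -m site --user-base)/bin"
ls -l "$USER_BIN/deepspeed" "$USER_BIN/ds_report"    # are the entry-point scripts there?
echo "$PATH" | tr ':' '\n' | grep -Fx "$USER_BIN" \
  || echo "$USER_BIN is not on PATH"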
In fact I can't run ds_report; the report I posted above was generated by python -m deepspeed.env_report. And which deepspeed prints nothing:
(base) forestbat@vm-jupyterhub-server:~$ which deepspeed
(base) forestbat@vm-jupyterhub-server:~$
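A possible workaround sketch, assuming the scripts really did land in ~/.local/bin: put that directory on PATH, or call the launcher module directly (as far as I know, the deepspeed command is a thin wrapper around deepspeed.launcher.runner):

# make the pip --user scripts directory visible to this shell, and persist it for new shells
export PATH="$HOME/.local/bin:$PATH"
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
which deepspeed    # should now print ~/.local/bin/deepspeed

# alternative: launch via the module instead of the console script
python -m deepspeed.launcher.runner main.py --deepspeed ...    # rest of the run_FT.sh arguments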
I switched to a new conda environment, and now it works correctly.
I would like to know how you did this on Ubuntu.
Run conda init bash, then open a new shell or source ~/.bashrc, and finally conda activate xxx to switch into the new environment.
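Putting the fix together as a sketch (the environment name ds-train and the Python version are placeholders, not from this thread):

# create a clean environment and make sure conda's shell hook is installed
conda create -n ds-train python=3.9 -y
conda init bash     # writes the activation hook into ~/.bashrc
source ~/.bashrc    # or open a new terminal
conda activate ds-train

# install DeepSpeed into the environment itself, so its console scripts land on the env's PATH
pip install deepspeed
which deepspeed     # should point inside the conda environment
ds_report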