DeepSpeed
[BUG] RuntimeError: Error building extension 'utils' (`ninja` related?)
Describe the bug
As shown in this notebook, I ran these commands:
pip install deepspeed --upgrade
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples/model_compression/gpt2
pip install -r requirements.txt
sudo apt-get install ninja-build # I don't think this line is actually needed, but I'm not sure
pip install ninja
bash ./bash_script/run_zero_quant.sh
This follows the instructions in the README of DeepSpeedExamples/tree/master/model_compression/gpt2 exactly, except that I had to install ninja because the machine didn't have it yet.
After some progress, the run_zero_quant.sh script throws RuntimeError: Error building extension 'utils' (please see the notebook for full logs).
To Reproduce
Steps to reproduce the behavior:
- Run this notebook: https://gist.github.com/josephrocca/9ec65e8e5804286a475b5b6da85f7a28
Expected behavior
There is a related issue here:
- https://github.com/microsoft/DeepSpeed/issues/694
The apparent solution there was to ensure that the deepspeed wheel was built with the same CUDA version as the machine has installed. But the ds_report output shows that the versions match. So I guess the "expected behavior" here is that it shouldn't throw the error that I'm seeing.
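For anyone who wants to double-check the version match by hand, here is a minimal Python sketch of the comparison. It is not DeepSpeed's own check. The helper names `nvcc_release` and `cuda_versions_match` are hypothetical, and it assumes that `nvcc --version` prints a line containing `release X.Y`.

```python
import re
from typing import Optional


def nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'X.Y' release number from `nvcc --version` output.

    Assumes the output contains text like 'release 11.6'; returns None otherwise.
    """
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(torch_cuda: str, nvcc_output: str) -> bool:
    """Compare the major.minor CUDA version reported by torch and by nvcc."""
    release = nvcc_release(nvcc_output)
    if release is None:
        return False
    return release.split(".")[:2] == torch_cuda.split(".")[:2]


# Illustrative sample line, not copied from the notebook logs.
sample_nvcc = "Cuda compilation tools, release 11.6, V11.6.124"
print(cuda_versions_match("11.6", sample_nvcc))
```

On a real machine, the first argument would come from `torch.version.cuda` and the second from the text printed by `nvcc --version`.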
ds_report output
As seen in the above-linked notebook:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/lib/python3/dist-packages/torch']
torch version .................... 1.11.0
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.6
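To see at a glance which ops the report marks as not compatible, the op table can be parsed with a short script. This is only a sketch: `parse_op_report` and `incompatible_ops` are hypothetical helpers, not part of DeepSpeed, and the regular expression assumes the `name .... [installed] .... [compatible]` row layout shown in the report above.

```python
import re
from typing import Dict, List, Tuple

# Matches rows such as: "utils .................. [NO] ....... [OKAY]"
ROW = re.compile(r"^(\w+)\s+\.+\s+\[(\w+)\]\s+\.+\s+\[(\w+)\]$")


def parse_op_report(text: str) -> Dict[str, Tuple[bool, bool]]:
    """Map each op name to (installed, compatible) flags.

    Lines that do not look like op rows (warnings, separators, the ninja
    line, environment info) are skipped.
    """
    ops: Dict[str, Tuple[bool, bool]] = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            # Treat only [YES] / [OKAY] as positive; anything else counts as False.
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


def incompatible_ops(text: str) -> List[str]:
    """Return the names of ops whose 'compatible' column is not [OKAY]."""
    return [name for name, (_, ok) in parse_op_report(text).items() if not ok]


sample = """\
ninja .................. [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
"""
print(incompatible_ops(sample))
```

In the report above, `utils` is listed as compatible, so this table alone does not explain the build failure.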
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: 1x RTX 6000 (24GB)
- Python version: 3.8
- Any other relevant info about your setup: I used https://lambdalabs.com/ GPU cloud (using their Cloud IDE)
Hi @josephrocca, thanks for using DeepSpeed. Could you try pre-compiling and let me know the outcome? To do so:
- Uninstall DeepSpeed
pip uninstall -y deepspeed
- Clone our repo
git clone https://github.com/microsoft/DeepSpeed.git && cd DeepSpeed
- Install with either
DS_BUILD_OPS=1 pip install .
or
DS_BUILD_UTILS=1 pip install .
(read more here: https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops)
Hi @mrwyattii, I tried both the DS_BUILD_OPS option and the DS_BUILD_UTILS option on a fresh Lambda Cloud machine, and both gave errors. Please see here for the full error logs of both attempts: https://gist.github.com/josephrocca/8417c4665cbfef89ba85e439c17500da
Solution?
I see this error message in the gist log. Can you confirm that pybind11 is installed?
This looks to have been pybind11-related. If you are still having issues with this, please re-open.
sudo apt install python3-pybind11
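A quick way to confirm whether pybind11 and ninja are visible to the Python environment before re-running the build is sketched below. `missing_build_deps` is a hypothetical helper, not a DeepSpeed function. It only checks that the `pybind11` Python package can be found and that a `ninja` executable is on `PATH`, so an empty result does not guarantee that the extension will compile.

```python
import importlib.util
import shutil
from typing import List


def missing_build_deps() -> List[str]:
    """Return the names of build prerequisites that cannot be found."""
    missing = []
    # find_spec returns None when the top-level package is not importable.
    if importlib.util.find_spec("pybind11") is None:
        missing.append("pybind11")
    # shutil.which returns None when no matching executable is on PATH.
    if shutil.which("ninja") is None:
        missing.append("ninja")
    return missing


if __name__ == "__main__":
    missing = missing_build_deps()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("pybind11 and ninja were both found")
```

Run it with the same interpreter that runs DeepSpeed, because a per-user or virtualenv install may see different packages than the system Python.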
On Windows, I downloaded the matching package from https://pypi.org/project/deepspeed/#files, extracted it, and placed it directly in my virtual environment, which worked.