
Compiling issue on Sagemaker

Open buzzCraft opened this issue 1 year ago • 6 comments

Has anyone had success compiling on SageMaker? There is probably a lot more for me to explore, but I just wanted to check if anyone has faced the same issues.

I tried loading up the standard Python 3.10 image on ml.g4dn.xlarge (Tesla T4), then ran:

```
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
!git clone https://github.com/turboderp/exllama
!pip install -r exllama/requirements.txt
!python exllama/test_benchmark_inference.py -d ./Combined3b -p -ppl
```

The error I get is:

```
Successfully preprocessed all matching files.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/usr/local/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
```

The above exception was the direct cause of the following exception:

```
Traceback (most recent call last):
  File "/root/exllama/test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/root//exllama/model.py", line 12, in <module>
    import cuda_ext
  File "/root/exllama/cuda_ext.py", line 43, in <module>
    exllama_ext = load(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/12] bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/exllama/exllama_ext -isystem /usr/local/lib/python3.10/site-packages/torch/include -isystem /usr/local/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/lib/python3.10/site-packages/torch/include/THH -isystem include -isystem /usr/local/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -lineinfo -U__HIP_NO_HALF_CONVERSIONS__ -O3 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /root//exllama/exllama_ext/hip_func/column_remap.hip -o column_remap.cuda.o
FAILED: column_remap.cuda.o
```

buzzCraft avatar Jun 26 '23 10:06 buzzCraft

I'm not familiar with Sagemaker, but `subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.` implies that the ninja build tool isn't installed. If you add `pip install ninja`, does it make things work?
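For what it's worth, torch's JIT extension loader shells out to the `ninja` binary, so it has to be on the PATH of the Python process doing the build. A minimal sketch of that check (`ninja_on_path` is just an illustrative helper, not part of torch):

```python
# Sketch: torch.utils.cpp_extension invokes `ninja` via subprocess,
# so the binary must be on PATH for the Python process doing the JIT build.
import shutil

def ninja_on_path():
    """Illustrative check mirroring what torch does before building."""
    return shutil.which("ninja") is not None

print("ninja found" if ninja_on_path() else "ninja missing: try `pip install ninja`")
```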

allenbenz avatar Jun 26 '23 17:06 allenbenz

I'll give it a try tomorrow and report back. Thank you for the suggestion.

Ok, so I got ninja, but no build.ninja:

```
$ ninja -v
ninja: error: loading 'build.ninja': No such file or directory
$ ninja --version
1.11.1.git.kitware.jobserver-1
```

I'll keep on digging when I have some time off.

Edit: By the way, I tried running it locally, and it worked without any additional configuration, except for the limitation of my 4GB GPU, of course. It seems the problem might be a permissions issue with the torch folder I have on Sagemaker.
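If it really is a permissions problem, a quick probe of the extension build cache might confirm it. This is only a sketch: `dir_is_writable` is a hypothetical helper, and `TORCH_EXTENSIONS_DIR` is the environment variable `torch.utils.cpp_extension` honors for relocating its build cache.

```python
# Sketch: torch's JIT loader writes build.ninja and object files into its
# extensions cache; if that directory isn't writable, the build fails.
import os
import tempfile

def dir_is_writable(path):
    """Illustrative probe: try creating a temp file inside `path`."""
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False

cache = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)
print(cache, "is writable" if dir_is_writable(cache) else "is NOT writable")
```

If the default cache turns out to be read-only, exporting `TORCH_EXTENSIONS_DIR` to a writable location before running the benchmark may be a workaround.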

buzzCraft avatar Jun 26 '23 18:06 buzzCraft

Is your sagemaker instance running linux?

Maybe you need to also include:

!sudo apt-get install -y ninja-build

kkotsche1 avatar Jun 27 '23 22:06 kkotsche1

Unsure why you are using the ROCm torch version when you have an NVIDIA Tesla T4, but try using the normal CUDA version.

jmoney7823956789378 avatar Jun 27 '23 22:06 jmoney7823956789378

Is your sagemaker instance running linux?

Maybe you need to also include:

!sudo apt-get install -y ninja-build

Installed ninja-build, but it didn't solve the issue.

Unsure why you are using the ROCm torch version when you have an NVIDIA Tesla T4, but try using the normal CUDA version.

Worth a try. In my experience, setting up CUDA to work on Sagemaker can be a bit tricky, so I tend to go for the pre-built GPU-optimized images.

Thanks for helping out

buzzCraft avatar Jun 28 '23 06:06 buzzCraft

No problem. Not sure what the other guy meant with ninja... the fact that you're getting an error code and message from ninja means you have it installed, at the very least. As far as I know, ROCm torch isn't meant for NVIDIA cards; NVIDIA cards get their own special treatment with CUDA :)
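One quick way to tell which flavor of torch actually got installed is to inspect `torch.version` (ROCm wheels set `torch.version.hip`, CUDA wheels set `torch.version.cuda`). A hedged sketch; `torch_build_flavor` is just an illustrative helper:

```python
# Sketch: report which torch build is installed, without assuming torch exists.
import importlib.util

def torch_build_flavor():
    """Best-effort guess at the installed torch flavor (illustrative helper)."""
    if importlib.util.find_spec("torch") is None:
        return "not installed"
    import torch
    if getattr(torch.version, "hip", None):   # set on ROCm wheels
        return f"rocm {torch.version.hip}"    # wrong build for a Tesla T4
    if torch.version.cuda:                    # set on CUDA wheels
        return f"cuda {torch.version.cuda}"
    return "cpu-only"

print(torch_build_flavor())
```

If this reports a ROCm build on the T4 instance, reinstalling from the default (CUDA) PyTorch index rather than the `rocm5.4.2` index should be the first fix to try.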

jmoney7823956789378 avatar Jun 28 '23 10:06 jmoney7823956789378

Closing the issue for now, since it's a run-environment issue. I'll update with the solution if I get around to fixing it.

buzzCraft avatar Jul 03 '23 07:07 buzzCraft