Apex installation failed

Open Teng-xu opened this issue 2 years ago • 10 comments

I was trying to install apex in a Dockerfile (Python 3.6, CUDA 11.1) with the following commands

RUN git clone https://github.com/NVIDIA/apex && \
        cd apex && \
        pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

and I got the following errors. The same build worked 2 days ago but fails now, and the failure seems to be related to fused_dense_cuda.cu

    csrc/fused_dense_cuda.cu(415): error: identifier "CUBLASLT_EPILOGUE_GELU_AUX" is undefined
    csrc/fused_dense_cuda.cu(427): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER" is undefined
    csrc/fused_dense_cuda.cu(428): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD" is undefined
    csrc/fused_dense_cuda.cu(435): error: identifier "CUBLASLT_EPILOGUE_GELU_AUX_BIAS" is undefined
    csrc/fused_dense_cuda.cu(555): error: identifier "CUBLASLT_EPILOGUE_GELU_AUX" is undefined
    csrc/fused_dense_cuda.cu(567): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER" is undefined
    csrc/fused_dense_cuda.cu(568): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD" is undefined
    csrc/fused_dense_cuda.cu(575): error: identifier "CUBLASLT_EPILOGUE_GELU_AUX_BIAS" is undefined
    csrc/fused_dense_cuda.cu(687): error: identifier "CUBLASLT_EPILOGUE_BGRADB" is undefined
    csrc/fused_dense_cuda.cu(826): error: identifier "CUBLASLT_EPILOGUE_BGRADB" is undefined
    csrc/fused_dense_cuda.cu(920): error: identifier "CUBLASLT_EPILOGUE_DGELU_BGRAD" is undefined
    csrc/fused_dense_cuda.cu(936): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER" is undefined
    csrc/fused_dense_cuda.cu(940): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD" is undefined
    csrc/fused_dense_cuda.cu(1055): error: identifier "CUBLASLT_EPILOGUE_DGELU_BGRAD" is undefined
    csrc/fused_dense_cuda.cu(1071): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER" is undefined
    csrc/fused_dense_cuda.cu(1075): error: identifier "CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD" is undefined
    csrc/fused_dense_cuda.cu(1203): warning: variable "beta_one" was declared but never referenced
    csrc/fused_dense_cuda.cu(1332): warning: variable "beta_one" was declared but never referenced
    16 errors detected in the compilation of "csrc/fused_dense_cuda.cu".
    error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

Teng-xu avatar Sep 01 '21 21:09 Teng-xu

I got the same error when compiling with python setup.py --cuda_ext --cpp_ext build on Arch Linux with CUDA 11.4. See the full log: python-apex-git.log.txt

hubutui avatar Sep 02 '21 01:09 hubutui

Same error, pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Ubuntu 20.04, Cuda 11.1

P.S. Old Apex june 30 version 0.1 - OK.

Vadim2S avatar Sep 02 '21 11:09 Vadim2S

I have the same problem. It comes from a difference in cuBLAS versions: CUBLASLT_EPILOGUE_GELU_AUX is available in CUDA 11.4 but not in CUDA 11.3. Does anybody know how to go back to an older version of apex using git?
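
A rough way to check whether a given toolkit is affected is to search its cuBLASLt header for the identifier the compiler reports as undefined. This is only a sketch: the header path assumes the default /usr/local/cuda prefix (the same prefix as the nvcc path in the log above), and has_cublaslt_symbol is a hypothetical helper name, not part of apex or CUDA.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the header file exists and mentions the
# identifier as a whole word (so GELU_AUX does not match GELU_AUX_BIAS).
has_cublaslt_symbol() {
    header="$1"
    symbol="$2"
    [ -f "$header" ] && grep -qw "$symbol" "$header"
}

# Assumed default location; override CUBLASLT_HEADER for other install prefixes.
CUBLASLT_HEADER="${CUBLASLT_HEADER:-/usr/local/cuda/include/cublasLt.h}"

if has_cublaslt_symbol "$CUBLASLT_HEADER" CUBLASLT_EPILOGUE_GELU_AUX; then
    echo "CUBLASLT_EPILOGUE_GELU_AUX found in $CUBLASLT_HEADER"
else
    echo "CUBLASLT_EPILOGUE_GELU_AUX not found in $CUBLASLT_HEADER"
fi
```

If the identifier is not found, the current fused_dense_cuda.cu will most likely fail to compile against that toolkit in the way shown above.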

hyungdal avatar Sep 03 '21 08:09 hyungdal

My old apex works; it is 23 commits behind. git reset --hard 0c2c6eea

linyu0219 avatar Sep 03 '21 09:09 linyu0219

Same error, pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Ubuntu 20.04, Cuda 11.1

P.S. Old Apex june 30 version 0.1 - OK.

Useful!

linyu0219 avatar Sep 03 '21 09:09 linyu0219

Thanks:)

git checkout 0c2c6eea6556b208d1a8711197efc94899e754e1 (17th July) works too. I looked in git log for the first version of apex that contains the GeLU function, and with this checkout the install succeeded.

But I recommend installing CUDA 11.4 instead.
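
The pinned-commit workaround can be sketched as a small script that combines the clone and pip commands from the first post with the checkout above. The commit hash is the one quoted in this thread; run and install_pinned_apex are hypothetical helper names, and the script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Commit hash quoted in this thread; override APEX_COMMIT to pin a different one.
APEX_COMMIT="${APEX_COMMIT:-0c2c6eea6556b208d1a8711197efc94899e754e1}"

# Hypothetical helper: with DRY_RUN=1 (the default here) only print each command;
# with DRY_RUN=0 execute it in the current shell.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the clone / checkout / install steps from the thread.
install_pinned_apex() {
    run git clone https://github.com/NVIDIA/apex &&
    run cd apex &&
    run git checkout "$APEX_COMMIT" &&
    run pip install -v --no-cache-dir \
        --global-option="--cpp_ext" --global-option="--cuda_ext" ./
}

install_pinned_apex
```

Set DRY_RUN=0 to actually run the steps. In a Dockerfile the same sequence fits into the existing RUN line by adding the git checkout between cd apex and pip install.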

hyungdal avatar Sep 03 '21 09:09 hyungdal

@seryilmaz it seems your recent change needs a guard for older cuBLAS versions, e.g. in https://github.com/NVIDIA/apex/blob/ae1cdd64314e598b935a8138b3532d4b652a8f12/csrc/fused_dense_cuda.cu#L687

ptrblck avatar Sep 03 '21 20:09 ptrblck

I've merged https://github.com/NVIDIA/apex/pull/1162. Could you pull the latest master and retry the build, please?

ptrblck avatar Sep 04 '21 07:09 ptrblck

@ptrblck I can now compile and build python-apex-git on Arch Linux. Thanks.

hubutui avatar Sep 04 '21 14:09 hubutui

I had the same error even after changing the setup.py file. It installed successfully after switching to CUDA 11.3. This CUDA/cuDNN installation script was very helpful. Here!

guialfaro053 avatar Aug 26 '22 08:08 guialfaro053

My old apex works; it is 23 commits behind. git reset --hard 0c2c6ee

It installs after this version of Apex is pulled.

ck090 avatar Oct 28 '22 05:10 ck090