
Compilation Failed

jchang98 opened this issue 2 years ago • 6 comments

  • python 3.7.12
  • pytorch 1.11.0+cu102
  • gcc 5.4

I have modified the cloneable.h file according to the FAQs section, but I still encounter the following error when running the program. How can I fix it?

 
```
Traceback (most recent call last):
  File "/home/env/nat/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1746, in _run_ninja_build
    env=env)
  File "/home/env/nat/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError: Error building extension 'dag_loss_fn': [1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/env/nat/lib/python3.7/site-packages/torch/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/TH -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/env/nat/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -std=c++14 -c /home/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu -o logsoftmax_gather.cuda.o
FAILED: logsoftmax_gather.cuda.o
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/env/nat/lib/python3.7/site-packages/torch/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/TH -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/env/nat/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -std=c++14 -c /home/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu -o logsoftmax_gather.cuda.o
/home/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:31:23: fatal error: cub/cub.cuh: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed
```
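For reference, a quick way to see which include directories PyTorch hands to nvcc, and whether any of them actually provides cub/cub.cuh, is a short check like the following. This is a minimal diagnostic sketch; `include_paths(cuda=True)` matches the PyTorch 1.x signature.

```python
# Minimal diagnostic sketch: list the include directories PyTorch passes to
# nvcc and check whether any of them contains cub/cub.cuh.
# Note: include_paths(cuda=True) is the PyTorch 1.x signature; newer PyTorch
# versions take a device_type string instead.
import os
from torch.utils.cpp_extension import include_paths

for d in include_paths(cuda=True):
    header = os.path.join(d, "cub", "cub.cuh")
    status = "found  " if os.path.exists(header) else "missing"
    print(f"{status} {header}")
```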

jchang98 avatar Jul 21 '22 10:07 jchang98

It seems that PyTorch 1.11 removed cub from its default include directories. A direct workaround is to use PyTorch 1.10.

I am trying to include cub in PyTorch 1.11 and will update this issue if I find a solution.
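For reference, one way to make a CUB checkout visible to a JIT-built extension is the `extra_include_paths` argument of `torch.utils.cpp_extension.load`. This is a hedged sketch assuming a local clone of https://github.com/NVIDIA/cub; the source list and paths are illustrative, not this repo's actual build code.

```python
# Hedged sketch: point the JIT extension build at a local CUB checkout.
# The clone's root directory contains cub/cub.cuh, so it serves as the
# include path. Source list and paths are illustrative, not the repo's
# actual build code.
from torch.utils.cpp_extension import load

dag_loss_fn = load(
    name="dag_loss_fn",
    sources=["fs_plugins/custom_ops/logsoftmax_gather.cu"],
    extra_include_paths=["/path/to/cub"],  # clone of https://github.com/NVIDIA/cub
    verbose=True,
)
```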

hzhwcmhf avatar Jul 21 '22 10:07 hzhwcmhf

> It seems that PyTorch 1.11 removed cub from its default include directories. A direct workaround is to use PyTorch 1.10.
>
> I am trying to include cub in PyTorch 1.11 and will update this issue if I find a solution.

I tried reinstalling PyTorch 1.10.1, but it doesn't work (T⌓T)

jchang98 avatar Jul 21 '22 11:07 jchang98

I am trying to reproduce your environment... it may take some time before I can find a solution.

If possible, you can also try CUDA >= 11.0, which bundles cub with the toolkit. Or just skip the CUDA compilation by adding the following arguments:

```
--torch-dag-loss                  # Use the torch implementation of dag loss instead of the CUDA implementation. It may be slower and consume more memory.
--torch-dag-best-alignment        # Use the torch implementation of best-alignment instead of the CUDA implementation. It may be slower and consume more memory.
--torch-dag-logsoftmax-gather     # Use the torch implementation of logsoftmax-gather instead of the CUDA implementation. It may be slower and consume more memory.
```
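For context, the torch fallback behind the last flag boils down to a log-softmax followed by a gather. A rough sketch of the computation follows; the function name and tensor shapes are illustrative, not this repo's actual code.

```python
# Rough sketch of what a torch implementation of logsoftmax-gather computes.
# A fused CUDA kernel can avoid materializing the full [batch, graph_len,
# vocab] log-probability tensor that this version creates, which is why the
# torch fallback may be slower and use more memory.
import torch

def logsoftmax_gather(logits: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    # logits: [batch, graph_len, vocab]; index: [batch, graph_len, k]
    log_probs = torch.log_softmax(logits, dim=-1)  # normalize over the vocabulary
    return log_probs.gather(dim=-1, index=index)   # scores of the selected token ids
```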

hzhwcmhf avatar Jul 21 '22 11:07 hzhwcmhf

> I am trying to reproduce your environment... it may take some time before I can find a solution.
>
> If possible, you can also try CUDA >= 11.0, which bundles cub with the toolkit. Or just skip the CUDA compilation by adding the following arguments:
>
> --torch-dag-loss                  # Use the torch implementation of dag loss instead of the CUDA implementation. It may be slower and consume more memory.
> --torch-dag-best-alignment        # Use the torch implementation of best-alignment instead of the CUDA implementation. It may be slower and consume more memory.
> --torch-dag-logsoftmax-gather     # Use the torch implementation of logsoftmax-gather instead of the CUDA implementation. It may be slower and consume more memory.

Okay, thanks!

jchang98 avatar Jul 21 '22 12:07 jchang98

@jchang98 I have pushed an update that manually includes the cub library. Please re-clone this repo and try again.

hzhwcmhf avatar Jul 21 '22 13:07 hzhwcmhf

@sudanl Can you run the script with only one GPU (a single process)? The message only says that the CUDA program was not compiled correctly but does not show the real error.

hzhwcmhf avatar Oct 17 '22 12:10 hzhwcmhf