No module named 'fastfold_softmax_cuda'

Open SimonKitSangChu opened this issue 3 years ago • 5 comments

I followed the Anaconda installation instructions but get an error about fastfold_softmax_cuda. The machine has CUDA 11.6.2 installed.

Colossalai should be built with cuda extension to use the FP16 optimizer                                                                                                                                           
If you want to activate cuda mode for MoE, please install with cuda_ext!                                                                                                                                           
Traceback (most recent call last):                                                                                                                                                                                 
  File "inference.py", line 25, in <module>                                                                                                                                                                        
    from fastfold.model.hub import AlphaFold                                                                                                                                                                       
  File "/scratch/FastFold/fastfold/model/hub/__init__.py", line 1, in <module>                                                                                                                                     
    from .alphafold import AlphaFold                                                                                                                                                                               
  File "/scratch/FastFold/fastfold/model/hub/alphafold.py", line 20, in <module>
    from fastfold.utils.feats import (
  File "/scratch/FastFold/fastfold/utils/__init__.py", line 1, in <module>
    from .inject_fastnn import inject_fastnn
  File "/scratch/FastFold/fastfold/utils/inject_fastnn.py", line 9, in <module>
    from fastfold.model.fastnn import MSAStack, OutProductMean, PairStack
  File "/scratch/FastFold/fastfold/model/fastnn/__init__.py", line 1, in <module>
    from .msa import MSAStack
  File "/scratch/FastFold/fastfold/model/fastnn/msa.py", line 6, in <module>
    from fastfold.model.fastnn.kernel import LayerNorm
  File "/scratch/FastFold/fastfold/model/fastnn/kernel/__init__.py", line 3, in <module>
    from .cuda_native.softmax import softmax, scale_mask_softmax, scale_mask_bias_softmax
  File "/scratch/FastFold/fastfold/model/fastnn/kernel/cuda_native/softmax.py", line 7, in <module>
    fastfold_softmax_cuda = importlib.import_module("fastfold_softmax_cuda")
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named 'fastfold_softmax_cuda'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2184436) of binary: /share/siegellab/software/kschu/anaconda3/envs/fastfold/bin/python
Traceback (most recent call last):
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
inference.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-04_06:49:56
  host      : kakawa-1
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2184436)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

SimonKitSangChu avatar May 04 '22 06:05 SimonKitSangChu

Did you run python setup.py install in the FastFold folder? Alternatively, you can attach the installation log.
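A quick way to confirm whether the compiled extensions are present is to ask Python whether it can locate them. The following is a minimal sketch, not part of FastFold: the helper name `missing_modules` is hypothetical, and the two module names are the only ones that appear in the logs of this thread, so the real list built by setup.py may be longer.

```python
import importlib.util

# Extension names taken from the logs in this thread; the full list that
# FastFold's setup.py builds may be longer (assumption).
EXPECTED_EXTENSIONS = ["fastfold_softmax_cuda", "fastfold_layer_norm_cuda"]


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_EXTENSIONS)
    if missing:
        print("Not installed:", ", ".join(missing))
        print("Try running python setup.py install in the FastFold folder.")
    else:
        print("All expected CUDA extensions can be located.")
```

If the script reports missing modules inside the same conda environment used for inference, the extensions were most likely never built or were installed into a different environment.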

Shenggan avatar May 04 '22 07:05 Shenggan

@Shenggan Thanks for your prompt reply. I thought the conda install command would suffice. However, I get a different CUDA error when installing through setup.py.

python setup.py install                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                  
torch.__version__  = 1.11.0                                                                                                                                                                                                                                                       
                                                                                                                                         
                                                                                                                                         
                                                                    
Compiling cuda extensions with                                                                                                           
nvcc: NVIDIA (R) Cuda compiler driver                                                                                                    
Copyright (c) 2005-2019 NVIDIA Corporation                                                                                               
Built on Sun_Jul_28_19:07:16_PDT_2019                                                                                                    
Cuda compilation tools, release 10.1, V10.1.243                                                                                          
from /usr/bin                                                                                                                            
                                                                                                                                         
Traceback (most recent call last):                                                                                                       
  File "setup.py", line 90, in <module>                                                                                                  
    check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)                                                                                                                                                                                                                              
  File "setup.py", line 32, in check_cuda_torch_binary_vs_bare_metal         
    raise RuntimeError(                                                                                                                                                                                                                                                           
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.5.
In some cases, a minor-version mismatch will not cause later errors:  https://github.com/NVIDIA/apex/pull/323#discussion_r287021798.  You can try commenting out this check (at your own risk).

Removing the check results in another error.

running build_ext
Traceback (most recent call last):
  File "setup.py", line 130, in <module>
    setup(
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 410, in build_extensions
    self._check_cuda_version()
  File "/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 787, in _check_cuda_version
    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
RuntimeError: 
The detected CUDA version (10.1) mismatches the version that was used to compile
PyTorch (11.5). Please make sure to use the same CUDA versions.

Although my nvcc version is 10.1, my cuda version is 11.5.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Under /usr/local/cuda/version.json,

{
   "cuda" : {
      "name" : "CUDA SDK",
      "version" : "11.6.2"
   },
...

SimonKitSangChu avatar May 04 '22 18:05 SimonKitSangChu

FastFold needs to compile CUDA extensions for its high-performance kernels. From the log, it appears that your torch build requires CUDA 11.5, so you need a matching CUDA environment. nvcc should also report a matching CUDA version.
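The version comparison described here can be checked before starting a build. The sketch below is illustrative only: the helper names (`parse_nvcc_release`, `parse_version`, `compare`) are hypothetical, and it inspects the first nvcc found on PATH, which may differ from the one the build picks up through CUDA_HOME.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_version(version):
    """Turn a string such as '11.5' into (11, 5)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def compare(nvcc_release, torch_cuda):
    """Classify the relationship between two (major, minor) tuples."""
    if nvcc_release == torch_cuda:
        return "match"
    if nvcc_release[0] == torch_cuda[0]:
        return "minor mismatch"
    return "major mismatch"


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check

        torch_cuda = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        torch_cuda = None
    nvcc = shutil.which("nvcc")
    if nvcc and torch_cuda:
        out = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
        release = parse_nvcc_release(out)
        if release:
            print(nvcc, release, torch_cuda,
                  compare(release, parse_version(torch_cuda)))
    else:
        print("nvcc or a CUDA-enabled torch was not found")
```

In this thread, nvcc 10.1 against a torch built with CUDA 11.5 is a major mismatch and the build raised an error, while 11.6 against 11.5 is a minor mismatch and only produced a warning.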

Shenggan avatar May 05 '22 01:05 Shenggan

After updating nvcc to match the CUDA version, there is a new error about the GNU (gcc) version.

python setup.py install


torch.__version__  = 1.11.0


/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/dist.py:490: UserWarning: Normalizing '0.1.0-beta' to '0.1.0b0'
  warnings.warn(tmpl.format(**locals()))
running install
/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/setuptools/command/easy_install.py:156: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing fastfold.egg-info/PKG-INFO
writing dependency_links to fastfold.egg-info/dependency_links.txt
writing requirements to fastfold.egg-info/requires.txt
writing top-level names to fastfold.egg-info/top_level.txt
/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/utils/cpp_extension.py:387: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'fastfold.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastfold.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/utils/cpp_extension.py:788: UserWarning: The detected CUDA version (11.6) has a minor version mismatch with the version that was used to compile PyTorch (11.5). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'fastfold_layer_norm_cuda' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/fastfold
creating build/temp.linux-x86_64-3.8/fastfold/model
creating build/temp.linux-x86_64-3.8/fastfold/model/fastnn
creating build/temp.linux-x86_64-3.8/fastfold/model/fastnn/kernel
creating build/temp.linux-x86_64-3.8/fastfold/model/fastnn/kernel/cuda_native
creating build/temp.linux-x86_64-3.8/fastfold/model/fastnn/kernel/cuda_native/csrc
gcc -pthread -B /share/siegellab/software/kschu/anaconda3/envs/fastfold/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /share/siegellab/software/kschu/anaconda3/envs/fastfold/include -fPIC -O2 -isystem /share/siegellab/software/kschu/anaconda3/envs/fastfold/include -fPIC -I/scratch/FastFold/fastfold/model/fastnn/kernel/cuda_native/csrc/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include/TH -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include/THC -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/include/python3.8 -c fastfold/model/fastnn/kernel/cuda_native/csrc/layer_norm_cuda.cpp -o build/temp.linux-x86_64-3.8/fastfold/model/fastnn/kernel/cuda_native/csrc/layer_norm_cuda.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=fastfold_layer_norm_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
/share/siegellab/software/kschu/anaconda3/envs/fastfold/bin/nvcc -I/scratch/FastFold/fastfold/model/fastnn/kernel/cuda_native/csrc/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include/TH -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/include/THC -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/include -I/share/siegellab/software/kschu/anaconda3/envs/fastfold/include/python3.8 -c fastfold/model/fastnn/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu -o build/temp.linux-x86_64-3.8/fastfold/model/fastnn/kernel/cuda_native/csrc/layer_norm_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -std=c++14 -maxrregcount=50 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=fastfold_layer_norm_cuda -D_GLIBCXX_USE_CXX11_ABI=0
In file included from /usr/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from /usr/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
In file included from /usr/include/cuda_runtime.h:83,
                 from <command-line>:
/usr/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
  138 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      |  ^~~~~
error: command '/share/siegellab/software/kschu/anaconda3/envs/fastfold/bin/nvcc' failed with exit status 255

The gcc and nvcc versions are now:

gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

SimonKitSangChu avatar May 05 '22 05:05 SimonKitSangChu

I think this can be solved by downgrading GCC from 9.4.0 to 8.x.
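To see whether the host compiler is within the accepted range before rebuilding, the gcc major version can be read from its version banner. This is a rough sketch with hypothetical helper names. The limit of 8 is copied from the error message above, which was raised by /usr/include/crt/host_config.h, so a different CUDA installation may enforce a different limit. The banner format also varies between distributions, and the pattern here is only known to fit the Ubuntu-style line shown in this thread.

```python
import re
import shutil
import subprocess

# Limit quoted in the error message from this thread; an assumption for
# any other CUDA installation.
MAX_GCC_MAJOR = 8


def gcc_major(version_text):
    """Return the major version from the first line of `gcc --version`."""
    lines = version_text.splitlines()
    if not lines:
        return None
    # Match a trailing x.y.z, as in "gcc (Ubuntu 9.4.0-...) 9.4.0".
    match = re.search(r"(\d+)\.\d+\.\d+\s*$", lines[0])
    return int(match.group(1)) if match else None


def is_supported(major, max_major=MAX_GCC_MAJOR):
    """True if the major version is known and within the limit."""
    return major is not None and major <= max_major


if __name__ == "__main__":
    gcc = shutil.which("gcc")
    if gcc is None:
        print("gcc was not found on PATH")
    else:
        banner = subprocess.run(
            [gcc, "--version"], capture_output=True, text=True
        ).stdout
        major = gcc_major(banner)
        print("gcc major version:", major, "supported:", is_supported(major))
```

For the gcc 9.4.0 banner posted above, the check reports an unsupported compiler, which is consistent with the build error.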

Shenggan avatar May 05 '22 06:05 Shenggan