apex not supporting CUDA 11.0? [Help me]

Open MuruganR96 opened this issue 4 years ago • 24 comments

My nvcc version is CUDA 11.0, but the latest PyTorch version I found on this website is built for CUDA 10.2. As a result, I can't install apex properly.

ImportError: cannot import name 'amp'

Software Versions pre-installed:

Nvidia Driver: 450.51v
CUDA: 11v
cuDNN: 8.0v
Python: 3.8
Docker: 19.03.12v
Nvidia-docker: 2.0v
NGC(Nvidia GPU Cloud) CLI: 1.15.0v

I followed these commands:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

A plain import of apex works:

python -c "import apex"

But in the main program it does not work:

Traceback (most recent call last):
  File "train.py", line 188, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 83, in train
    from apex import amp
ImportError: cannot import name 'amp'

The apex module is not imported correctly.

Please help me to solve this issue @definitelynotmcarilli @thorjohnsen @mcarilli @kexinyu @ptrblck :)

MuruganR96 avatar Nov 07 '20 03:11 MuruganR96

> but I found the pytorch latest version from this website is 10.2

The latest PyTorch binaries can be installed with CUDA11.0 as shown in the install instructions.

Note that mixed-precision training is available in PyTorch directly via torch.cuda.amp, as explained here, and we recommend using the native implementation.
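
For readers who have not used the native API, a minimal sketch of a mixed-precision training loop with torch.cuda.amp might look like the following. The tiny model, the random data, and the `run_demo` helper are placeholders invented for illustration (they are not from this thread), and the sketch falls back to full precision when no CUDA device is available.

```python
# Minimal sketch of native mixed precision with torch.cuda.amp.
# The tiny model and random data below are placeholders for illustration only.
try:
    import torch
except ImportError:  # lets the snippet load even without PyTorch installed
    torch = None


def run_demo(steps=3):
    """Run a few training steps; return the final loss, or None without PyTorch."""
    if torch is None:
        return None
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    model = torch.nn.Linear(16, 4).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # With enabled=False (CPU fallback) the scaler and autocast do nothing.
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    loss = None
    for _ in range(steps):
        inputs = torch.randn(8, 16, device=device)
        targets = torch.randn(8, 4, device=device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_cuda):
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
        scaler.scale(loss).backward()  # scale the loss to limit fp16 underflow
        scaler.step(optimizer)         # unscale gradients, then optimizer.step()
        scaler.update()                # adjust the scale factor for the next step
    return float(loss)


if __name__ == "__main__":
    print(run_demo())
```

Compared with `from apex import amp`, this path needs no separately compiled extension, so it sidesteps the build problems discussed in this issue.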

In case you have trouble building apex, you could use a PyTorch NGC container with CUDA11.1, where PyTorch and apex are installed.

ptrblck avatar Nov 07 '20 06:11 ptrblck

@ptrblck CUDA 11.0 supports MIG. Is this feature available in PyTorch? Do you have any tips?

I hit the following error:

/home/sakaia/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:104: UserWarning:
A100-PCIE-40GB MIG 3g.20gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the A100-PCIE-40GB MIG 3g.20gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "toy_problem.py", line 87, in <module>
    main(args)
  File "toy_problem.py", line 41, in main
    optimizer = FusedAdam(model.parameters())
  File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6.egg/apex/optimizers/fused_adam.py", line 79, in __init__
    raise RuntimeError('apex.optimizers.FusedAdam requires cuda extensions')
RuntimeError: apex.optimizers.FusedAdam requires cuda extensions

sakaia avatar Nov 09 '20 06:11 sakaia

MIG is not PyTorch-specific and can be enabled on your A100.

The error shows that you are using a PyTorch build that doesn't support the compute capability of your A100 (sm_80), so either install the PyTorch binaries with CUDA11.0 or build from source.
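
To make the compatibility rule concrete, here is a simplified sketch of the kind of check behind that warning. It assumes that kernels built for sm_XY run on a device with the same major version and an equal or higher minor version, and it ignores any PTX fallback. The helper names `parse_sm` and `build_supports_device` are hypothetical; only `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are real PyTorch calls.

```python
# Simplified sketch: does a PyTorch build ship kernels usable on a given GPU?
# Assumption: a binary built for sm_XY runs on a device with the same major
# version X and an equal or higher minor version (PTX fallback is ignored).
import re


def parse_sm(arch):
    """Turn 'sm_75' into (7, 5); return None for entries that are not sm_*."""
    match = re.match(r"sm_(\d+)(\d)", arch)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def build_supports_device(capability, arch_list):
    """capability is (major, minor); arch_list is e.g. ['sm_37', 'sm_70']."""
    major, minor = capability
    for arch in arch_list:
        parsed = parse_sm(arch)
        if parsed is None:
            continue  # skip PTX-only entries such as 'compute_80'
        if parsed[0] == major and parsed[1] <= minor:
            return True
    return False


if __name__ == "__main__":
    # Arch list copied from the warning quoted earlier in this thread.
    cuda102_build = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
    print(build_supports_device((8, 0), cuda102_build))  # A100 is sm_80
    try:
        import torch
        if torch.cuda.is_available():
            print(build_supports_device(torch.cuda.get_device_capability(0),
                                        torch.cuda.get_arch_list()))
    except ImportError:
        pass
```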

ptrblck avatar Nov 09 '20 09:11 ptrblck

Thanks, I used pip3 to install. I will switch to another method.

sakaia avatar Nov 09 '20 09:11 sakaia

How can I tell apex to use cuda-11.0? I have both cuda-11.0 and cuda-11.1 installed and it fails to build as it doesn't find cuda-11.0:

    Compiling cuda extensions with
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Mon_Oct_12_20:09:46_PDT_2020
    Cuda compilation tools, release 11.1, V11.1.105
    Build cuda_11.1.TC455_06.29190527_0
    from /usr/local/cuda-11.1/bin

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-yz2qpdod/setup.py", line 152, in <module>
        check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
      File "/tmp/pip-req-build-yz2qpdod/setup.py", line 102, in check_cuda_torch_binary_vs_bare_metal
        raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
    RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.  Pytorch binaries were compiled with Cuda 11.0.

Also would it be possible to make apex builds on conda-forge for cuda11.0 and cuda11.1?

Thank you!

stas00 avatar Nov 11 '20 19:11 stas00

@stas00 you can try setting CUDA_HOME=/usr/local/cuda-11.0 to select the desired CUDA version.
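
Before rebuilding, it can help to confirm which toolkit a given CUDA_HOME actually points at. The following sketch runs the nvcc found under that directory and parses its release number; the helper names and the default path are assumptions for illustration, not part of apex.

```python
# Sketch: report which CUDA release the nvcc under a given CUDA_HOME belongs to.
import os
import re
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nvcc_release(cuda_home):
    """Run <cuda_home>/bin/nvcc --version and parse its release number."""
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    output = subprocess.check_output([nvcc, "--version"], universal_newlines=True)
    return parse_nvcc_release(output)


if __name__ == "__main__":
    # The default path is only an example; adjust it to your installation.
    home = os.environ.get("CUDA_HOME", "/usr/local/cuda-11.0")
    try:
        print(home, nvcc_release(home))
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run nvcc:", exc)
```

If the printed release differs from the CUDA version PyTorch was built with, the build will stop at the version check reported above.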

ptrblck avatar Nov 12 '20 20:11 ptrblck

Awesome!

CUDA_HOME=/usr/local/cuda-11.0 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

but no luck building it:

    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    In file included from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/ATen/Parallel.h:149,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                     from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/extension.h:4,
                     from /tmp/pip-req-build-ngd468_f/csrc/amp_C_frontend.cpp:1:
    /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:84: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
       84 | #pragma omp parallel for if ((end - begin) >= grain_size)
          |
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
        subprocess.run(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 512, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-ngd468_f/setup.py", line 405, in <module>
        setup(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 674, in build_extensions
        build_ext.build_extensions(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
        objects = self.compiler.compile(sources,
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 494, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1252, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
    Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2i80o0cf/install-record.txt --single-version-externally-managed --compile --install-headers /home/stas/anaconda3/envs/main-38/include/python3.8/apex Check the logs for full command output.                                                                                           
Exception information:
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 838, in install
    success = install_legacy(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/operations/install/legacy.py", line 86, in install
    raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 228, in _main
    status = self.run(options, args)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 182, in wrapper
    return func(self, options, args)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 397, in run
    installed = install_given_reqs(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/__init__.py", line 82, in install_given_reqs
    requirement.install(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 856, in install
    six.reraise(*exc.parent)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/six.py", line 703, in reraise
    raise value
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/operations/install/legacy.py", line 74, in install
    runner(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 273, in runner
    call_subprocess(
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2i80o0cf/install-record.txt --single-version-externally-managed --compile --install-headers /home/stas/anaconda3/envs/main-38/include/python3.8/apex Check the logs for full command output.

stas00 avatar Nov 12 '20 20:11 stas00

I don't see the actual error message, only that the build is failing, so I can't tell whether the right CUDA version was found this time. If you've linked the versioned CUDA toolkits to /usr/local/cuda, could you recreate the symbolic link with the desired CUDA version? Alternatively, since only the minor version differs, you could also try to disable the minor version check (you should have gotten an error with a link to more information in your first run) and rebuild it.

ptrblck avatar Nov 12 '20 20:11 ptrblck

> Alternatively, since only the minor version differs, you could also try to disable the minor version check (you should get an error with a link to more information in your first run), and rebuild it.

There is no option to do that, so I had to hack setup.py to disable the check:

diff --git a/setup.py b/setup.py
index 063b42d..9eabb49 100644
--- a/setup.py
+++ b/setup.py
@@ -91,6 +91,7 @@ def get_cuda_bare_metal_version(cuda_dir):
     return raw_output, bare_metal_major, bare_metal_minor

 def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
+    return

$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
[...]
Successfully installed apex-0.1

So I successfully built apex against the system-wide cuda-11.1 while having pytorch w/ cuda-11.0 installed.

Yay!

And it works just fine!

Thank you, @ptrblck!
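
The one-line patch above disables the check entirely. A gentler variant, sketched here as a hypothetical standalone function (this is not apex's actual setup.py code), would keep the major-version comparison and make only the minor-version comparison optional:

```python
# Hypothetical sketch of a CUDA version check with an optional minor-version
# bypass. This is not apex's actual setup.py code.

def check_cuda_versions(toolkit_version, torch_cuda_version, skip_minor=False):
    """Raise RuntimeError when the versions differ.

    Both arguments are strings such as "11.1". With skip_minor=True only the
    major versions have to agree.
    """
    tk_major, tk_minor = toolkit_version.split(".")[:2]
    pt_major, pt_minor = torch_cuda_version.split(".")[:2]
    same_major = tk_major == pt_major
    same_minor = tk_minor == pt_minor
    if same_major and (same_minor or skip_minor):
        return
    raise RuntimeError(
        "CUDA toolkit %s does not match the CUDA %s used to build PyTorch"
        % (toolkit_version, torch_cuda_version)
    )


if __name__ == "__main__":
    check_cuda_versions("11.1", "11.0", skip_minor=True)  # passes silently
    try:
        check_cuda_versions("11.1", "11.0")
    except RuntimeError as exc:
        print(exc)
```

In a real build script the second argument could come from `torch.version.cuda`, and the first from parsing the output of nvcc.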

stas00 avatar Nov 12 '20 21:11 stas00

@ptrblck When I install apex, I run into the problem below:

    nvcc fatal   : Unsupported gpu architecture 'compute_86'
    error: command '/usr/local/cuda-11.0/bin/nvcc' failed with exit status 1
    Running setup.py install for apex ... error

I searched for this error but found no answer. How can I resolve this issue? My GPU is an NVIDIA RTX 3090, and my CUDA compiler information is below:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

Thank you!

Welllee12366 avatar Nov 15 '20 06:11 Welllee12366

I'm pretty sure you need cuda-11.1 for that - I built apex with it despite pytorch using cudatoolkit-11.0.

Once you have cuda-11.1 installed, follow the notes in https://github.com/NVIDIA/apex/issues/988#issuecomment-726343453
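
The reason is that each nvcc release only knows the GPU architectures that existed when it shipped. As a rough illustration, the sketch below encodes the minimum toolkit release for a handful of architectures; the table is deliberately incomplete and the names `MIN_CUDA_FOR_ARCH` and `nvcc_can_target` are hypothetical.

```python
# Sketch: minimum CUDA toolkit release whose nvcc can target a given arch.
# Only a few architectures are listed here.
MIN_CUDA_FOR_ARCH = {
    (7, 0): (9, 0),    # Volta
    (7, 5): (10, 0),   # Turing
    (8, 0): (11, 0),   # Ampere, e.g. A100
    (8, 6): (11, 1),   # Ampere, e.g. RTX 3090
}


def nvcc_can_target(arch, cuda_release):
    """Return True if a toolkit of cuda_release (major, minor) knows arch."""
    minimum = MIN_CUDA_FOR_ARCH.get(arch)
    if minimum is None:
        raise KeyError("architecture %r is not in this small table" % (arch,))
    return cuda_release >= minimum


if __name__ == "__main__":
    print(nvcc_can_target((8, 6), (11, 0)))  # the failing setup reported above
    print(nvcc_can_target((8, 6), (11, 1)))
```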

stas00 avatar Nov 15 '20 06:11 stas00

Awesome! Thanks for your reply. I have installed the toolkit successfully.

Welllee12366 avatar Nov 15 '20 06:11 Welllee12366

I added a proper solution here: https://github.com/NVIDIA/apex/pull/997

stas00 avatar Nov 15 '20 19:11 stas00

Hello, I ran into the same problem as @Welllee12366 (nvcc fatal: Unsupported gpu architecture 'compute_86'). Can you tell me how you solved it? Thanks a lot!

Qi-Chuan avatar Jan 06 '21 14:01 Qi-Chuan

@stas00 Hi, I have the same problem: nvcc fatal : Unsupported gpu architecture 'compute_86'. I don't quite understand how you solved it. Can you show me more detail?

zhenhao-huang avatar Jan 16 '21 07:01 zhenhao-huang

@zhenhao-huang:

  1. Install cuda-11.1 system-wide.
  2. Use this branch: https://github.com/NVIDIA/apex/pull/997
  3. Add --global-option="--skip-minor-ver-check" to the pip install command for apex when building from that branch.
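
As a rough illustration of the last step, the sketch below assembles the pip command used earlier in this thread and runs it with a chosen CUDA_HOME. The helper names are hypothetical, the --skip-minor-ver-check option is only understood by the branch from the pull request above, and newer pip releases may treat --global-option differently than the versions used here.

```python
# Sketch: assemble the pip command from the steps above and run it with a
# chosen CUDA_HOME. The --skip-minor-ver-check option is only understood by
# the branch from the linked pull request.
import os
import subprocess


def apex_install_command(skip_minor_ver_check=False):
    """Return the pip argument list used in this thread to build apex."""
    cmd = [
        "pip", "install", "-v", "--no-cache-dir",
        "--global-option=--cpp_ext",
        "--global-option=--cuda_ext",
    ]
    if skip_minor_ver_check:
        cmd.append("--global-option=--skip-minor-ver-check")
    cmd.append("./")
    return cmd


def install_apex(repo_dir, cuda_home, skip_minor_ver_check=False):
    """Run the build inside repo_dir with CUDA_HOME pointing at cuda_home."""
    env = dict(os.environ, CUDA_HOME=cuda_home)
    subprocess.check_call(apex_install_command(skip_minor_ver_check),
                          cwd=repo_dir, env=env)


if __name__ == "__main__":
    # Printing only; call install_apex("apex", "/usr/local/cuda-11.1", True)
    # from a checkout of that branch to actually build.
    print(" ".join(apex_install_command(skip_minor_ver_check=True)))
```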

stas00 avatar Jan 16 '21 07:01 stas00

@stas00 Successfully installed apex-0.1. Thank you!

zhenhao-huang avatar Jan 16 '21 08:01 zhenhao-huang

I used the setup.py hack from @stas00's earlier comment (returning early from check_cuda_torch_binary_vs_bare_metal) and apex installed successfully, but then I ran into this error:

[image]

v-nhandt21 avatar Jun 03 '21 14:06 v-nhandt21

Hi @stas00, I used your branch but still get the error "nvcc fatal: unsupported gpu architecture 'compute_86'" :(

empty-id avatar Jun 26 '21 07:06 empty-id

@empty-id, make sure you have cuda-11.1 or higher installed and configured correctly - please see: https://huggingface.co/transformers/master/main_classes/trainer.html#possible-problem-2

stas00 avatar Jun 26 '21 16:06 stas00

I have now installed cuda-11.1, but I hit the following problem when running PyTorch code with apex, @stas00:

RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed: 

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)


template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}


#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
  struct __align__(2) __half {
    __host__ __device__ __half() { }

  protected:
    unsigned short __x;
  };

  /* All intrinsic functions are only available to nvcc compilers */
  #if defined(__CUDACC__)
    /* Definitions of intrinsics */
    __device__ __half __float2half(const float f) {
      __half val;
      asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
      return val;
    }

    __device__ float __half2float(const __half h) {
      float val;
      asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
      return val;
    }

  #endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_1(half* t0, half* aten_mul_flat) {
{
  float t0_ = __half2float(t0[10240 * (((512 * blockIdx.x + threadIdx.x) / 10240) % 5) + (512 * blockIdx.x + threadIdx.x) % 10240]);
  aten_mul_flat[512 * blockIdx.x + threadIdx.x] = __float2half((t0_ * 0.5f) * ((tanhf((t0_ * 0.7978845834732056f) * ((t0_ * 0.04471499845385551f) * t0_ + 1.f))) + 1.f));
}
}

empty-id avatar Jun 27 '21 02:06 empty-id

Looks like the same error as reported here https://github.com/pytorch/pytorch/issues/47669#issuecomment-725073808 which apparently has been fixed in pytorch many months back. Try pytorch-1.9.0 and if it doesn't work please file a new issue.

In general, use Google to search for similar errors; that is how I found the URL above.

stas00 avatar Jun 27 '21 03:06 stas00

@stas00 Thank you for your reply! I finally got it working. Your hack turned out not to be necessary: with torch-1.9.0 built for CUDA 11.1 and cuda-11.1 installed system-wide, the latest NVIDIA/apex from the GitHub repo installs fine.

empty-id avatar Jun 27 '21 10:06 empty-id

Can you explain it in more detail? Should I create a diff file, copy the code above into it, and run it?

xiao-ming-code avatar Jun 18 '22 13:06 xiao-ming-code