apex
apex not supporting CUDA 11.0? [Help me]
My nvcc version is CUDA 11.0, but the latest PyTorch version I found on this website is built for CUDA 10.2. As a result, I can't install apex properly.
ImportError: cannot import name 'amp'
Software Versions pre-installed:
Nvidia Driver: 450.51v
CUDA: 11v
cuDNN: 8.0v
Python: 3.8
Docker: 19.03.12v
Nvidia-docker: 2.0v
NGC(Nvidia GPU Cloud) CLI: 1.15.0v
I followed these commands:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
A plain import of apex works:
python -c "import apex"
But in the main program it does not work:
Traceback (most recent call last):
File "train.py", line 188, in <module>
train(num_gpus, args.rank, args.group_name, **train_config)
File "train.py", line 83, in train
from apex import amp
ImportError: cannot import name 'amp'
So amp cannot be imported from the apex module.
Please help me to solve this issue @definitelynotmcarilli @thorjohnsen @mcarilli @kexinyu @ptrblck :)
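One thing worth ruling out when `import apex` succeeds but `from apex import amp` fails is that Python is picking up a different `apex` than the one just built (for example, an unrelated package or a local directory with the same name). That cause is an assumption, not something confirmed in this thread. A minimal sketch using only the standard library; `locate_module` is a hypothetical helper name:

```python
import importlib.util


def locate_module(name):
    """Return where a top-level module would be imported from, or None if not found."""
    # find_spec on a top-level name locates the module without importing it.
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single file; report their search locations instead.
    if spec.origin is None:
        return list(spec.submodule_search_locations or [])
    return spec.origin


if __name__ == "__main__":
    # If this prints a path outside your environment's site-packages,
    # or a bare directory list, a different "apex" is shadowing the build.
    print("apex resolves to:", locate_module("apex"))
```

Run it from the same directory and interpreter as train.py, since the current working directory is part of the import search path.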
but I found the pytorch latest version from this website is 10.2
The latest PyTorch binaries can be installed with CUDA11.0 as shown in the install instructions.
Note that mixed-precision training is available in PyTorch directly via torch.cuda.amp as explained here, and we recommend using the native implementation.
In case you have trouble building apex, you could use a PyTorch NGC container with CUDA11.1, where PyTorch and apex are installed.
@ptrblck CUDA 11.0 supports MIG. Is this feature available in PyTorch? Any tips?
I got the following error:
/home/sakaia/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:104: UserWarning:
A100-PCIE-40GB MIG 3g.20gb with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the A100-PCIE-40GB MIG 3g.20gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
File "toy_problem.py", line 87, in <module>
main(args)
File "toy_problem.py", line 41, in main
optimizer = FusedAdam(model.parameters())
File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6.egg/apex/optimizers/fused_adam.py", line 79, in __init__
raise RuntimeError('apex.optimizers.FusedAdam requires cuda extensions')
RuntimeError: apex.optimizers.FusedAdam requires cuda extensions
MIG is not PyTorch-specific and can be enabled on your A100.
The error shows that you are using a PyTorch build that doesn't support the compute capability of your A100 (sm_80), so either install the PyTorch binaries built with CUDA 11.0 or build from source.
Thanks, I used pip3 to install. I will switch to another method.
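To make the mismatch in the warning concrete, here is a minimal sketch of the comparison. It assumes the usual CUDA rule that a binary built for sm_XY runs on devices with the same major version and an equal or higher minor version, and it ignores PTX forward compatibility. The helper names are hypothetical. In a real session the inputs would come from `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`; here they are copied from the warning above.

```python
def parse_arch(arch):
    """Turn an arch string such as 'sm_75' into a (major, minor) tuple."""
    digits = arch.split("_", 1)[1]
    return int(digits[:-1]), int(digits[-1])


def device_is_covered(capability, arch_list):
    """True if any compiled sm_XY arch is binary compatible with the device."""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as 'compute_80' (not modeled here)
        arch_major, arch_minor = parse_arch(arch)
        if arch_major == major and arch_minor <= minor:
            return True
    return False


# Values taken from the warning in this thread: an sm_80 device (A100)
# and a PyTorch build that only ships kernels up to sm_75.
supported = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
print(device_is_covered((8, 0), supported))  # False
```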
How can I tell apex to use cuda-11.0? I have both cuda-11.0 and cuda-11.1 installed and it fails to build as it doesn't find cuda-11.0:
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
from /usr/local/cuda-11.1/bin
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-yz2qpdod/setup.py", line 152, in <module>
check_cuda_torch_binary_vs_bare_metal(torch.utils.cpp_extension.CUDA_HOME)
File "/tmp/pip-req-build-yz2qpdod/setup.py", line 102, in check_cuda_torch_binary_vs_bare_metal
raise RuntimeError("Cuda extensions are being compiled with a version of Cuda that does " +
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 11.0.
Also would it be possible to make apex builds on conda-forge for cuda11.0 and cuda11.1?
Thank you!
@stas00 you can try to use CUDA_HOME=/usr/local/cuda-11.0 to specify the wanted CUDA version.
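If you drive the build from a script, the same idea can be sketched in Python. `env_for_cuda` is a hypothetical helper. Prepending the toolkit's bin directory to PATH is an extra assumption beyond the CUDA_HOME suggestion above, added so that a bare `nvcc` lookup also finds the intended version.

```python
import os


def env_for_cuda(cuda_home, base_env=None):
    """Return a copy of the environment that points builds at one CUDA toolkit."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = cuda_home
    bin_dir = os.path.join(cuda_home, "bin")
    old_path = env.get("PATH", "")
    # Put the chosen toolkit first so its nvcc wins over other installed versions.
    env["PATH"] = bin_dir + os.pathsep + old_path if old_path else bin_dir
    return env


if __name__ == "__main__":
    env = env_for_cuda("/usr/local/cuda-11.0")
    print(env["CUDA_HOME"])
    print(env["PATH"].split(os.pathsep)[0])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking the pip install command.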
Awesome!
CUDA_HOME=/usr/local/cuda-11.0 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
but no luck building it:
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/ATen/Parallel.h:149,
from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
from /home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/torch/extension.h:4,
from /tmp/pip-req-build-ngd468_f/csrc/amp_C_frontend.cpp:1:
/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/include/ATen/ParallelOpenMP.h:84: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
84 | #pragma omp parallel for if ((end - begin) >= grain_size)
|
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1533, in _run_ninja_build
subprocess.run(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-ngd468_f/setup.py", line 405, in <module>
setup(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/install.py", line 545, in run
self.run_command('build')
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 674, in build_extensions
build_ext.build_extensions(self)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
objects = self.compiler.compile(sources,
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 494, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1252, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2i80o0cf/install-record.txt --single-version-externally-managed --compile --install-headers /home/stas/anaconda3/envs/main-38/include/python3.8/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 838, in install
success = install_legacy(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/operations/install/legacy.py", line 86, in install
raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 228, in _main
status = self.run(options, args)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 182, in wrapper
return func(self, options, args)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 397, in run
installed = install_given_reqs(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/__init__.py", line 82, in install_given_reqs
requirement.install(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/req/req_install.py", line 856, in install
six.reraise(*exc.parent)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_vendor/six.py", line 703, in reraise
raise value
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/operations/install/legacy.py", line 74, in install
runner(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 273, in runner
call_subprocess(
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /home/stas/anaconda3/envs/main-38/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ngd468_f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-2i80o0cf/install-record.txt --single-version-externally-managed --compile --install-headers /home/stas/anaconda3/envs/main-38/include/python3.8/apex Check the logs for full command output.
I don't see the error message besides that it's failing and don't know if the right CUDA version was found now.
If you've linked the versioned CUDA toolkits to /usr/local/cuda, could you recreate the symbolic links with the desired CUDA version?
Alternatively, since only the minor version differs, you could also try to disable the minor version check (you should get an error with a link to more information in your first run), and rebuild it.
There is no option to do that, so I had to hack setup.py to disable the check:
diff --git a/setup.py b/setup.py
index 063b42d..9eabb49 100644
--- a/setup.py
+++ b/setup.py
@@ -91,6 +91,7 @@ def get_cuda_bare_metal_version(cuda_dir):
     return raw_output, bare_metal_major, bare_metal_minor
 def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
+    return
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
[...]
Successfully installed apex-0.1
So I successfully built apex against system-wide cuda-11.1, while having pytorch w/ cuda-11.0 installed.
Yay!
And it works just fine!
Thank you, @ptrblck!
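For readers wondering what kind of comparison the check performs: the sketch below is not apex's actual setup.py code, only an illustration based on the error message and nvcc output quoted in this thread. The function names are hypothetical. In practice the second argument would be `torch.version.cuda`.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def compare_cuda_versions(nvcc_output, torch_cuda_version):
    """Return 'match', 'minor-mismatch', or 'major-mismatch'."""
    nvcc_major, nvcc_minor = parse_nvcc_release(nvcc_output)
    torch_major, torch_minor = (int(p) for p in torch_cuda_version.split(".")[:2])
    if nvcc_major != torch_major:
        return "major-mismatch"
    if nvcc_minor != torch_minor:
        return "minor-mismatch"
    return "match"


# The situation from this thread: system nvcc 11.1, PyTorch binaries built with 11.0.
print(compare_cuda_versions("Cuda compilation tools, release 11.1, V11.1.105", "11.0"))
```

Only the minor-mismatch case is the one reported to work in this thread; a major mismatch is a different situation and is not covered by the posts here.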
@ptrblck When I installed apex, I ran into the problem below:
nvcc fatal : Unsupported gpu architecture 'compute_86'
error: command '/usr/local/cuda-11.0/bin/nvcc' failed with exit status 1
Running setup.py install for apex ... error
I searched for this error but found no answer. How should I handle this issue? My GPU is an NVIDIA RTX 3090, and my CUDA compiler info is below:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
Thank you!
I'm pretty sure you need cuda-11.1 for that - I built apex with it despite pytorch using cudatoolkit-11.0.
Once you have cuda-11.1 installed, follow the notes in https://github.com/NVIDIA/apex/issues/988#issuecomment-726343453
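The reason behind "you need cuda-11.1" can be sketched as a lookup: each GPU architecture is only accepted by nvcc from a certain toolkit release onward. The table below is a small subset written from memory of the CUDA release notes (Turing in 10.0, A100 in 11.0, RTX 30-series in 11.1), so treat it as an assumption to verify. `nvcc_accepts` is a hypothetical helper.

```python
# First CUDA toolkit release whose nvcc accepts each architecture.
# Subset only; values are assumptions to check against the CUDA release notes.
FIRST_TOOLKIT_FOR_ARCH = {
    "compute_75": (10, 0),  # Turing
    "compute_80": (11, 0),  # A100
    "compute_86": (11, 1),  # RTX 30-series, e.g. RTX 3090
}


def nvcc_accepts(arch, toolkit_version):
    """True if a toolkit of the given (major, minor) version knows the architecture."""
    return toolkit_version >= FIRST_TOOLKIT_FOR_ARCH[arch]


# The RTX 3090 case from this thread.
print(nvcc_accepts("compute_86", (11, 0)))  # False
print(nvcc_accepts("compute_86", (11, 1)))  # True
```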
Awesome! Thanks for your reply, I have installed the toolkit successfully.
I added a proper solution here: https://github.com/NVIDIA/apex/pull/997
Hello, I ran into the same "Unsupported gpu architecture 'compute_86'" problem. Can you tell me how you solved it? Thanks a lot!
@stas00 Hi, I have the same problem: nvcc fatal : Unsupported gpu architecture 'compute_86'. I don't quite understand how you solved it. Can you show me more detail?
@zhenhao-huang:
- Install cuda-11.1 system-wide
- Use this branch https://github.com/NVIDIA/apex/pull/997
- Add --global-option="--skip-minor-ver-check" to the pip install apex command in that branch
@stas00 Successfully installed apex-0.1. Thank you!
I used the setup.py hack above to disable the version check and apex installed successfully, but then I ran into this error:

Hi @stas00 , I used your branch but still get the error "nvcc fatal: unsupported gpu architecture 'compute_86'" :(
@empty-id, make sure you have cuda-11.1 or higher installed and configured correctly - please see: https://huggingface.co/transformers/master/main_classes/trainer.html#possible-problem-2
@stas00 I have now installed cuda-11.1, but I hit the following problem when I run PyTorch code with apex:
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)
nvrtc compilation failed:
#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)
template<typename T>
__device__ T maximum(T a, T b) {
return isnan(a) ? a : (a > b ? a : b);
}
template<typename T>
__device__ T minimum(T a, T b) {
return isnan(a) ? a : (a < b ? a : b);
}
#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
struct __align__(2) __half {
__host__ __device__ __half() { }
protected:
unsigned short __x;
};
/* All intrinsic functions are only available to nvcc compilers */
#if defined(__CUDACC__)
/* Definitions of intrinsics */
__device__ __half __float2half(const float f) {
__half val;
asm("{ cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
return val;
}
__device__ float __half2float(const __half h) {
float val;
asm("{ cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
return val;
}
#endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS
typedef __half half;
extern "C" __global__
void func_1(half* t0, half* aten_mul_flat) {
{
float t0_ = __half2float(t0[10240 * (((512 * blockIdx.x + threadIdx.x) / 10240) % 5) + (512 * blockIdx.x + threadIdx.x) % 10240]);
aten_mul_flat[512 * blockIdx.x + threadIdx.x] = __float2half((t0_ * 0.5f) * ((tanhf((t0_ * 0.7978845834732056f) * ((t0_ * 0.04471499845385551f) * t0_ + 1.f))) + 1.f));
}
}
Looks like the same error as reported here https://github.com/pytorch/pytorch/issues/47669#issuecomment-725073808 which apparently has been fixed in pytorch many months back. Try pytorch-1.9.0 and if it doesn't work please file a new issue.
In general, use Google to search for similar errors; that is how I found the URL above.
@stas00 Thank you for your reply! I finally got it working. I found that your hack is not necessary: with cuda-11.1 installed system-wide, using torch-1.9.0 built for CUDA 11.1 to install the latest NVIDIA/apex from the GitHub repo works fine.
Can you explain it in more detail? Create a diff file, copy the code above, and run it.