[feature request] Support for AMD GPU
Following up on #455, I'd love to be able to run torch on my AMD GPU. My hardware is available for any test / debug / experiment around it. Thanks!
Cool @cregouby !
In order to get support for AMD GPUs we will need to figure out:
Nice push! I'm on it in https://github.com/cregouby/torch/tree/platform/amd_gpu. Currently step 1 seems to be off to a good start:
```
~/R/_packages/torch/lantern/build$ cmake ..
-- The C compiler identification is GNU 11.2.0
-- The CXX compiler identification is GNU 11.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Downloading /home/___/R/_packages/torch/lantern/build/libtorch.zip: https://download.pytorch.org/libtorch/rocm5.1.1/libtorch-cxx11-abi-shared-with-deps-1.12.1%2Brocm5.1.1.zip
```
I still need to add a version-matching check (currently the downloaded libtorch ROCm version is not matched against the ROCm version available on my machine).
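As a rough sketch of such a check: ROCm installations usually record their version in a file under the install prefix, which could be compared with the `rocmX.Y.Z` tag in the libtorch download URL (the `/opt/rocm` path is an assumption; adjust to your install):

```shell
# Hypothetical check: print the locally installed ROCm version, if any,
# so it can be compared against the rocmX.Y.Z tag of the libtorch URL
cat /opt/rocm/.info/version 2>/dev/null || echo "ROCm not found under /opt/rocm"
```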
Nice! This is looking great! Maybe ROCm can work with minor version mismatches? That's not the case for CUDA, but you could try.
Sure !
I'm currently dealing with the GitHub Actions workflow, and I'm wondering which `runs-on` value should be selected to get AMD GPU hardware to run on. Any idea on this? (I have to admit the hardware side of GitHub runners is unclear to me.)
I think you can cross-compile on the default Ubuntu runner after installing the ROCm compilers. I.e., I think you can compile for ROCm on a machine that doesn't include an AMD GPU.
See e.g.: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#installing-development-packages-for-cross-compilation
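A very rough sketch of what such a job could look like on a stock runner. The `amdgpu-install` URL, package version, and the build entry-point script are all assumptions and would need verification against AMD's installer docs and the repo layout:

```yaml
# Hypothetical sketch: cross-compile lantern for ROCm on a runner without an AMD GPU.
# The installer URL/version below is an assumption, not a verified artifact.
jobs:
  build-lantern-rocm:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Install the ROCm toolchain (no GPU driver needed for compiling)
        run: |
          wget -q https://repo.radeon.com/amdgpu-install/5.4.2/ubuntu/jammy/amdgpu-install_5.4.50402-1_all.deb
          sudo apt-get install -y ./amdgpu-install_5.4.50402-1_all.deb
          sudo amdgpu-install -y --usecase=rocmdev --no-dkms
      - name: Build lantern
        run: Rscript -e 'source("tools/build_lantern.R")'  # assumed build entry point
```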
I've made good progress on step 3 (maybe the easiest one).
I'm still fighting hard with step 1, making step-by-step progress. I've now fixed the hipBLAS requirement, and I'm now dealing with three more required packages: hipFFT, hipRAND, and hipSPARSE. I'll keep you up to date...
Some news on the task:
- cmake now completes successfully on lantern
- `make -j8` fails with a weird error:
```
....
[ 39%] Building CXX object CMakeFiles/lantern.dir/src/Dimname.cpp.o
In file included from /home/____/R/_packages/torch/lantern/src/Dtype.cpp:8:
In file included from /home/____/R/_packages/torch/lantern/src/utils.hpp:2:
/home/____/R/_packages/torch/lantern/include/lantern/types.h:13:10: warning: pack fold expression is a C++17 extension [-Wc++17-extensions]
  ...);
     ^
/home/____/R/_packages/torch/lantern/include/lantern/types.h:9:3: error: no member named 'apply' in namespace 'std'; did you mean 'torch::apply'?
  std::apply(
  ^~~~~~~~~~
  torch::apply
/home/____/R/_packages/torch/lantern/build/libtorch/include/torch/csrc/utils/variadic.h:118:6: note: 'torch::apply' declared here
void apply(Function function, Ts&&... ts) {
     ^
1 warning and 1 error generated when compiling for gfx900.
...
make[2]: *** [CMakeFiles/lantern.dir/build.make:76 : CMakeFiles/lantern.dir/src/lantern.cpp.o] Erreur 1
make[1]: *** [CMakeFiles/Makefile2:85 : CMakeFiles/lantern.dir/all] Erreur 2
make: *** [Makefile:91 : all] Erreur 2
```
Any suggestion would be appreciated!
Great!!
Perhaps something equivalent to the line below is missing for ROCm?
https://github.com/mlverse/torch/blob/fef4bf086c9fa4c5420997c04f01190cb4594d5d/lantern/CMakeLists.txt#L192
It seems that setting this would help: https://cmake.org/cmake/help/latest/prop_tgt/HIP_STANDARD.html
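For reference, a minimal sketch of what setting that property could look like in lantern's CMakeLists.txt. The target name `lantern` is taken from the build output above; whether plain `-std=c++17` also needs forcing for hipcc-compiled CXX sources is an assumption:

```cmake
# Hypothetical sketch: request C++17 for HIP sources on the lantern target
set_target_properties(lantern PROPERTIES
  HIP_STANDARD 17
  HIP_STANDARD_REQUIRED ON)
# Assumption: CXX sources routed through hipcc may also need the flag explicitly
target_compile_options(lantern PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-std=c++17>)
```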
Thanks for the hint; setting it to 14 or 17 did not remove the C++17 extension warning...
For the error `lantern/types.h:9:3: error: no member named 'apply' in namespace 'std'; did you mean 'torch::apply'?`, I made the change in types.h (I must admit I'm completely lost about what to do and not do in .h files):
https://github.com/cregouby/torch/blob/9c67675d43862cb53c7b47df7c5451eb741798ec/lantern/include/lantern/types.h#L9
and now the lantern build target reaches 100%.
My two big uncertainties right now are:
- what is the impact of changing `std::apply` into `torch::apply` in types.h?
- is `src/Contrib/SortVertices/sort_vert_cpu.cpp` sufficient to build on ROCm? i.e. not including `src/AllocatorCuda.cpp` and `src/Contrib/SortVertices/sort_vert_kernel.cu`...
I don't think `torch::apply` is equivalent to `std::apply`...
I think `torch::apply` is equivalent to https://pytorch.org/docs/stable/generated/torch.Tensor.apply_.html while `std::apply` is metaprogramming stuff from C++: https://en.cppreference.com/w/cpp/utility/apply
`std::apply` is a C++17 feature, so that warning is probably caused by the compiler not supporting C++17, or maybe the HIP standard flag is not being correctly propagated. AFAICT, in the CUDA world, nvcc (the compiler that supports CUDA) works like a preprocessor: it takes the CUDA parts and compiles them, while the non-CUDA part of the code is forwarded to a C++ compiler, and that's where those flags matter.
Yeah, I think you don't need to provide a HIP kernel for the Contrib stuff, so just building with the CPU version should be fine.
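The source selection could be sketched roughly like this in CMake. The variable `USE_CUDA` and the source-list name `LANTERN_SRC` are hypothetical placeholders; the actual CMakeLists.txt may organize this differently:

```cmake
# Hypothetical sketch: pick contrib sources by GPU backend
if(USE_CUDA)
  list(APPEND LANTERN_SRC
    src/AllocatorCuda.cpp
    src/Contrib/SortVertices/sort_vert_kernel.cu)
else()
  # ROCm (and CPU-only) builds fall back to the CPU implementation
  list(APPEND LANTERN_SRC src/Contrib/SortVertices/sort_vert_cpu.cpp)
endif()
```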
Thanks for those hints, I'll try to rework based on that! FYI, the 100% build of lantern makes `install_torch_from_file()` fail with:
```
install_torch(version = version, type = type, install_config = install_config)
Erreur dans cpp_lantern_init(file.path(install_path(), "lib")) :
  /home/____/R/x86_64-pc-linux-gnu-library/4.2/torch/lib/liblantern.so - /home/____/R/x86_64-pc-linux-gnu-library/4.2/torch/lib/liblantern.so: undefined symbol: _ZN2at4_ops4rand4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE
```
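One way to investigate an `undefined symbol` error like this is to demangle the symbol with `c++filt` (from binutils) to see which libtorch function the loader can't find:

```shell
# Demangle the unresolved symbol from the error above to see which
# libtorch function signature liblantern.so was linked against
echo '_ZN2at4_ops4rand4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE' \
  | c++filt
```

If the demangled name is an `at::_ops::rand::call(...)` overload, that points to a mismatch between the libtorch headers lantern was compiled against and the libtorch shared library it loads at runtime.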
And despite my efforts, I can't get the HIP compiler to accept C++17 code... I'll ask the authors... or maybe try something else based on https://github.com/ROCm-Developer-Tools/HIP/blob/809149ecc8d751acd3c1595b590090cd86ada8df/bin/hipcc.pl#L397
```perl
# nvcc does not handle standard compiler options properly
# This can prevent hipcc being used as standard CXX/C Compiler
# To fix this we need to pass -Xcompiler for options
```
That's great progress!! 👍
Hmm, this seems to be related to the clang version, perhaps? Or something like this?
Ah, some news here after some deeper investigation:
**Support and Compatibility**

| libtorch public / nightly | ROCm | Ubuntu installer | gfx card support | R torch |
|---|---|---|---|---|
| - | 5.0 | - | 908, 90a | - |
| 1.13.0 - 1.13.1 / 1.13.0 - 2.0.0 | 5.2 | 18.04 / 20.04 (1) | add 1011 (2) | 0.10.0 |
| - | 5.3.0 | 22.04 | add 11xx | - |
| 2.0.0 - 2.0.1 / 2.0.0 - 2.1.0 | 5.4.2 | 22.04 | add 1100, 1102 | 0.12.0 |
**Liblantern build**

Strictly following the compatibility table, I've been able to build `liblantern.so` for:
- ROCm 5.2
- ROCm 5.4.2

using the official `build_lantern.R`.
**{torch}**

I've tweaked the torch download a bit and now get the following success:
```
> # copy lantern
> source("R/install.R")
> source("R/lantern_sync.R")
> lantern_sync(TRUE)
[1] TRUE
> library(torch)
Attachement du package : ‘torch’
Les objets suivants sont masqués _par_ ‘.GlobalEnv’:
    get_install_libs_url, install_torch, install_torch_from_file, torch_install_path, torch_is_installed
> torch_version
[1] "2.0.1"
> tt <- torch_tensor(c(1,2,3,4), device = "cuda")
> tt
torch_tensor
 1
 2
 3
 4
[ CUDAFloatType{4} ]
```
which is amazing!
I still have a discrepancy: R currently crashes when running `tt + 1`, due to a possible version mismatch between libtorch and {torch}.
But I can feel the taste of success...
This is very exciting! Is there a way I can help test? I have an AMD ROCm computer and I would love it if torch worked on the GPU, just like pytorch!
Hello @RMHogervorst ,
I'm glad you want to help!
You should clone the repo and switch to the platform/amd_gpu branch, where building the ROCm lantern is documented, following /.github/CONTRIBUTING.md.
In order to build lantern for torch 0.12, you will need the ROCm 5.4.2 suite on your machine.
Let us know if you can build it.
@cregouby, after cloning your repository:
- first I installed all packages (I used renv for that)
- I had to create the lantern directory (otherwise the build_lantern condition is not true)
- I installed cmake
- I ran `source("tools/build_lantern.R")` and got:

```
CMake Error: The source directory "/home/roel/Documents/projecten/experimenten/torch/lantern" does not appear to contain CMakeLists.txt.
object path not found in lantern_sync
```

I think I'm missing something. I have installed the latest version of ROCm, 6.0.2; I can probably install the 5.4.2 version too, but I don't think this error is related to the ROCm version.
I realized that there are CMakeLists files in the src directory. (I don't have much experience building C projects, so I'll probably learn a lot, and do some stupid stuff.) From the src directory I:
- ran `cmake .`
- ran `cmake --build . --target lantern --config Release --parallel 8`

This builds a library, but it seems to build it for CPU.
Sorry @RMHogervorst, I didn't commit my experimental lantern/CMakeLists.txt.
You should get it now if you `git pull` the cregouby/torch repo again on branch platform/amd_gpu.
Feel free to question or improve every line of the CMakeLists.txt file, as makefiles are far outside my comfort zone.
After lantern is compiled, you may want to set up some environment variables.
These are mine, stored in .Renviron (again, they may need some changes):
```
# ---- torch / lantern build ----
# change the ARCH target at `make` time
HCC_AMDGPU_TARGET=gfx900
USE_ROCM=1
BUILD_LANTERN=1
# ---- torch lantern package build ----
MAKE=make -j10
LD_LIBRARY_PATH=/opt/rocm-5.4.2/lib:/opt/rocm-5.4.2/llvm/lib:~/R/_packages/torch/inst/lib:~/R/x86_64-pc-linux-gnu-library/4.3/torch/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/snap/bin:/opt/rocm-5.4.2:/opt/rocm-5.4.2/bin
ROCM_PATH=/opt/rocm
# ---- local liblantern.so usage ----
# may need a ln -s of a liblantern_<version>.so in the same directory
# TORCH_URL can be 3 different things:
# - a real URL
# - a path to a zip file containing the library
# - a path to a directory containing the files to be installed
# if set, it skips the download within lantern/CMakeLists.txt
# TORCH_URL=https://download.pytorch.org/libtorch/rocm5.4.2/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip
# local cache of the previous
TORCH_URL="~/R/_packages/torch_experiment/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip"
TORCH_INSTALL_DEBUG=1
```
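Regarding the `ln -s` note above, a sketch of creating the versioned symlink. `LIB_DIR` and `VERSION` are assumptions taken from my own paths; adjust them to your installation:

```shell
# Hypothetical sketch: expose liblantern.so under a versioned name next to it.
# LIB_DIR and VERSION are assumptions; adjust to your installation.
LIB_DIR="${HOME}/R/x86_64-pc-linux-gnu-library/4.3/torch/lib"
VERSION="0.12.0"
if [ -f "${LIB_DIR}/liblantern.so" ]; then
  ln -sf "${LIB_DIR}/liblantern.so" "${LIB_DIR}/liblantern_${VERSION}.so"
fi
```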