[feature request] Support for AMD GPU
Following up on #455, I'd love to be able to run torch on my AMD GPU. My hardware is available for any test / debug / experiment around it. Thanks!
Cool @cregouby !
In order to get support for AMD GPUs we will need to figure out:
Nice push! I'm on it in https://github.com/cregouby/torch/tree/platform/amd_gpu. Currently step 1 seems to be off to a good start:
```
~/R/_packages/torch/lantern/build$ cmake ..
-- The C compiler identification is GNU 11.2.0
-- The CXX compiler identification is GNU 11.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Downloading /home/___/R/_packages/torch/lantern/build/libtorch.zip: https://download.pytorch.org/libtorch/rocm5.1.1/libtorch-cxx11-abi-shared-with-deps-1.12.1%2Brocm5.1.1.zip
```
I still need to add a version-matching check (currently the downloaded libtorch ROCm version is not matched against the ROCm version available on my machine).
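As a rough sketch of such a check: ROCm installations usually record their version in a file under the install prefix, which could be compared with the `rocmX.Y.Z` tag in the libtorch download URL (the `/opt/rocm` path is an assumption; adjust to your install):

```shell
# Hypothetical check: print the locally installed ROCm version, if any,
# so it can be compared against the rocmX.Y.Z tag of the libtorch URL
cat /opt/rocm/.info/version 2>/dev/null || echo "ROCm not found under /opt/rocm"
```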
Nice! This is looking great! Maybe ROCm can work with minor version mismatches? That's not the case for CUDA, but you could try.
Sure !
I'm currently dealing with the GitHub Actions workflow, and I'm wondering which `runs-on` value should be selected to get AMD GPU hardware to run on. Any idea on this? (I have to admit the hardware side of GitHub runners is unclear to me.)
I think you can cross-compile on the default Ubuntu runner after installing the ROCm compilers. I.e., I think you can compile for ROCm on a machine that doesn't include an AMD GPU.
See e.g.: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html#installing-development-packages-for-cross-compilation
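A very rough sketch of what such a job could look like on a stock runner. The `amdgpu-install` URL, package version, and the build entry-point script are all assumptions and would need verification against AMD's installer docs and the repo layout:

```yaml
# Hypothetical sketch: cross-compile lantern for ROCm on a runner without an AMD GPU.
# The installer URL/version below is an assumption, not a verified artifact.
jobs:
  build-lantern-rocm:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Install the ROCm toolchain (no GPU driver needed for compiling)
        run: |
          wget -q https://repo.radeon.com/amdgpu-install/5.4.2/ubuntu/jammy/amdgpu-install_5.4.50402-1_all.deb
          sudo apt-get install -y ./amdgpu-install_5.4.50402-1_all.deb
          sudo amdgpu-install -y --usecase=rocmdev --no-dkms
      - name: Build lantern
        run: Rscript -e 'source("tools/build_lantern.R")'  # assumed build entry point
```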
I've made good progress on step 3 (maybe the easiest one).
I'm still fighting hard with step 1, making step-by-step progress. I've now fixed the hipBLAS requirement, and I'm now dealing with three more required packages: hipFFT, hipRAND, and hipSPARSE. I'll keep you up to date...
Some news on the task:
- cmake now completes successfully on lantern
- `make -j8` fails with a weird error:
```
....
[ 39%] Building CXX object CMakeFiles/lantern.dir/src/Dimname.cpp.o
In file included from /home/____/R/_packages/torch/lantern/src/Dtype.cpp:8:
In file included from /home/____/R/_packages/torch/lantern/src/utils.hpp:2:
/home/____/R/_packages/torch/lantern/include/lantern/types.h:13:10: warning: pack fold expression is a C++17 extension [-Wc++17-extensions]
  ...);
     ^
/home/____/R/_packages/torch/lantern/include/lantern/types.h:9:3: error: no member named 'apply' in namespace 'std'; did you mean 'torch::apply'?
  std::apply(
  ^~~~~~~~~~
  torch::apply
/home/____/R/_packages/torch/lantern/build/libtorch/include/torch/csrc/utils/variadic.h:118:6: note: 'torch::apply' declared here
void apply(Function function, Ts&&... ts) {
     ^
1 warning and 1 error generated when compiling for gfx900.
...
make[2]: *** [CMakeFiles/lantern.dir/build.make:76 : CMakeFiles/lantern.dir/src/lantern.cpp.o] Erreur 1
make[1]: *** [CMakeFiles/Makefile2:85 : CMakeFiles/lantern.dir/all] Erreur 2
make: *** [Makefile:91 : all] Erreur 2
```
Any suggestion would be appreciated!
Great!!
Perhaps something equivalent to the line below is missing for ROCm?
https://github.com/mlverse/torch/blob/fef4bf086c9fa4c5420997c04f01190cb4594d5d/lantern/CMakeLists.txt#L192
It seems that setting this would help: https://cmake.org/cmake/help/latest/prop_tgt/HIP_STANDARD.html
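For reference, a minimal sketch of what setting that property could look like in lantern's CMakeLists.txt. The target name `lantern` is taken from the build output above; whether plain `-std=c++17` also needs forcing for hipcc-compiled CXX sources is an assumption:

```cmake
# Hypothetical sketch: request C++17 for HIP sources on the lantern target
set_target_properties(lantern PROPERTIES
  HIP_STANDARD 17
  HIP_STANDARD_REQUIRED ON)
# Assumption: CXX sources routed through hipcc may also need the flag explicitly
target_compile_options(lantern PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-std=c++17>)
```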
Thanks for the hint; setting it to 14 or 17 did not remove the C++17 extension warning...
For the error `lantern/types.h:9:3: error: no member named 'apply' in namespace 'std'; did you mean 'torch::apply'?`, I made the change in types.h (I must admit I'm completely lost about what to do and not do in .h files):
https://github.com/cregouby/torch/blob/9c67675d43862cb53c7b47df7c5451eb741798ec/lantern/include/lantern/types.h#L9
and now the lantern build target reaches 100%.
My two big uncertainties right now are:
- what is the impact of changing `std::apply` into `torch::apply` in types.h?
- is `src/Contrib/SortVertices/sort_vert_cpu.cpp` sufficient to build on ROCm? i.e. not including `src/AllocatorCuda.cpp` and `src/Contrib/SortVertices/sort_vert_kernel.cu`...
I don't think `torch::apply` is equivalent to `std::apply`...
I think `torch::apply` is equivalent to https://pytorch.org/docs/stable/generated/torch.Tensor.apply_.html while `std::apply` is metaprogramming stuff from C++: https://en.cppreference.com/w/cpp/utility/apply
`std::apply` is a C++17 feature, so that warning is probably caused by the compiler not supporting C++17, or maybe the HIP standard flag is not being correctly propagated. AFAICT, in the CUDA world, nvcc (the compiler that supports CUDA) works like a preprocessor: it takes the CUDA parts and compiles them, while the non-CUDA part of the code is forwarded to a C++ compiler, and that's where those flags matter.
Yeah, I think you don't need to provide a HIP kernel for the Contrib stuff, so just building with the CPU version should be fine.
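The source selection could be sketched roughly like this in CMake. The variable `USE_CUDA` and the source-list name `LANTERN_SRC` are hypothetical placeholders; the actual CMakeLists.txt may organize this differently:

```cmake
# Hypothetical sketch: pick contrib sources by GPU backend
if(USE_CUDA)
  list(APPEND LANTERN_SRC
    src/AllocatorCuda.cpp
    src/Contrib/SortVertices/sort_vert_kernel.cu)
else()
  # ROCm (and CPU-only) builds fall back to the CPU implementation
  list(APPEND LANTERN_SRC src/Contrib/SortVertices/sort_vert_cpu.cpp)
endif()
```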
Thanks for those hints, I'll try to rework based on that! FYI, the 100% build of lantern makes `install_torch_from_file()` fail with:
```
install_torch(version = version, type = type, install_config = install_config)
Erreur dans cpp_lantern_init(file.path(install_path(), "lib")) :
  /home/____/R/x86_64-pc-linux-gnu-library/4.2/torch/lib/liblantern.so - /home/____/R/x86_64-pc-linux-gnu-library/4.2/torch/lib/liblantern.so: undefined symbol: _ZN2at4_ops4rand4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE
```
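One way to investigate an `undefined symbol` error like this is to demangle the symbol with `c++filt` (from binutils) to see which libtorch function the loader can't find:

```shell
# Demangle the unresolved symbol from the error above to see which
# libtorch function signature liblantern.so was linked against
echo '_ZN2at4_ops4rand4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE' \
  | c++filt
```

If the demangled name is an `at::_ops::rand::call(...)` overload, that points to a mismatch between the libtorch headers lantern was compiled against and the libtorch shared library it loads at runtime.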
And despite my efforts, I can't get the HIP compiler to accept C++17 code... I'll ask the authors... or maybe try something else based on https://github.com/ROCm-Developer-Tools/HIP/blob/809149ecc8d751acd3c1595b590090cd86ada8df/bin/hipcc.pl#L397
```perl
# nvcc does not handle standard compiler options properly
# This can prevent hipcc being used as standard CXX/C Compiler
# To fix this we need to pass -Xcompiler for options
```
That's great progress!! 👍
Hmm, this seems to be related to the clang version, perhaps? Or something like this?
Ah, some news here after some deeper investigation:
**Support and Compatibility**

| libtorch public / nightly | ROCm | Ubuntu installer | gfx card support | R torch |
|---|---|---|---|---|
| - | 5.0 | - | 908, 90a | - |
| 1.13.0 - 1.13.1 / 1.13.0 - 2.0.0 | 5.2 | 18.04 / 20.04 (1) | add 1011 (2) | 0.10.0 |
| - | 5.3.0 | 22.04 | add 11xx | - |
| 2.0.0 - 2.0.1 / 2.0.0 - 2.1.0 | 5.4.2 | 22.04 | add 1100, 1102 | 0.12.0 |
**Liblantern build**

Strictly following the compatibility table, I've been able to build `liblantern.so` for:
- ROCm 5.2
- ROCm 5.4.2

using the official `build_lantern.R`.
**{torch}**

I've tweaked the torch download a bit and now get the following success:
```
> # copy lantern
> source("R/install.R")
> source("R/lantern_sync.R")
> lantern_sync(TRUE)
[1] TRUE
> library(torch)
Attachement du package : ‘torch’
Les objets suivants sont masqués _par_ ‘.GlobalEnv’:
    get_install_libs_url, install_torch, install_torch_from_file, torch_install_path, torch_is_installed
> torch_version
[1] "2.0.1"
> tt <- torch_tensor(c(1,2,3,4), device = "cuda")
> tt
torch_tensor
 1
 2
 3
 4
[ CUDAFloatType{4} ]
```
which is amazing!
I still have a discrepancy: R currently crashes when running `tt + 1`, due to a possible version mismatch between libtorch and {torch}.
But I can feel the taste of success...
This is very exciting! Is there a way I can help test? I have an AMD ROCm computer and I would love it if torch worked on the GPU, just like pytorch!
Hello @RMHogervorst ,
I'm glad you want to help!
You should clone the repo and switch to the platform/amd_gpu branch, where building the ROCm lantern is documented, following /.github/CONTRIBUTING.md.
In order to build lantern for torch 0.12, you will need the ROCm 5.4.2 suite on your machine.
Let us know if you can build it.
@cregouby, after cloning your repository:
- first I installed all packages (I used renv for that)
- I had to create the lantern directory (otherwise the build_lantern condition is not true)
- I installed cmake
- I ran `source("tools/build_lantern.R")` and got:

```
CMake Error: The source directory "/home/roel/Documents/projecten/experimenten/torch/lantern" does not appear to contain CMakeLists.txt.
object path not found in lantern_sync
```

I think I'm missing something. I have installed the latest version of ROCm, 6.0.2; I can probably install the 5.4.2 version too, but I don't think this error is related to the ROCm version.
I realized that there are CMakeLists files in the src directory. (I don't have much experience building C projects, so I'll probably learn a lot, and do some stupid stuff.) From the src directory I:
- ran `cmake .`
- ran `cmake --build . --target lantern --config Release --parallel 8`

This builds a library, but it seems to build it for CPU.
Sorry @RMHogervorst, I didn't commit my experimental lantern/CMakeLists.txt.
You should get it now if you `git pull` the cregouby/torch repo again on branch platform/amd_gpu.
Feel free to question or improve every line of the CMakeLists.txt file, as makefiles are far outside my comfort zone.
After lantern is compiled, you may want to set up some environment variables.
These are mine, stored in .Renviron (again, they may need some changes):
```
# ---- torch / lantern build ----
# change the ARCH target at `make` time
HCC_AMDGPU_TARGET=gfx900
USE_ROCM=1
BUILD_LANTERN=1
# ---- torch lantern package build ----
MAKE=make -j10
LD_LIBRARY_PATH=/opt/rocm-5.4.2/lib:/opt/rocm-5.4.2/llvm/lib:~/R/_packages/torch/inst/lib:~/R/x86_64-pc-linux-gnu-library/4.3/torch/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/snap/bin:/opt/rocm-5.4.2:/opt/rocm-5.4.2/bin
ROCM_PATH=/opt/rocm
# ---- local liblantern.so usage ----
# may need a ln -s of a liblantern_<version>.so in the same directory
# TORCH_URL can be 3 different things:
# - a real URL
# - a path to a zip file containing the library
# - a path to a directory containing the files to be installed
# if set, it skips the download within lantern/CMakeLists.txt
# TORCH_URL=https://download.pytorch.org/libtorch/rocm5.4.2/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip
# local cache of the previous
TORCH_URL="~/R/_packages/torch_experiment/libtorch-cxx11-abi-shared-with-deps-2.0.1%2Brocm5.4.2.zip"
TORCH_INSTALL_DEBUG=1
```
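Regarding the `ln -s` note above, a sketch of creating the versioned symlink. `LIB_DIR` and `VERSION` are assumptions taken from my own paths; adjust them to your installation:

```shell
# Hypothetical sketch: expose liblantern.so under a versioned name next to it.
# LIB_DIR and VERSION are assumptions; adjust to your installation.
LIB_DIR="${HOME}/R/x86_64-pc-linux-gnu-library/4.3/torch/lib"
VERSION="0.12.0"
if [ -f "${LIB_DIR}/liblantern.so" ]; then
  ln -sf "${LIB_DIR}/liblantern.so" "${LIB_DIR}/liblantern_${VERSION}.so"
fi
```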