COSMA Add support for hipBLAS

It would be nice to have support for hipBLAS, as this would allow testing the HIP code path on an Nvidia device.

In return, support for rocBLAS could be dropped.

Jun 13 '21 09:06 oschuett

Since @gsitaraman mentioned here the potential performance benefit of using rocBLAS, I would still hesitate to switch to hipBLAS, unless you believe it's absolutely necessary.

Jun 14 '21 20:06 kabicm

I agree that you should keep rocBLAS as long as there are performance benefits to it. However, maybe you could still add support for hipBLAS alongside? It would allow me to make progress with the HIP integration testing in CP2K.

Jun 14 '21 21:06 oschuett

This should be possible. @AdhocMan, do you think this would be a simple change, or would it require adding a lot of boilerplate code to the gpu-backend?

Jun 15 '21 13:06 kabicm

hipBLAS is just an additional layer on top of rocBLAS, and you can compile rocBLAS with CUDA. So it should work already, although I haven't tested it.

Jun 15 '21 13:06 AdhocMan

Would we still have to add hipBLAS as a dependency of COSMA in that case?

Jun 15 '21 13:06 kabicm

No, not if rocBLAS works correctly with CUDA. Only a few changes to the CMake config would be required.

Jun 15 '21 13:06 AdhocMan

As written here, it seems the CMake setup for the CUDA backend of rocBLAS is broken at the moment:

rocBLAS is written with HiP kernels, so it should build and run on CUDA platforms. However, currently the cmake infrastructure is broken with a CUDA backend. However, a BLAS marshalling library that presents a common interface for both ROCm and CUDA backends can be found with hipBLAS.

Jun 15 '21 23:06 kabicm

I also spent an hour yesterday trying to build rocBLAS for CUDA, without success so far. And even if we get it to work eventually, it will probably be rather brittle because it pulls in many additional dependencies. In comparison, hipBLAS is very lightweight.

Jun 16 '21 08:06 oschuett

That's unfortunate; it would have made things easier. I think adding hipBLAS alongside rocBLAS does not make much sense, since a code path that's only used for testing somewhat defeats its purpose. So I'd suggest two options:

1. Replace rocBLAS with hipBLAS. Looking at the hipBLAS code, I suspect it would not really impact performance.
2. Keep rocBLAS and work on adding CI on AMD hardware for COSMA. In light of LUMI, there should be more support for this in the future. For building CP2K with HIP on Nvidia hardware, you could compile COSMA with CUDA (which you then require anyway), since it is standalone.

Jun 17 '21 08:06 AdhocMan

In both hipBLAS and rocBLAS, there is a heavy overhead in the first DGEMM call. If rocBLAS is used, one could call rocblas_initialize() before any DGEMM call to keep this overhead out of the timing region. We can try to do this in COSMA and shift to a hipBLAS implementation to improve portability. I will work with Marko later this week to get this done.
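
A minimal sketch of the warm-up idea described above, not COSMA's actual code: the one-time initialization is done before the timer starts, so the first timed call does not absorb it. The macro EXAMPLE_USE_ROCBLAS and the helpers warm_up_blas, time_ms, and time_after_warm_up are hypothetical names introduced only for illustration; rocblas_initialize() is the only rocBLAS call used. Without the macro, the code builds as a host-only fallback.

```cpp
#include <chrono>
#include <utility>

// EXAMPLE_USE_ROCBLAS is a hypothetical macro for this sketch;
// define it only when building against rocBLAS.
#ifdef EXAMPLE_USE_ROCBLAS
#include <rocblas.h>  // newer ROCm releases install this as <rocblas/rocblas.h>
#endif

// One-time library warm-up, meant to run outside any timed region.
// Returns true if a rocBLAS initialization was actually performed.
inline bool warm_up_blas() {
#ifdef EXAMPLE_USE_ROCBLAS
    rocblas_initialize();  // up-front initialization on the current device
    return true;
#else
    return false;  // host-only build: nothing to initialize
#endif
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Warm up first, then time only the work itself.
template <typename F>
double time_after_warm_up(F&& work) {
    warm_up_blas();  // deliberately before the timer starts
    return time_ms(std::forward<F>(work));
}
```

In a benchmark, the callable passed to time_after_warm_up would wrap the DGEMM call plus a device synchronization; that part is left out here because it depends on the backend in use.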

Jun 21 '21 15:06 gsitaram

I already have a version of Tiled-MM that uses hipBLAS instead of rocBLAS (https://github.com/AdhocMan/Tiled-MM/tree/hipblas). It compiles fine, but I can't run it at the moment because of an issue with our test system.

Jun 21 '21 16:06 AdhocMan
