COSMA
COSMA copied to clipboard
Add support for hipBLAS
It would be nice to have support for hipBLAS as this allows to test the hip code path on a Nvidia device.
In return the support for rocBLAS could be dropped.
Since @gsitaraman mentioned here the potential performance benefit of using rocBLAS, I would still hesitate to switch to hipBLAS, unless you believe it's absolutely necessary.
I agree that you should keep rocBLAS as long as there are performance benefits to it. However, maybe you could still add support for hipBLAS alongside? It would allow me to make progress with the HIP integration testing in CP2K.
This should be possible. @AdhocMan, do you think this would be a simple change, or would it require adding a lot of boilerplate code to the gpu-backend?
hipBLAS is just an additional layer on top of rocBLAS and you can compile rocBLAS with cuda. So it should work already, although I haven't tested it.
Would we still have to add hipBLAS as a dependency of COSMA in that case?
No, if rocBLAS works correctly with cuda. There would just be few changes to the CMake config required.
As written here it seems the cmake for the CUDA backend of rocBLAS is broken at the moment:
rocBLAS is written with HiP kernels, so it should build and run on CUDA platforms. However, currently the cmake infrastructure is broken with a CUDA backend. However, a BLAS marshalling library that presents a common interface for both ROCm and CUDA backends can be found with hipBLAS.
I also spent an hour yesterday trying to build rocBLAS for Cuda - without success so far. And even if we get it to work eventually, it will probably be rather brittle because it pulls in many additional dependencies. In comparison hipBLAS is very lightweight.
That's unfortunate, it would have made things easier. I think, adding hipBLAS alongside rocBLAS does not make much sense, since adding a code path, that's only used for testing somewhat defeats it's purpose. So I'd suggest two options:
- Replace rocBLAS with hipBLAS. Looking at the hipBLAS code, I suspect it would not really impact performance.
- Keep rocBLAS and work on adding CI on AMD hardware for COSMA. In light of LUMI, there should be more support for this in the future. For building CP2K with HIP on Nvidia hardware, you could compile COSMA with CUDA (which you then require anyway), since it is standalone.
In both hipBLAS and rocBLAS, there is a heavy overhead in the first DGEMM call. If rocBLAS is used, one could call rocblas_initialize() before any DGEMM call to avoid this overhead in the timing region. We can try to do this in COSMA and shift to a hipBLAS implementation for improving portability. I will work with Marko later this week to get this done.
I already have a version of Tiled-MM with hipblas instead of rocblas: https://github.com/AdhocMan/Tiled-MM/tree/hipblas It compiles fine, I just can't run it at the moment because of an issue with our test system.