Move sterf to CPU; Add experimental parallelism for sterf
Hi, I've been trying to improve the performance of the SYEVD function lately. The sterf kernel is the most time-consuming part of the code. I tried two approaches to improve it:
- split the tridiagonal matrix based on Gershgorin intervals (similar to ScaLAPACK, https://github.com/Reference-ScaLAPACK/scalapack/blob/master/SRC/slarre2.f);
- move sterf execution to CPU.
The efficiency of the first approach depends on the input values of the tridiagonal matrix and doesn't provide a consistent improvement. The second approach provides significant acceleration in all cases (roughly a 10x speedup).
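For reference, the Gershgorin-based splitting idea can be sketched as follows. This is a simplified host-side illustration, not the actual patch: it splits a symmetric tridiagonal matrix into independent blocks wherever an off-diagonal entry is negligible relative to its neighboring diagonal entries. The function name and the splitting criterion are simplified assumptions; LAPACK's xLARRE-style routines use a more careful relative threshold.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Given the diagonal d (size n) and off-diagonal e (size n-1) of a symmetric
// tridiagonal matrix, return the start index of each independent block.
// An entry e[i] is treated as zero when it is tiny compared with the two
// adjacent diagonal entries, so the eigenproblem decouples at that point.
std::vector<std::size_t> split_points(const std::vector<double>& d,
                                      const std::vector<double>& e,
                                      double tol = 1e-15)
{
    std::vector<std::size_t> starts{0};
    for(std::size_t i = 0; i + 1 < d.size(); ++i)
        if(std::abs(e[i]) <= tol * (std::abs(d[i]) + std::abs(d[i + 1])))
            starts.push_back(i + 1); // block boundary: e[i] treated as zero
    return starts;
}
```

Each resulting block can then be processed independently (e.g., in parallel), which is where the potential speedup comes from; the gain disappears when the input has no negligible off-diagonal entries, matching the observation that this approach is input-dependent.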

I am attaching the relevant patch.
Thanks. I'm working on SYEVD performance myself, but this is a very different approach. I'm not able to review your changes right this moment, but I will take a look as soon as possible. I know @jzuniga-amd has talked about hybrid host/device algorithms before. Thus far, rocSOLVER has been a purely GPU library but I don't think we've ruled out hybrid approaches.
I wonder if an environment variable might be better than a compile-time flag for switching from a pure GPU algorithm to a hybrid algorithm. In general, that sort of decision may depend on the balance of CPU and GPU resources available, which is a very dynamic condition. It's a function of both the hardware capacity and the mix of other jobs running on the same hardware.
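A run-time switch of that sort could be as simple as the following sketch. Note that the variable name `ROCSOLVER_STERF_MODE` and the helper are hypothetical, purely for illustration; nothing like this exists in rocSOLVER's actual interface.

```cpp
#include <cstdlib>
#include <cstring>

enum class SterfMode { Gpu, Hybrid };

// Decide at run time whether to use the GPU-only or the hybrid path.
// ROCSOLVER_STERF_MODE is a hypothetical environment variable used only
// for illustration, not part of the real rocSOLVER interface.
SterfMode sterf_mode_from_env()
{
    const char* v = std::getenv("ROCSOLVER_STERF_MODE");
    if(v && std::strcmp(v, "hybrid") == 0)
        return SterfMode::Hybrid;
    return SterfMode::Gpu; // default preserves the current GPU-only behavior
}
```

The advantage over a compile-time flag is that the same binary can adapt to the hardware balance and workload mix at deployment time, without rebuilding the library.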
Does anyone know how to access logs for failing builds btw? It seems they are on some internal AMD CI server.
It looks like there are failures in several functions unrelated to sterf. This seems to be affecting other PRs too, so it doesn't appear to have been caused by your changes.
I'm afraid it's not possible for community members to see the logs. If it were an actual issue, I'd copy the output into this thread. However, the cause of the failure has nothing to do with your changes.
rocSOLVER PRs are tested using the latest build of the corresponding rocBLAS branch and it seems there's been a bug in syr2k on the rocBLAS develop branch. The rocBLAS team is aware of the problem and they'll fix it before release, but it will cause failures in the rocSOLVER CI until it is fixed.
I believe I addressed all code review remarks. Let me know if anything else is needed.
The team is still discussing the best ways to introduce pure-CPU or hybrid algorithms into the library. There is nothing wrong with offloading computations to the CPU (especially when the problem/algorithm is not very suitable for GPUs), either by calling internal code (sterf_cpu) or linking to another CPU library (lapack_sterf), but the change must be introduced carefully.
Currently, rocSOLVER is a 100% GPU library; calls to rocSOLVER APIs are asynchronous on the host; they can return immediately even if the computations on the device are not done yet and, in practice, no CPU cycles are used. This behavior is expected by some users and changing it could be problematic for some workflows, especially if the switch is embedded in the build process (at compile time). Allowing users to purposely switch between CPU, GPU, or hybrid modes at run time could provide more flexibility, but we need to carefully plan the design as this is also related to other upcoming features in our roadmaps.
(The part of this PR that optimizes/parallelizes the GPU code would be easier to review and merge at this time, and we can leave the CPU/hybrid changes for a future PR, but it is up to you whether you want to divide your contribution into two different PRs.)
Normally, when we add a new function to the library, we initially focus on the correctness of the algorithm (accuracy of results, etc.) and the functionality of the API. For the optimization round, we want the performance gain to genuinely justify introducing possibly more complicated code, along with its implications for code maintenance. It is clear that the sterf_cpu code performs better on the tested cases, but there are other questions I would like to explore as well:
- Is there any difference between the in-house sterf_cpu code and lapack_sterf? Which one is faster?
- The CPU options run batch problems in a sequential for-loop; is the gain in performance enough to outperform the GPU code that runs the batch in parallel? And what about the GPU code with optimizations? If not, we should find and add a switch size for the batched and strided_batched routines, or limit the CPU code to normal non-batched executions (especially if the CPU code options are only enabled/disabled at compile time).
- What is the effect of the optimizations/parallelization on the GPU code? How does it compare against the original code and against sterf_cpu or lapack_sterf? Is the performance gain enough to justify merging only the optimized GPU code for now?
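The switch-size idea mentioned above could look something like this minimal sketch. The function name and threshold are hypothetical placeholders, not rocSOLVER code; the actual crossover would have to be tuned from benchmark data across matrix sizes and batch counts.

```cpp
#include <cstdint>

// Hypothetical dispatch heuristic: route small batches to the sequential
// CPU path and keep large batches on the GPU, where they run in parallel.
// The crossover value is an assumed tuning parameter, not a measured one.
bool use_cpu_sterf(std::int64_t batch_count)
{
    constexpr std::int64_t cpu_batch_limit = 8; // assumed tuning parameter
    return batch_count <= cpu_batch_limit;
}
```

A refinement could also factor in the matrix size n, since the CPU advantage for a single sterf call may shrink or reverse once enough independent problems are available to saturate the GPU.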
We also need to think of a rationale that justifies the proposed approach at a higher level in a workflow. Advanced users may know whether their problems are better suited for CPU or GPU computations, and there is nothing preventing them from using different libraries according to their needs; rocSOLVER is not intended to be a single-entry-point, fit-all-cases solution. If a user doesn't get enough performance out of a GPU solution, they can directly call a CPU library instead (an AMD CPU library or others). Right now, with the memory model that we are using, I don't see why a user would want to transfer data to the GPU in order to call a rocSOLVER API that will simply copy the data back to the host to perform the computations on the CPU (and this is essentially what rocsolver_sterf will end up doing).
A scenario like this (with internal data transfers) makes more sense with a truly hybrid code, i.e. code that processes the data via GPU-CPU collaboration, which is not the case for the proposed rocsolver_sterf. Another scenario that, IMO, may justify the use of the pure-CPU rocsolver_sterf is when the data is already on the GPU, which is the case for syevd, for example. rocsolver_syevd needs the data on the GPU to perform the initial tri-diagonalization before calling the pure-CPU routines sterf_cpu or lapack_sterf. These kinds of "hybrid" functions could make sense but, as I said, we just need to think of the best way to integrate them and, especially, how to document them in the Users Guide.
In the meantime, please note that syevd is not the only function that calls sterf. Sterf could be called by rocsolver_syev and rocsolver_stedc as well. These functions could then be "hybrid" in the same sense and take advantage of the more efficient sterf implementation. So, they may need the same amendments as rocsolver_syevd (in particular, looking only at the optimizations to the GPU code, I am not sure whether rocsolver_syev or rocsolver_stedc will work if the library is built in EXPERIMENTAL mode, since the workspace requirements of sterf are changing). This is something that would need to be addressed, either in this PR or a future one.
Hi @mdvizov. Thank you for your patience as we've considered if and how to include CPU methods in rocSOLVER.
With commit 79587270f091541d39d8a69272047559a7d5fd49 we added our very first hybrid (CPU+GPU) method to rocSOLVER and added some infrastructure to switch between hybrid mode and GPU-only mode. Since this PR is quite old and doesn't use the new infrastructure, I took the liberty of recreating aspects of your changes by adding hybrid support to STERF in commit 1b5c58b0e29041960cab73594d9c867beccdfa8d, and the team is hard at work on improving the GPU-only performance of SYEVD. As a result, I'll be closing this PR as these changes are no longer necessary.