Convert DG-RePlAce algorithm to Kokkos
This MR converts the DG-RePlAce algorithm, originally written for CUDA, to Kokkos.
Kokkos provides an abstraction for writing parallel code that can be translated into several backends, including CUDA, OpenMP and C++ threads.
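For readers less familiar with Kokkos, here is a minimal sketch (illustrative only, not code from this MR) of what backend-portable code looks like; the same source builds for CUDA, OpenMP or C++ threads depending on how Kokkos is configured:

```cpp
#include <Kokkos_Core.hpp>

// Illustrative only (not from this MR): one kernel, written once, runs on
// whichever backend Kokkos was configured with (CUDA, OpenMP, C++ threads).
int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views are allocated in the memory space of the default execution
    // space: device memory for the CUDA backend, host memory otherwise.
    Kokkos::View<float*> x("x", n), y("y", n);

    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 2.0f * x(i) + y(i);
    });

    float sum = 0.0f;
    Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(const int i, float& acc) {
      acc += y(i);
    }, sum);
  }
  Kokkos::finalize();
  return 0;
}
```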
Tested on a single run with an RTX 3090 and an i7-8700 CPU @ 3.20GHz using the ariane133 design.
| | original placer | CUDA implementation | Kokkos (CUDA backend) | Kokkos (OpenMP backend) | Kokkos (Threads backend) |
|---|---|---|---|---|---|
| ariane133 global place time | 11:27.39 | 0:57.70 | 1:33.49 | 3:24.12 | 6:08.94 |
Earlier it was reported the runtime difference to be minimal but 0:57.70 vs 1:33.49 is more substantial. Is this expected?
Earlier measurements were done when some parts were still using native CUDA, and with a different design (black-parrot).
These measurements are from a single run on a local machine that was being used for other things at the same time, so they are not very accurate.
I'd expect it should be possible to achieve a similar runtime using Kokkos. These results might suggest that there are some unnecessary memory copies between host and device, but this needs to be investigated further.
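If the host/device copy suspicion turns out to be right, the usual pattern to look for is a deep_copy between a device View and a host mirror inside the iteration loop. A hypothetical sketch of that pattern (function and View names are placeholders, not from this MR):

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical illustration of the kind of host/device traffic to look for.
// create_mirror_view() returns the View itself when it is already
// host-accessible, so the two deep_copy calls below are real transfers only
// on the CUDA backend; if this sits inside the main optimization loop, it
// shows up exactly as the suspected unnecessary copies.
void update_on_host(Kokkos::View<float*> density)  // placeholder name
{
  auto host_density = Kokkos::create_mirror_view(density);
  Kokkos::deep_copy(host_density, density);  // device -> host

  for (size_t i = 0; i < host_density.extent(0); ++i) {
    host_density(i) *= 0.5f;  // placeholder host-side work
  }

  Kokkos::deep_copy(density, host_density);  // host -> device
}
```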
Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.
Do all the various versions produce the same result? That is also important.
What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).
Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.
I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion. I think Kokkos or something like it is the only viable path forward. The runtime differences don't look significant compared to the overall speedup achieved.
We're going for a pragmatic path forward, and to me this meets my bar for the goals we set out.
Do all the various versions produce the same result? That is also important.
Agree that this is important to check. We may need to order the floats to get identical/sufficiently similar results.
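To make the ordering point concrete, a small standalone example (not tied to this MR) of why a parallel float reduction can differ from a sequential sum of the same data:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

// Floating-point addition is not associative, so the grouping chosen by a
// parallel reduction can change the last bits of the result compared to a
// fixed left-to-right summation. Fixing the order (or comparing with a
// tolerance) is what makes results reproducible across backends.
int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 22;
    Kokkos::View<float*> v("v", n);
    Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
      v(i) = 1.0f / (1.0f + static_cast<float>(i));
    });

    float parallel_sum = 0.0f;
    Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(const int i, float& acc) {
      acc += v(i);
    }, parallel_sum);

    auto h = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, v);
    float serial_sum = 0.0f;
    for (int i = 0; i < n; ++i) serial_sum += h(i);

    std::printf("parallel %.9g vs serial %.9g\n", parallel_sum, serial_sum);
  }
  Kokkos::finalize();
  return 0;
}
```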
I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion.
You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor.
A 50% overhead is worth exploring to at least understand if not eliminate.
You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor.
I think that seems like the right move at this point. With more time and context I don't think it's viable for us to maintain two codebases.
A 50% overhead is worth exploring to at least understand if not eliminate.
+1. I just want to point out that if this is the fastest we can go, that seems fast enough for me.
Do all the various versions produce the same result? That is also important.
No, they don't, and it was quite surprising, as I expected the original code and Kokkos with the CUDA backend to produce the same result.
We investigated this and it turned out to be because Kokkos passes all files that depend on it through nvcc_wrapper. This wrapper converts host compiler (g++) options to nvcc options and uses nvcc to compile all Kokkos-dependent sources. This is done to allow device code in a single .cpp file instead of a separate .cu file.
NVCC should do the pre-processing and compilation of device code, produce the CUDA binary, and leave the host code to the host compiler.
We checked that when nvcc is used to compile InitialPlace, Eigen's `solveWithGuess` returns different results on exactly the same inputs compared to using g++ directly.
I suspect that this issue isn't only related to Eigen: when I disabled initial placement, the runtimes of Kokkos and the original code were almost the same, but the results were still different (I haven't investigated the reason for this).
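For anyone who wants to reproduce the comparison, a reduced standalone sketch of the same kind of call follows; the solver, preconditioner and scalar type here are assumptions for illustration, not copied from InitialPlace. The idea is to build it once with g++ and once through nvcc/nvcc_wrapper and diff the printed numbers:

```cpp
#include <Eigen/SparseCore>
#include <Eigen/IterativeLinearSolvers>
#include <iostream>
#include <vector>

// Hypothetical reduced test case: solve A x = b with BiCGSTAB starting from
// an initial guess, so that binaries built by g++ and by nvcc_wrapper can be
// compared on identical inputs. The matrix is a simple tridiagonal stand-in,
// not data from the placer.
int main()
{
  const int n = 1000;
  Eigen::SparseMatrix<float> A(n, n);
  std::vector<Eigen::Triplet<float>> entries;
  for (int i = 0; i < n; ++i) {
    entries.emplace_back(i, i, 4.0f);
    if (i + 1 < n) {
      entries.emplace_back(i, i + 1, -1.0f);
      entries.emplace_back(i + 1, i, -1.0f);
    }
  }
  A.setFromTriplets(entries.begin(), entries.end());

  Eigen::VectorXf b = Eigen::VectorXf::Ones(n);
  Eigen::VectorXf guess = Eigen::VectorXf::Zero(n);

  Eigen::BiCGSTAB<Eigen::SparseMatrix<float>, Eigen::IdentityPreconditioner> solver;
  solver.setMaxIterations(100);
  solver.compute(A);
  Eigen::VectorXf x = solver.solveWithGuess(b, guess);

  std::cout.precision(9);
  std::cout << "iterations=" << solver.iterations()
            << " error=" << solver.error()
            << " x[0]=" << x[0] << std::endl;
  return 0;
}
```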
What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).
kokkos-fft is a header-only interface library that translates FFT calls into the proper backend by detecting the enabled backends in Kokkos, but I agree that, if preferred, both kokkos and kokkos-fft could be dependencies.
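For context, this is roughly what a kokkos-fft call looks like, based on its public examples; the header name and argument details here are from memory and should be double-checked against the kokkos-fft documentation:

```cpp
#include <Kokkos_Core.hpp>
#include <KokkosFFT.hpp>

// Rough sketch of using kokkos-fft: the same call dispatches to the FFT
// backend matching the enabled Kokkos backend (e.g. cuFFT for CUDA, FFTW on
// host); exact API details should be verified against the kokkos-fft docs.
int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    using complex_t = Kokkos::complex<float>;
    const int nx = 256, ny = 256;

    Kokkos::View<complex_t**> in("in", nx, ny);
    Kokkos::View<complex_t**> out("out", nx, ny);

    Kokkos::DefaultExecutionSpace exec;
    KokkosFFT::fft2(exec, in, out);   // forward 2D FFT on the active backend
    KokkosFFT::ifft2(exec, out, in);  // inverse transform
    exec.fence();
  }
  Kokkos::finalize();
  return 0;
}
```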
A 50% overhead is worth exploring to at least understand if not eliminate.
I think this overhead is due to the different initial placement; when initial placement is disabled, the runtime is very similar:
| | CUDA implementation | Kokkos (CUDA backend) |
|---|---|---|
| ariane133 global place time without initial placement | 0:55.52 | 0:58.25 |
I also did more precise measurements using an RTX 3080, an 8-vCPU i9-12900 @ 2.42 GHz and 32GB of RAM, with 10 runs of the ariane133 design:
| | min time [min] | avg time [min] | med time [min] | max time [min] |
|---|---|---|---|---|
| CUDA implementation | 0:45 | 0:48 | 0:47 | 0:53 |
| Kokkos (CUDA backend) | 1:53 | 1:57 | 1:57 | 2:00 |
| Kokkos (OpenMP backend) | 1:50 | 2:04 | 1:54 | 2:37 |
| Kokkos (threads backend) | 3:42 | 3:43 | 3:43 | 3:45 |
Thanks for the analysis. It would be good to get to the bottom of the difference as it will make regression testing hard otherwise. Is nvcc calling g++ with different flags?
Is nvcc calling g++ with different flags?
The arguments that are passed to nvcc, and that nvcc should pass on to g++, are the same.
I haven't yet investigated how (with what flags) g++ is invoked from nvcc.
Another possibility is that it is invoking a different g++ binary from another path.
Converted to a draft due to no progress.
I've rebased this branch onto the latest master and started resolving the mentioned issues:
- Eigen's `solveWithGuess()` behaves differently on the Kokkos branch (with a suggestion that this is caused by nvcc_wrapper, the part of Kokkos responsible for redirecting compilations not pertaining to CUDA to the host compiler):
I've found that not to be the case. Earlier, I recreated the same condition (where Eigen was running slowly) using clang++ as the Kokkos compiler and confirmed that nvcc_wrapper was not used in that case. The problem was that Eigen, on detecting CUDA availability, was trying to use it. Nevertheless, I saw no peak in GPU usage while initial_place was running, so I disabled it and saw the numbers return to baseline (the same as in the CUDA-native implementation); one possible way to pin Eigen to its host path is sketched after this list.
- What is the performance difference between Kokkos and CUDA-native implementations?
To prioritize merging of GPU-accelerated placement, the focus was to get the branch issue-free before optimizing. In my testing, Kokkos-based algorithm on black-parrot spends about 10 seconds in libcuda.so, whereas the CUDA-native implementation spends around 5. All other timings are comparable, making the entire run about 5 seconds longer.
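As mentioned in the Eigen item above, one possible way to keep Eigen on its host code path even when the translation unit goes through nvcc/nvcc_wrapper is Eigen's EIGEN_NO_CUDA define; this is a suggestion for how such a fix could look, not necessarily what was done on the branch:

```cpp
// Hypothetical: force Eigen onto its host implementation even under
// nvcc/nvcc_wrapper by defining EIGEN_NO_CUDA before any Eigen header is
// included (equivalently, pass -DEIGEN_NO_CUDA on the compile line).
#define EIGEN_NO_CUDA
#include <Eigen/Dense>
#include <Eigen/Sparse>
```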
Future / subsequent work:
- Make Kokkos a submodule: Due to varying conditions on host machines, most Kokkos libraries available as a package ship without either CUDA or OMP support. Having a dependency that has to be manually compiled and configured correctly to get a functioning and fast implementation might introduce complexity for the end user. Therefore, I suggest not migrating `kokkos-fft` to be a dependency and using `kokkos`, which is already cloned as a submodule of `kokkos-fft`, as an in-tree library. The issue I'm currently facing is that internal deprecations of CMake symbols are triggered when Kokkos' compilation is launched as a child project rather than the parent.
- Optimize memory accesses and the Kokkos implementation itself: I've confirmed that memory copying is one of the causes of the algorithm being slower, and fixes are in development, waiting for the more pressing issues to be resolved.
I added a configuration option to etc/Build.sh, `-use_gpl2`, that includes the gpl2 subdirectory and launches the compilation of Kokkos via kokkos-fft in CMake. I additionally tied the existing `-gpu` flag of the build script to enabling the CUDA backend in Kokkos.
I would prefer to see kokkos as part of the dependency installer rather than as a submodule. There should be no need to compile it for each workspace on a machine.
With the current setup, it would be possible to support both compilation schemes, with priority given to the DependencyInstaller: if a system-wide Kokkos installation is detected, it will be used during compilation. I would suggest keeping the possibility of using in-tree Kokkos and kokkos-fft (if kokkos-fft were also moved to be downloaded via the DependencyInstaller), as the script is tailored only towards Ubuntu users. If a system-wide package is not detected, both dependencies can be installed via FetchContent and built in-tree.
If someone wants to put a local copy in-tree that's fine but I'd like to avoid having a submodule.
I'll add support for kokkos and kokkos-fft via the DependencyInstaller then. The submodule could be deleted while keeping in-tree support: if a system-wide package is absent, CMake would handle the download via the FetchContent directive, and the build would have conditionals in place to link correctly.
I've added nested parallelism to the most time-consuming kernel, computeBCPosNegKernel. After rebasing both branches onto the same base commit, the performance results for the black-parrot design with the CUDA backend are as follows (the nested-parallelism pattern itself is sketched after the numbers):
- CUDA-native: 24.606 seconds (total time: 114.50 s, skipped initial place: 94.49 s)
- Kokkos: 23.614 seconds (total time: 114.42 s, skipped initial place: 95.07 s)
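For reference, the nested-parallelism construct in question is Kokkos' hierarchical TeamPolicy / TeamThreadRange pattern; the sketch below is a generic example of that pattern with placeholder names, not the actual computeBCPosNegKernel:

```cpp
#include <Kokkos_Core.hpp>

// Generic sketch of Kokkos hierarchical (nested) parallelism: each team
// handles one outer work item and the team's threads cooperate on the inner
// loop via TeamThreadRange. Names and sizes below are placeholders.
void nested_parallel_example(Kokkos::View<float**> data,
                             Kokkos::View<float*> row_sums)
{
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  const int num_rows = static_cast<int>(data.extent(0));
  const int num_cols = static_cast<int>(data.extent(1));

  Kokkos::parallel_for(
      "row_sums", team_policy(num_rows, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        const int row = team.league_rank();
        float sum = 0.0f;
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(team, num_cols),
            [&](const int col, float& acc) { acc += data(row, col); },
            sum);
        Kokkos::single(Kokkos::PerTeam(team), [&]() { row_sums(row) = sum; });
      });
}
```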
Additionally, a concern was raised with regard to the non-deterministic results returned from Kokkos depending on the compute device used for processing. To validate the flow, each variant was run from synthesis to the final step. While it's true that those results vary, they have minimal impact on the actual parameters of the finished flow. Additionally, the results are deterministic on a per-device basis, even when the compute device is calculating under heavy external load (especially applicable to GPUs).
Test subjects were:
- `master` branch commit `7e0fce872123`, as baseline and base for the other branches
- `cuda-native`, the original CUDA-native implementation, rebased onto the same base as the other branches
- `kokkos-cpu`, the Kokkos-based flow, ran on the `OpenMP` backend
- `kokkos-gpu`, the Kokkos-based flow, ran on the `CUDA` backend
Metrics collected were taken from the final report and log, and were:
- Total Negative Slack (`tns`)
- Worst Negative Slack (`wns`)
- Total power
- Design area and utilization
Results:
| Branch | TNS | WNS | Design area, utilization | Total Power |
|---|---|---|---|---|
| `master` | -2.42 | -2.42 | 760397 u^2, 45% utilization | 2.57e-01 W |
| `cuda-native` | -2.40 | -2.40 | 753511 u^2, 44% utilization | 2.49e-01 W |
| `kokkos-cpu` | -2.49 | -2.49 | 753608 u^2, 44% utilization | 2.50e-01 W |
| `kokkos-gpu` | -2.44 | -2.44 | 753674 u^2, 44% utilization | 2.50e-01 W |
Very nice! How is the cpu vs gpu runtime with your latest changes? Is this ready for review?
Yes, it's ready for review. I've applied the suggested clang-tidy fixes and added the missing RockyLinux9 package.
The performance difference between the CUDA and OpenMP backends on black_parrot is:
- CUDA: 85.38 s (`dg_global_place` call time: 20.46 s)
- OpenMP: 96.58 s (`dg_global_place` call time: 29.83 s)

The test setup is an Intel i7-8700 and an NVIDIA GTX 1080Ti.
What is the current gpl time on that design & system?
Currently, the mainline gpl runs the global_place call in 653.90 s, with the total run time being 729.30 s.
Hi @sgizler, is this PR ready for review? It is my understanding that the latest status of this PR is that the performance is consistent with the native CUDA implementation but there is a small amount of variability run-to-run. I discussed with @maliberty that if the amount of PPA variability is low, that is acceptable and expected. Please let me know if that status is correct.
Small differences due to numerics are expected
There is no variability run-to-run if you execute it on the same machine.
Before, there was some variability when you changed the number of threads or enabled/disabled CUDA. In the worst case I checked (nangate45/ariane133), the variability fix incurred an ~18% time penalty on the CUDA-enabled run, while speeding up the non-CUDA run by about 14%. On other designs, like black-parrot, the difference was about 3% or less.
There might still be small variations when moving to an entirely different CPU (i.e. one not supporting some newer SIMD instructions, or an entirely different architecture), because the underlying FFT library chooses an implementation fine-tuned for certain CPUs.
I think the PR is ready for review.
I'll take a look. Please address formatting & tcl-lint.
@vvbandeira please review the dependency install & cmake related changes
If the gpl2 compilation is optional, the installation of its dependencies should be too, especially since the script will install NVIDIA packages while some users might have AMD graphics; GPU support on Linux does not have a reputation for being robust, to my knowledge. Some checks in this regard would be nice.
@kamilrakoczy any update? The last comments are unanswered