
Run Nsight on the code base

Open MarionBWeinzierl opened this issue 5 months ago • 13 comments

Profile the code base, using the example runs from #259, with Nsight.

Tasks

  • [x] Get system overview for LOBSTER run with ~~512x512x256~~ (256x256x64) grid
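
For reference, a minimal sketch (assuming the NVTX.jl package and a hypothetical `run_lobster.jl` driver script) of how a run can be annotated so that the region of interest shows up as a named range in the Nsight Systems timeline:

```julia
# Hypothetical driver excerpt: wrap the part of the run we care about in an NVTX
# range so it is easy to find in the Nsight Systems timeline. Launched with
# something like (flags to be double-checked):
#   nsys profile --trace=cuda,nvtx --output=lobster_256 julia --project run_lobster.jl
using NVTX

NVTX.@range "LOBSTER time stepping" begin
    # run!(simulation) for the 256x256x64 LOBSTER case would go here
end
```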

MarionBWeinzierl avatar Jul 08 '25 13:07 MarionBWeinzierl

Suggested runs (@AdelekeBankole):

  • check I/O influence (see the sketch after this list)
  • scaling with the number of grid points
  • scaling with the number of particles (keeping the problem size fixed)
  • create plots
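
For the I/O item, a minimal sketch of the kind of toggle that could go into the test case (the writer name, filename, and schedule are placeholders; depending on the Oceananigans version the writer type may be called JLD2OutputWriter or JLD2Writer):

```julia
# Hypothetical helper for the I/O-influence runs: attach an output writer to an
# existing simulation only when requested, so the same script can be timed with
# and without IO. `simulation` and `model` are assumed to come from the
# big_LOBSTER setup; the writer type may differ between Oceananigans versions.
using Oceananigans
using Oceananigans.Units: minutes

function maybe_add_output!(simulation, model; with_io = false)
    with_io || return simulation
    simulation.output_writers[:tracers] =
        JLD2OutputWriter(model, model.tracers;
                         filename = "lobster_fields.jld2",
                         schedule = TimeInterval(30minutes))
    return simulation
end
```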

MarionBWeinzierl avatar Aug 08 '25 15:08 MarionBWeinzierl

To help with the initial scaling study I made some modifications to the big_LOBSTER test case (so that the number of grid points and particles can be changed), together with a script to run multiple cases and collect basic Julia metrics (runtimes, allocations, GC, etc.) in a CSV file. I have pushed them to the profiling101 branch in this commit.

It should allow us to create basic plots of 'problem size' vs 'runtime'. I am currently trying to run one on CSD3, on CPU with a single core, for grid sizes ranging from x1/4 to x8. I will try to do the same for the GPU later in the week.
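
For reference, a minimal sketch of the kind of sweep loop used, where `run_case` is a placeholder standing in for the modified big_LOBSTER entry point:

```julia
# Hypothetical sweep driver: run the case at several grid-scale factors, record
# wall time / allocations / GC time with @timed, and write them to a CSV file.
# `run_case` is a placeholder for the modified big_LOBSTER entry point.
run_case(scale) = sum(rand(round(Int, 1_000_000 * scale)))  # stand-in workload

open("scaling.csv", "w") do io
    println(io, "scale,time_s,allocated_bytes,gc_time_s")
    for scale in (0.25, 0.5, 1.0, 2.0, 4.0, 8.0)
        stats = @timed run_case(scale)
        println(io, "$scale,$(stats.time),$(stats.bytes),$(stats.gctime)")
    end
end
```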

Mikolaj-A-Kowalski avatar Aug 18 '25 13:08 Mikolaj-A-Kowalski

The question we want to answer is: Where are the bottlenecks, where does the code spend time?

MarionBWeinzierl avatar Aug 29 '25 15:08 MarionBWeinzierl

Another question to answer is: How much time is spent transferring between GPU and CPU?

MarionBWeinzierl avatar Aug 29 '25 15:08 MarionBWeinzierl

It turns out that we have been running NSight on CSD3 slightly wrong. The workflow that was working at the beginning of August stopped working recently, and the reason turned out to be the CUDA 13 release on the 5th of August!

My understanding of the problem and our solution is the following (but I cannot guarantee that it is 100% correct):

By default, CUDA.jl pulls its own CUDA toolkit, compiler, and other binary dependencies from the special CUDA_XXX_jll.jl packages hosted in the JuliaBinaryWrappers GitHub organisation. However, the Nsight we were using for profiling comes from the CUDA 12.1 toolkit installed on CSD3. Previously, since CUDA.jl was pulling packages compatible with version 12.9, this was not ideal, but it still worked with the profiler from 12.1. Recently, however, the packages that CUDA.jl pulls were upgraded to version 13, which broke compatibility with the profiler: it started generating a lot of errors, and the trace results were nonsensical (e.g. 2% of the calculation time spent in execution on the GPU!).

The fix is to make CUDA.jl use the CUDA toolkit version installed on CSD3. To do that, in the project that contains the calculation, one needs to do the following (assuming module load cuda/12.1 has been run):

  1. Add this entry to the Project.toml:

     [extras]
     CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"

  2. Create a new file LocalPreferences.toml with the following contents:

     [CUDA_Runtime_jll]
     __clear__ = ["local"]
     version = "12.1"

Apparently in this configuration CUDA.jl still fetches the CUDA toolkit from the _jll packages, as opposed to using the locally installed one... but at least the version is compatible. Nsys executes without errors and the trace results make sense!
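
As an aside, CUDA.jl also has a helper that writes the same preference from the REPL (if I remember its API correctly); a sketch, to be checked against the CUDA.jl documentation:

```julia
# Hypothetical alternative to editing the TOML files by hand: ask CUDA.jl to pin
# the runtime version (this should write LocalPreferences.toml for the active
# project; a Julia restart is needed afterwards).
using CUDA

CUDA.set_runtime_version!(v"12.1")

# Newer CUDA.jl versions also accept a keyword to prefer the locally installed
# toolkit instead of the artifact; check the docs for the exact name.

# After restarting Julia, confirm which runtime is actually in use:
CUDA.versioninfo()
```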

Mikolaj-A-Kowalski avatar Sep 11 '25 12:09 Mikolaj-A-Kowalski

The raw Nsys results are now on GoogleDrive in the Nsys-system-results/WorkingTraces directory. The CSV summaries of the GPU runtimes (all kernels + memory transfers) are here (with and without IO):

Please note that the names of the kernels are long (i.e. the file does not open well in LibreCalc due to too many characters in a cell!).

A short summary is:

The question we want to answer is: Where are the bottlenecks, where does the code spend time?

In the gpu_compute_Gc_..., gpu_compute_Gu_... and gpu_compute_Gw_... kernels. We will need to grep a bit to find out what they are and where they live in the source code. EDIT2: ~~EDIT: Here apparently 😉~~ In the locations indicated in the @jagoosw post below (not in the shallow water model, which we are not using 🙃)

Another question to answer is: How much time is spent transferring between GPU and CPU?

Not much. In the most involved case of the run with IO, the cumulative time for memory transfers is:

  • 2.8% of total GPU time: device-to-host transfers (basically data fetched to the host to do IO)
  • 0.4% of total GPU time: device-to-device transfers
  • <0.1% of total GPU time: host-to-device transfers
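
To work around the long kernel names in the CSV summaries, a small post-processing sketch (the column layout of the export is an assumption and should be checked against the actual files):

```julia
# Hypothetical post-processing of the exported kernel-summary CSV: group total
# GPU time by a short kernel-name prefix (e.g. "gpu_compute_Gc") so the very
# long mangled names stay readable. Column indices are assumptions about the
# export format; adjust them to match the actual header.
using DelimitedFiles

function kernel_time_by_prefix(path; name_col = 1, time_col = 2)
    data, _ = readdlm(path, ',', Any; header = true)
    totals = Dict{String, Float64}()
    for row in eachrow(data)
        name = String(row[name_col])
        prefix = join(split(name, '_')[1:min(end, 3)], '_')  # crude shortening
        totals[prefix] = get(totals, prefix, 0.0) + Float64(row[time_col])
    end
    return sort(collect(totals); by = last, rev = true)
end
```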

Mikolaj-A-Kowalski avatar Sep 11 '25 13:09 Mikolaj-A-Kowalski

Since you're using the non-hydrostatic model, it's here: https://github.com/CliMA/Oceananigans.jl/blob/main/src/Models/NonhydrostaticModels/compute_nonhydrostatic_tendencies.jl and the functions called from compute_G... are here: https://github.com/CliMA/Oceananigans.jl/blob/main/src/Models/NonhydrostaticModels/nonhydrostatic_tendency_kernel_functions.jl

jagoosw avatar Sep 11 '25 16:09 jagoosw

Note from the meeting today: @johnryantaylor suggested switching off parts of the physics to see if/how it influences the performance of the compute_Gc kernel. The things to check are:

  • [ ] Carbon-chemistry physics (surface fluxes)
  • [ ] Light physics

I hope I did not misinterpret anything 🤞
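
A sketch of what a cut-down setup for these experiments might look like; the carbonates keyword is from memory and should be checked against the LOBSTER docstring, and the grid here is just an illustrative placeholder:

```julia
# Hypothetical cut-down LOBSTER setup for the "switch parts of the physics off"
# experiments. Keyword names are from memory; check the LOBSTER docstring.
using Oceananigans, OceanBioME

grid = RectilinearGrid(CPU(); size = (64, 64, 16), extent = (1000, 1000, 100))

# Baseline: LOBSTER with the carbonate-chemistry tracers enabled.
full_bgc = LOBSTER(; grid, carbonates = true)

# Experiment: drop the carbon-chemistry tracers (and their surface fluxes).
no_carbon_bgc = LOBSTER(; grid, carbonates = false)

# The light-attenuation component is configured separately; check the OceanBioME
# docs for the current keyword before switching it out.
```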

Mikolaj-A-Kowalski avatar Sep 12 '25 11:09 Mikolaj-A-Kowalski

Also, the next steps for investigating the kernels will be to run them under NSight Compute and see what causes the low SM occupancy and low SM instruction-issue metrics seen in the NSight Systems trace.

To learn the tool we will probably start with the scale_for_neg kernel, since it is much simpler.
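
One way to keep the NSight Compute captures small (a sketch; the profiler flags should be double-checked) is to mark only a couple of time steps for profiling from the Julia side:

```julia
# Hypothetical way to limit profiling to a small region: with external=true,
# CUDA.@profile only starts/stops the external profiler around this block, so a
# launch such as
#   ncu --profile-from-start off -k scale_for_neg julia --project run_lobster.jl
# (flags to be double-checked) captures just the marked time steps.
using CUDA

CUDA.@profile external=true begin
    # a couple of time steps of the simulation would go here,
    # e.g. time_step!(simulation) called a few times
end
```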

Mikolaj-A-Kowalski avatar Sep 12 '25 11:09 Mikolaj-A-Kowalski

The light integration doesn't occur in the compute_Gc kernel but during the update_state step.

jagoosw avatar Sep 12 '25 13:09 jagoosw

Thanks @jagoosw. Mikolaj found that the GPU utilization is pretty low during the compute_Gc kernel. I thought that it might be useful to progressively simplify the LOBSTER model to see if certain functions aren't being distributed very evenly across the GPU. Can you think of good places to look?

johnryantaylor avatar Sep 12 '25 14:09 johnryantaylor

Maybe we can close this issue, since we now have a workflow for running NSight on OceanBioME on CSD3. The kernels that we should target for optimisation now have separate issues.

Mikolaj-A-Kowalski avatar Oct 03 '25 16:10 Mikolaj-A-Kowalski

Perhaps you could write up how to do it and convert this issue to a discussion?

jagoosw avatar Oct 03 '25 17:10 jagoosw