
Update CMake to support newer GPU architectures

Open samuelpmish opened this issue 5 months ago • 10 comments

Summary

I was trying out amr-wind earlier and found its CMake build was unable to configure for Blackwell architecture GPUs:

cmake . -Bbuild [...] -DCMAKE_CUDA_ARCHITECTURES=100
...
...
CMake Error at /usr/share/cmake-4.0/Modules/FindCUDA/select_compute_arch.cmake:245 (message):
  Unknown CUDA Architecture Name 10.0 in CUDA_SELECT_NVCC_ARCH_FLAGS
Call Stack (most recent call first):
  Tools/CMake/AMReXUtils.cmake:265 (cuda_select_nvcc_arch_flags)
  Tools/CMake/AMReXParallelBackends.cmake:99 (set_cuda_architectures)
  Src/CMakeLists.txt:40 (include)

It seems that the underlying cause was not in amr-wind itself, but in AMReX's use of some deprecated CMake CUDA features.

This PR makes a small change to the CMake build system to avoid those deprecated features, so that AMReX can compile with Hopper and Blackwell architecture GPUs. The configuration behavior is as follows:

cmake . -DAMReX_GPU_BACKEND=CUDA will default to the "native" option (which selects architecture based on the hardware present in the machine)

cmake . -DAMReX_GPU_BACKEND=CUDA -DCMAKE_CUDA_ARCHITECTURES=100 builds for the explicitly-specified architecture(s)
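
The default described above could be expressed roughly as follows. This is a minimal sketch, not the actual diff of this PR. It assumes CMake 3.24 or newer (the first release that accepts `native` as a value of `CMAKE_CUDA_ARCHITECTURES`) and that the default is chosen before the CUDA language is enabled. The project name is hypothetical.

```cmake
cmake_minimum_required(VERSION 3.24)

# Choose a default only when the user gave no explicit selection.
# -DCMAKE_CUDA_ARCHITECTURES=... defines the variable, and CMake itself
# reads the CUDAARCHS environment variable, so both are left untouched.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
    # "native" targets the GPUs detected on the build machine.
    set(CMAKE_CUDA_ARCHITECTURES native)
endif()

# Hypothetical project name; enabling CUDA here picks up the value above.
project(arch_default_sketch LANGUAGES CXX CUDA)

message(STATUS "CUDA architectures: ${CMAKE_CUDA_ARCHITECTURES}")
```

With this arrangement, `-DCMAKE_CUDA_ARCHITECTURES=100` on the command line takes effect unchanged, because the variable is already defined when the check runs.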

Tasks

  • [ ] ensure "native" compilation still picks the local GPU, if present, with the same precedence of user hints as before
  • [ ] ensure the user interface does not break, e.g., the AMREX_CUDA_ARCH env hint still works and has the same precedence
  • [ ] Update docs/logic on Device LTO and avoid breaking users.
  • [ ] Ensure that for HPC machines, we pre-compile down to SASS code (i.e., at least when AMREX_CUDA_ARCH is set or CMAKE_CUDA_ARCHITECTURES is set to a narrow set); otherwise, in an MPI context, every process will compile from PTX to SASS on startup, potentially tens of thousands of times.
  • [ ] CUDA 12.9 raises: Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release
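
The hint-precedence and SASS items above can be sketched as follows. This is only an illustration under stated assumptions: the precedence order (explicit `CMAKE_CUDA_ARCHITECTURES`, then the `AMREX_CUDA_ARCH` environment hint, then `native`) is assumed rather than taken from AMReX's current logic, the hint is assumed to be numeric (such as `8.0` or `80`), and `sketch_pick_cuda_arch` is a hypothetical helper name.

```cmake
cmake_minimum_required(VERSION 3.24)

# Hypothetical helper, not part of AMReX. Assumed precedence:
#   1. CMAKE_CUDA_ARCHITECTURES given explicitly by the user
#   2. AMREX_CUDA_ARCH environment hint (assumed numeric, e.g. "8.0" or "80")
#   3. "native" (the GPUs detected on the build machine)
function(sketch_pick_cuda_arch out_var)
    if(DEFINED CMAKE_CUDA_ARCHITECTURES)
        set(_arch "${CMAKE_CUDA_ARCHITECTURES}")
    elseif(DEFINED ENV{AMREX_CUDA_ARCH})
        # Turn a dotted hint such as "8.0" into CMake's "80" form.
        string(REPLACE "." "" _arch "$ENV{AMREX_CUDA_ARCH}")
    else()
        set(_arch native)
    endif()
    set(${out_var} "${_arch}" PARENT_SCOPE)
endfunction()

sketch_pick_cuda_arch(_picked)
message(STATUS "Selected CUDA architectures: ${_picked}")
```

In a real build the result would be assigned to `CMAKE_CUDA_ARCHITECTURES` before CUDA is enabled. On the SASS item: per the CMake documentation, a plain entry such as `80` generates both real (SASS) and virtual (PTX) code for that architecture, `80-real` generates only SASS, and `80-virtual` only PTX, so an explicit numeric selection should already avoid PTX-to-SASS compilation at startup on matching hardware.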

Checklist

The proposed changes:

  • [x] fix a bug or incorrect behavior in AMReX
  • [x] add new capabilities to AMReX
  • [x] are likely to significantly affect the results of downstream AMReX users
  • [ ] include documentation in the code and/or rst files, if appropriate

samuelpmish avatar Jul 17 '25 15:07 samuelpmish

Thank you for this!

Yes, #3948 is overdue, and it looks like the old logic now also breaks with the latest CUDA 12.9 and CMake.

I am OoO for the rest of the week, but we should try to get this in for the next release of AMReX and I will try to help next week.

Bumping CMake to 3.24+ globally is fine now, please go ahead.

Please check that we can build for the local "native" GPU when one is discovered, to simplify development. Otherwise, let us keep the AMREX_ARCH env hint (if set) that we have used so far to set a default for CMAKE_CUDA_ARCHITECTURES.

ax3l avatar Jul 24 '25 05:07 ax3l

@samuelpmish There is a lot of legacy logic in Tools/CMake/AMReXUtils.cmake and other files where we wrangle CUDA archs and CUDA device LTO flags.

Can you remove/clean those out if you have a chance?

I can help next week, too.

ax3l avatar Jul 24 '25 05:07 ax3l

@cyrush can you potentially update the Catalyst image we use in AMReX/WarpX to include CMake 3.24 or newer? :pray:

ax3l avatar Jul 24 '25 05:07 ax3l

@c-wetterer-nelson can you potentially update the SENSEI image we use in AMReX/WarpX to include CMake 3.24 or newer? :pray:

ax3l avatar Jul 24 '25 05:07 ax3l

Hey Axel, are there still SENSEI users across AMReX/WarpX?

c-wetterer-nelson avatar Jul 24 '25 20:07 c-wetterer-nelson

@ax3l our current Ascent containers (0.9.4) use CMake 3.28. The internal layout changed a bit, but I can address that. It looks like you are using Ascent 0.9.2.

Ascent 0.9.3 (also available) is using CMake 3.26.3.

So a quick update to Ascent 0.9.3 will get you beyond CMake 3.24; down the road I can help get 0.9.4 working in your CI.

cyrush avatar Jul 25 '25 21:07 cyrush

@c-wetterer-nelson

> Hey Axel, are there still SENSEI users across AMReX/WarpX?

That is a good question, I think in WarpX not anymore.

@WeiqunZhang should we drop SENSEI throughout AMReX & BLAST codes?

ax3l avatar Jul 29 '25 16:07 ax3l

I have to move this PR, and testing it downstream, into the 25.09 release cycle because of other deadlines.

There is a hotfix for CUDA 12.9 for now in #4589

ax3l avatar Jul 30 '25 18:07 ax3l

I added a task list to the PR description on things we will need to carefully check with downstream codes to avoid breakage.

ax3l avatar Jul 30 '25 18:07 ax3l

Hey, sorry to disappear for a bit after posting this PR. I'm not sure I understand AMReX's build system well enough to address some of the tasks on my own. Can someone help clarify the requirements for the listed tasks:

- ensure "native" compilation still picks the local GPU, if present, with the same precedence of user hints as before
- ensure the user interface does not break, e.g., the AMREX_CUDA_ARCH env hint still works and has the same precedence
- Update docs/logic on Device LTO and avoid breaking users.
- Ensure that for HPC machines, we pre-compile down to SASS code (i.e., at least when AMREX_CUDA_ARCH is set or CMAKE_CUDA_ARCHITECTURES is set to a narrow set); otherwise, in an MPI context, every process will compile from PTX to SASS on startup, potentially tens of thousands of times.
- CUDA 12.9 raises: Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release

It seems like some of them are already satisfied (e.g. native compilation picking the local GPU). I believe -arch=native already generates SASS for the available GPUs.

Others, like the warning about architectures prior to sm_75, can be suppressed with a flag, but I'm not sure that's always a good thing (as it hides important information from users with those cards).
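
One way to reconcile the two concerns would be to make the suppression opt-in, so the warning stays visible by default. The sketch below is an assumption about how that might look, not something proposed in this PR: the option name `SKETCH_SILENCE_DEPRECATED_GPU_TARGETS` and the project name are hypothetical, and the only nvcc flag used is `-Wno-deprecated-gpu-targets`.

```cmake
cmake_minimum_required(VERSION 3.24)
project(deprecated_arch_warning_sketch LANGUAGES CXX CUDA)  # hypothetical name

# Hypothetical opt-in switch: OFF by default, so users with older cards
# still see nvcc's deprecation warning unless they choose to silence it.
option(SKETCH_SILENCE_DEPRECATED_GPU_TARGETS
       "Pass -Wno-deprecated-gpu-targets to nvcc" OFF)

if(SKETCH_SILENCE_DEPRECATED_GPU_TARGETS)
    # Apply the flag to CUDA sources only, not to C++ sources.
    add_compile_options(
        "$<$<COMPILE_LANGUAGE:CUDA>:-Wno-deprecated-gpu-targets>")
endif()
```

A user who wants a quiet build would configure with `-DSKETCH_SILENCE_DEPRECATED_GPU_TARGETS=ON`. This covers compile steps only; whether the device link step also needs the flag is left open here.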

samuelpmish avatar Aug 07 '25 22:08 samuelpmish