Interface ectrans with GPU backend

Open wdeconinck opened this issue 1 year ago • 9 comments

When the feature "ECTRANS_GPU" is enabled, atlas will now offload all possible spectral transforms to ectrans with the GPU backend. Note that not all functionality is implemented yet; unimplemented features throw a not-implemented exception. By default, the unit tests ignore unimplemented features when this exception is raised. The exception handling depends on an ectrans pull request: ecmwf-ifs/ectrans#193. Without that pull request, the tests will compile but abort/crash at run time.
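
As an illustration of the "ignore not-implemented features" behaviour described above, here is a minimal sketch, not the actual atlas test code. The `NotImplemented` type, the `TestOutcome` enum and the `run_ignoring_not_implemented` helper are all hypothetical names introduced for this example; the real exception type used by atlas and ectrans may differ.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the not-implemented exception;
// this is NOT the atlas or ectrans exception type.
struct NotImplemented : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class TestOutcome { Passed, Skipped, Failed };

// Runs a test body and treats a NotImplemented exception as a skip
// rather than a failure. Any other std::exception counts as a failure.
TestOutcome run_ignoring_not_implemented(const std::function<void()>& body) {
    try {
        body();
        return TestOutcome::Passed;
    }
    catch (const NotImplemented&) {
        return TestOutcome::Skipped;
    }
    catch (const std::exception&) {
        return TestOutcome::Failed;
    }
}
```

The more specific handler comes first, so a not-implemented feature is reported as skipped while genuine errors still fail the test.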

wdeconinck avatar Dec 19 '24 21:12 wdeconinck

Tagging @fmahebert FYI

MarekWlasak avatar Dec 20 '24 07:12 MarekWlasak

Hi @wdeconinck, I'm wondering what the current status of this PR is? The reason I ask is that I've tried to build and run it locally with a GPU-enabled build of ecTrans, but I'm getting errors running some of the Atlas trans tests. For example, in the test_nomesh case when running atlas_test_trans, the comparison of spf (see https://github.com/ecmwf/atlas/blob/f988397e4f8d0e0fdc1257c8936eedc88c729697/src/tests/trans/test_trans.cc#L377) fails (which is after the scatter call that internally uses ecTrans's dist_spec).

I've built ecTrans using the NVHPC/25.1 compilers and the HPC-X MPI implementation (OpenMPI 4.1.7) that the SDK comes with, and all the ecTrans tests pass (CPU and GPU). However, when I link Atlas to ecTrans and run the tests, I get the failures mentioned above. I'm starting to wonder if I might not be building things correctly or if I'm missing some runtime flag. Would you be able to share how you've built this branch of Atlas and ecTrans?

l90lpa avatar Apr 09 '25 20:04 l90lpa

Hi @l90lpa I have just tested this with NVHPC 22.11 and saw no issues like that.

My loaded modules:

  1) cmake/3.28.3   2) prgenv/nvidia   3) gcc/11.2.0   4) nvidia/22.11   5) hpcx-openmpi/2.14.0-cuda   6) eigen/3.4.0   7) fftw/3.3.10   8) ninja/1.11.1

Note that I am not using the OpenMPI that came with the SDK here.

I built the following projects with these cmake options:
fiat: -DENABLE_MPI=ON
ectrans: -DENABLE_ACC=ON -DENABLE_GPU=ON
atlas: -DENABLE_ACC=ON -DENABLE_CUDA=ON -DENABLE_ECTRANS=ON -DENABLE_ECTRANS_GPU=ON

wdeconinck avatar Apr 11 '25 09:04 wdeconinck

Now rebased on latest release.

wdeconinck avatar Apr 11 '25 09:04 wdeconinck

Private downstream CI failed. Workflow name: private-downstream-ci View the logs at https://github.com/ecmwf/private-downstream-ci/actions/runs/14400765285.

github-actions[bot] avatar Apr 11 '25 10:04 github-actions[bot]

Hi @wdeconinck, thanks for getting back to me and sharing your build set-up! I'll try to recreate a similar environment and see if I have better luck.

l90lpa avatar Apr 11 '25 12:04 l90lpa

Hi @wdeconinck, thanks again for sharing your build environment. I was able to get Atlas+ecTrans working using NVHPC 22.11. However, I've been having trouble building some of our code (and dependencies) with NVHPC 22.11 compilers, and so I was wondering if you have a build environment with a recent version of NVHPC that you know works? The reason I ask is because I seem to get test failures when I move to newer versions of NVHPC as mentioned above.

l90lpa avatar Apr 22 '25 14:04 l90lpa

I could reproduce some issues with nvidia/24.5. The issues do not seem to stem from using ectrans-gpu. I will try to fix or work around them separately from this PR, and then rebase this on develop once that is merged.

wdeconinck avatar Apr 29 '25 21:04 wdeconinck

I have managed to compile atlas with nvidia/24.5 and nvidia/24.11 using #278. I have rebased this branch including these changes. It should now work.

Another thing: by default, all atlas tests are run with floating-point-exception trapping enabled. For nvidia versions later than 22.11, it seems that some intrinsic functions like atan2(y,x) are compiled to AVX2-optimised versions (depending on optimization level) which still signal FE_DIVBYZERO, even if there is a guard such as

if(x!=0) atan2(y,x)

because the masking in vectorised code comes after the signal has been sent with AVX2. For this reason it may be required to turn off floating-point-exception trapping (only for running the tests). You can do this in the environment with

export ATLAS_FPE=0
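
To make the guarded-atan2 situation above concrete, the following is a minimal sketch using only the standard `<cfenv>` facilities. The helper name `guarded_atan2_raises_divbyzero` is hypothetical and not part of atlas. It checks whether the FE_DIVBYZERO flag is set after a guarded loop, which is the kind of loop a compiler may vectorise. It does not itself reproduce the NVHPC behaviour; on a plain scalar build the guard is expected to keep the flag clear.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of atlas): evaluates atan2 only where x != 0
// and reports whether FE_DIVBYZERO was raised while doing so.
// Assumes y has at least as many elements as x.
bool guarded_atan2_raises_divbyzero(const std::vector<double>& y,
                                    const std::vector<double>& x,
                                    std::vector<double>& angle) {
    angle.assign(x.size(), 0.0);
    std::feclearexcept(FE_ALL_EXCEPT);  // start from a clean set of flags
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] != 0.0) {  // same guard as in the snippet above
            angle[i] = std::atan2(y[i], x[i]);
        }
    }
    // Inspect the sticky flag instead of trapping, so the program does not abort.
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Checking the flag rather than trapping on it is a design choice for diagnosis only: it lets one compare compilers and optimization levels without the process being killed by a signal.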

wdeconinck avatar May 06 '25 11:05 wdeconinck