Interface ectrans with GPU backend
When the feature `ECTRANS_GPU` is enabled, atlas offloads all supported spectral transforms to ectrans with its GPU backend. Note that as of now not all functionality is implemented, and a not-implemented exception is thrown for the missing parts. By default the unit tests skip the not-implemented features, triggered by catching this exception. The exception handling depends on an ectrans pull request: ecmwf-ifs/ectrans#193. Without that ectrans pull request the tests will compile but abort/crash at run time.
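For reference, a minimal configure sketch showing the feature being switched on at CMake time via the `ENABLE_ECTRANS_GPU` option used by this PR. The source/build directory names are placeholders, and other required options (compilers, dependency paths) are omitted:

```shell
# Sketch only: configure atlas with the ectrans GPU backend enabled.
# "atlas" and "build" are placeholder source/build directories.
cmake -S atlas -B build \
    -DENABLE_ECTRANS=ON \
    -DENABLE_ECTRANS_GPU=ON
cmake --build build
```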
Tagging @fmahebert FYI
Hi @wdeconinck, I'm wondering what the current status of this PR is. I ask because I've tried to build and run it locally with a GPU-enabled build of ecTrans, but I'm getting errors when running some of the Atlas trans tests. For example, in the test_nomesh case of atlas_test_trans, the comparison of spf (see https://github.com/ecmwf/atlas/blob/f988397e4f8d0e0fdc1257c8936eedc88c729697/src/tests/trans/test_trans.cc#L377) fails; this happens after the scatter call that internally uses ecTrans's dist_spec.
I've built ecTrans using the NVHPC/25.1 compilers and the HPC-X MPI implementation (OpenMPI 4.1.7) that the SDK comes with, and all the ecTrans tests pass (CPU and GPU). However, when I link Atlas against ecTrans and run the tests I get the failures mentioned above. I'm starting to wonder whether I'm not building things correctly or am missing some runtime flag. Would you be able to share how you've built this branch of Atlas and ecTrans?
Hi @l90lpa I have just tested this with NVHPC 22.11 and saw no issues like that.
My loaded modules:
1. cmake/3.28.3
2. prgenv/nvidia
3. gcc/11.2.0
4. nvidia/22.11
5. hpcx-openmpi/2.14.0-cuda
6. eigen/3.4.0
7. fftw/3.3.10
8. ninja/1.11.1
Note I am not using the openmpi that came with the SDK here.
I built the following projects with these cmake options:
- fiat: `-DENABLE_MPI=ON`
- ectrans: `-DENABLE_ACC=ON -DENABLE_GPU=ON`
- atlas: `-DENABLE_ACC=ON -DENABLE_CUDA=ON -DENABLE_ECTRANS=ON -DENABLE_ECTRANS_GPU=ON`
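Put together, a sketch of this build chain might look as follows. This is not lifted from the PR itself: the common install prefix, the side-by-side source checkouts, and the omission of other dependencies (e.g. eckit, FFTW discovery) are assumptions for illustration.

```shell
# Sketch of the fiat -> ectrans -> atlas build chain with the options above.
# $PREFIX and the -S source paths are placeholders.
PREFIX=$PWD/install

cmake -S fiat -B build-fiat -DENABLE_MPI=ON -DCMAKE_INSTALL_PREFIX=$PREFIX
cmake --build build-fiat --target install

cmake -S ectrans -B build-ectrans -DENABLE_ACC=ON -DENABLE_GPU=ON \
      -DCMAKE_PREFIX_PATH=$PREFIX -DCMAKE_INSTALL_PREFIX=$PREFIX
cmake --build build-ectrans --target install

cmake -S atlas -B build-atlas -DENABLE_ACC=ON -DENABLE_CUDA=ON \
      -DENABLE_ECTRANS=ON -DENABLE_ECTRANS_GPU=ON \
      -DCMAKE_PREFIX_PATH=$PREFIX -DCMAKE_INSTALL_PREFIX=$PREFIX
cmake --build build-atlas --target install
```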
Now rebased on latest release.
Private downstream CI failed. Workflow name: private-downstream-ci. View the logs at https://github.com/ecmwf/private-downstream-ci/actions/runs/14400765285.
Hi @wdeconinck, thanks for getting back to me and sharing your build set-up! I'll try to recreate a similar environment and see if I have better luck.
Hi @wdeconinck, thanks again for sharing your build environment. I was able to get Atlas+ecTrans working using NVHPC 22.11. However, I've been having trouble building some of our code (and dependencies) with the NVHPC 22.11 compilers, so I was wondering if you have a build environment with a more recent version of NVHPC that you know works? I ask because I seem to get test failures when I move to newer versions of NVHPC, as mentioned above.
I could reproduce some issues with nvidia/24.5. The issues do not seem to stem from using ectrans-gpu. I will try to fix or work around them separately from this PR, and then rebase this branch on develop once that is merged.
I have managed to compile atlas with nvidia/24.5 and nvidia/24.11 using #278. I have rebased this branch including these changes. It should now work.
Another thing: by default, all atlas tests are run with floating-point-exception trapping enabled.
With nvidia compiler versions later than 22.11, it seems that some intrinsic functions like atan2(y,x) are compiled to AVX2-optimised versions (depending on the optimization level) which still signal FE_DIVBYZERO, even when protected with

`if (x != 0) atan2(y, x);`

because in vectorised AVX2 code the masking is applied only after the signal has already been raised. For this reason it may be necessary to turn off floating-point-exception trapping (only for running the tests). You can do this in the environment with
export ATLAS_FPE=0
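For example, to run the affected tests through ctest with trapping disabled (the `-R` selection pattern here is only an illustration, reusing the test name mentioned earlier in this thread):

```shell
# Disable atlas floating-point-exception trapping for this shell session,
# then run the trans tests; adjust the -R pattern as needed.
export ATLAS_FPE=0
ctest -R atlas_test_trans --output-on-failure
```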