E3SM
nvlink multiple definition of device kernel
In a case created with
./cime/scripts/create_test --compiler pgigpu PFS_P2560.T62_oRRS18to6v3.GMPAS-IAF.summit_pgigpu.bench-gmpas_noio
getting
nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_647_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_633_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_608_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_2d_gpu_531_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_2d_gpu_517_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_2d_gpu_492_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink fatal : merge_elf failed
pgacclnk: child process exit status 2: /autofs/nccs-svm1_sw/summit/nvhpc_sdk/rhel8/Linux_ppc64le/21.11/compilers/bin/tools/nvdd
The object mpas_vector_reconstruction.f90.o is built once (as a "common" source) but included twice: once through libice.a and again through libocn.a. It appears the recommendation is to create a separate library for the MPAS-framework GPU kernels and link it separately.
Can't reproduce this issue any more. Closing.
Oops. Turns out that maint-2.0 does not have this issue, but master does. Re-opening.
I have the same issue using E3SM2.0 (with the PGI compiler and OpenMPI). Did you solve this? Can I get some help? Thanks a lot!
@jgfouca can you take a look at this?
Here is my pgigpu_gpunode.cmake
if (COMP_NAME STREQUAL gptl)
  string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_GETTIMEOFDAY")
endif()
if (NOT DEBUG)
  string(APPEND CFLAGS " -O3 -Mvect=nosimd")
endif()
if (NOT DEBUG)
  string(APPEND FFLAGS " -O1 -Mvect=nosimd -DSUMMITDEV_PGI")
endif()
#string(APPEND LDFLAGS " -Minline -ta=nvidia,cc80,cuda11.2,ptxinfo -Mcuda -Minfo=accel")
string(APPEND LDFLAGS " -gpu=cc86,deepcopy -Minfo=accel")
execute_process(COMMAND $ENV{NETCDF_FORTRAN_PATH}/bin/nf-config --flibs OUTPUT_VARIABLE SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0 OUTPUT_STRIP_TRAILING_WHITESPACE)
string(APPEND SLIBS " ${SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0} -L$ENV{PNETCDF_PATH}/lib -lpnetcdf -lhdf5_hl -lhdf5 -lnetcdf -lnetcdff -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib -llapack -lblas")
set(KOKKOS_OPTIONS "--with-cuda=$ENV{CUDA_DIR} --with-cuda-options=enable_lambda")
execute_process(COMMAND $ENV{NETCDF_C_PATH}/bin/nc-config --libs OUTPUT_VARIABLE SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0 OUTPUT_STRIP_TRAILING_WHITESPACE)
execute_process(COMMAND $ENV{PNETCDF_PATH}/bin/pnetcdf-config --libs OUTPUT_VARIABLE SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0 OUTPUT_STRIP_TRAILING_WHITESPACE)
string(APPEND SLIBS " ${SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0}")
set(SUPPORTS_CXX "TRUE")
set(CXX_LINKER "FORTRAN")
set(CXX_LIBS "-lstdc++")
set(NETCDF_C_PATH "$ENV{NETCDF_C_PATH}")
set(NETCDF_FORTRAN_PATH "$ENV{NETCDF_FORTRAN_PATH}")
set(PNETCDF_PATH "$ENV{PNETCDF_PATH}")
My MPAS-G case also failed with this error on Perlmutter (pm-gpu) using the nvidiagpu compiler.
nvlink error : Multiple definition of '_mpas_timekeeping_21' in '../../mpas-framework/src/libocn.a:mpas_timekeeping.f90.o', first defined in 'CMakeFiles/e3sm.exe.dir/global/u2/y/youngsun/repos/github/E3SM/driver-mct/main/cime_comp_mod.F90.o'
nvlink fatal : merge_elf failed
FYI, I used the following create_newcase command line:
/global/homes/y/youngsun/repos/github/E3SM/cime/scripts/create_newcase --case "${CASEDIR}" --case-group mpaso-benchmark-2023 --res T62_oRRS18to6v3 --compset GMPAS-IAF --compiler nvidiagpu --machine pm-gpu --project m4259_g --output-root /global/cfs/cdirs/e3sm/youngsun/mpas/G-bench --walltime 02:00:00
Sorry, this fell off my radar. I will take a look this week.
I'm the NERSC support staff person that has this as a ticket. I'll offer to help if there is something needed from a facility perspective.
This line in the description at the top:
"It appears the recommendation is to create a different library for MPAS-framework gpu kernels and link it separately."
Are you referring to the posting here? https://forums.developer.nvidia.com/t/nvlink-error-multiple-definition-errors-when-linking-to-the-same-library-twice/62892
It sounds like the build structure needs to be re-organized not to duplicate the contents of the libraries.
I'm wondering if a flag like -Wl,--allow-multiple-definition might get it to tolerate the duplication between the libraries. I'm not sure this is the right way to express it, but --allow-multiple-definition is an "ld" flag that sounds like it targets the problem you're seeing. Also, if this problem showed up on Perlmutter but not on other systems, it could be that the default compiler/linker flags are set differently there.
@cponder, thanks for the info. I searched for the flag in the build log and found that "-Wl,--allow-multiple-definition" is already used on the linker command line when e3sm.exe is created. I think the linker flag is defined in nvidiagpu_pm-gpu.cmake in the cmake_macros folder.
@philipwjones Any thoughts on how to resolve this?
Hi all, apologies for the delays on this. I am on Perlmutter and I see the same error with case SMS.T62_oRRS18to6v3.GMPAS-IAF.pm-gpu_nvidiagpu. Working on a fix now.
@cponder, it looks like -Wl,--allow-multiple-definition is already in the link line, so unfortunately that won't be the fix we need.
@sarats - just back from vacation. As @amametjanov noted after this first came up, we had this issue a while ago but it went away - presumably as part of a compiler upgrade at some point. So we never really had to deal with it. I can't remember how the MPAS builds work in E3SM - I thought we were only building a single version of the framework (that's why mpas-framework is separate now) and then the MPAS components should be linking to that library. I could be wrong about that, but @jgfouca should know. We shouldn't need multiple instances of framework code.
@philipwjones , I will try some different compiler versions because I am otherwise stumped.
The error is:
nvlink error : Multiple definition of '_mpas_timekeeping_21' in '../../mpas-framework/src/libocn.a:mpas_timekeeping.f90.o', first defined in 'CMakeFiles/e3sm.exe.dir/pscratch/sd/a/acmetest/E3SM/driver-mct/main/cime_comp_mod.F90.o'
But CMakeFiles/e3sm.exe.dir/pscratch/sd/a/acmetest/E3SM/driver-mct/main/cime_comp_mod.F90.o doesn't define mpas_timekeeping:
% nm ./cmake/cpl/CMakeFiles/e3sm.exe.dir/pscratch/sd/a/acmetest/E3SM/driver-mct/main/cime_comp_mod.F90.o | grep -i mpas
%
@jgfouca - I'm guessing it might get drawn into the cime_comp_mod via a nested include from the ocean model driver?
OK, I have a fix. The problem seems to be that I made the MPAS common library an OBJECT library, as opposed to a SHARED or STATIC library, and then had all the MPAS components link in all the common objects. Unfortunately I have no recollection of why I did things that way, but it seems to work when I do things the more standard way, so we should do that unless we see a reason not to. PR coming.
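For readers hitting the same error, here is a minimal CMake sketch of the difference between the two layouts described above. It is not taken from the E3SM build system or from the PR: the target names (common, ocn, ice, model) and all source file names are hypothetical, and the real fix may be structured differently.

```cmake
# Minimal sketch; all target and file names are hypothetical, not E3SM's.
cmake_minimum_required(VERSION 3.13)
project(mpas_link_sketch LANGUAGES Fortran)

# Pattern described above (shown only as comments): the common code is an
# OBJECT library and each component archive absorbs the same object files,
# so identical objects end up in both libocn.a and libice.a.
#   add_library(common OBJECT framework_a.f90 framework_b.f90)
#   add_library(ocn STATIC ocn_driver.f90 $<TARGET_OBJECTS:common>)
#   add_library(ice STATIC ice_driver.f90 $<TARGET_OBJECTS:common>)

# More conventional layout: the common code lives in one static archive.
add_library(common STATIC framework_a.f90 framework_b.f90)

# Component archives hold only their own objects. For static libraries,
# target_link_libraries records a link dependency; it does not copy
# common's objects into libocn.a or libice.a.
add_library(ocn STATIC ocn_driver.f90)
add_library(ice STATIC ice_driver.f90)
target_link_libraries(ocn PUBLIC common)
target_link_libraries(ice PUBLIC common)

# The executable links both components; libcommon.a comes along as a
# transitive dependency of ocn and ice.
add_executable(model main.f90)
target_link_libraries(model PRIVATE ocn ice)
```

With this layout each common object file exists in exactly one archive, which is the property the nvlink errors above suggest was being violated. Whether this alone is sufficient for a given compiler version is something to confirm against the actual build.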