
nvlink multiple definition of device kernel

Open amametjanov opened this issue 2 years ago • 2 comments

In a case created with

./cime/scripts/create_test --compiler pgigpu PFS_P2560.T62_oRRS18to6v3.GMPAS-IAF.summit_pgigpu.bench-gmpas_noio

getting

nvlink error   : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_647_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error   : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_633_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error   : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_1d_gpu_608_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error   : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_2d_gpu_531_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error   : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_2d_gpu_517_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink error   : Multiple definition of 'mpas_vector_reconstruction_mpas_reconstruct_2d_gpu_492_gpu' in '../../mpas-framework/src/libocn.a:mpas_vector_reconstruction.f90.o', first defined in '../../mpas-framework/src/libice.a:mpas_vector_reconstruction.f90.o'
nvlink fatal   : merge_elf failed
pgacclnk: child process exit status 2: /autofs/nccs-svm1_sw/summit/nvhpc_sdk/rhel8/Linux_ppc64le/21.11/compilers/bin/tools/nvdd

The object mpas_vector_reconstruction.f90.o is built once (as a "common" source) but included twice: once through libice.a and once through libocn.a. It appears the recommendation is to create a different library for MPAS-framework gpu kernels and link it separately.

amametjanov avatar Apr 20 '22 20:04 amametjanov

Can't reproduce this issue any more. Closing.

amametjanov avatar Jun 23 '22 02:06 amametjanov

Oops. Turns out that maint-2.0 does not have this issue, but master does. Re-opening.

amametjanov avatar Jun 24 '22 02:06 amametjanov

I also have the same issue using E3SM2.0 (with the PGI compiler and OpenMPI). Did you solve this? Can I get some help? Thanks a lot!

lulu1599 avatar Mar 21 '23 02:03 lulu1599

@jgfouca can you take a look at this?

rljacob avatar Mar 21 '23 03:03 rljacob

Here is my pgigpu_gpunode.cmake

if (COMP_NAME STREQUAL gptl)
  string(APPEND CPPDEFS " -DHAVE_NANOTIME -DBIT64 -DHAVE_SLASHPROC -DHAVE_GETTIMEOFDAY")
endif()

if (NOT DEBUG)
  string(APPEND CFLAGS " -O3 -Mvect=nosimd")
endif()
if (NOT DEBUG)
  string(APPEND FFLAGS " -O1 -Mvect=nosimd -DSUMMITDEV_PGI")
endif()
#string(APPEND LDFLAGS " -Minline -ta=nvidia,cc80,cuda11.2,ptxinfo -Mcuda -Minfo=accel")
string(APPEND LDFLAGS " -gpu=cc86,deepcopy -Minfo=accel")
execute_process(COMMAND $ENV{NETCDF_FORTRAN_PATH}/bin/nf-config --flibs OUTPUT_VARIABLE SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0 OUTPUT_STRIP_TRAILING_WHITESPACE)
string(APPEND SLIBS " ${SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0} -L$ENV{PNETCDF_PATH}/lib -lpnetcdf -lhdf5_hl -lhdf5 -lnetcdf -lnetcdff -L/opt/nvidia/hpc_sdk/Linux_x86_64/21.9/compilers/lib -llapack -lblas")
set(KOKKOS_OPTIONS "--with-cuda=$ENV{CUDA_DIR} --with-cuda-options=enable_lambda")
execute_process(COMMAND $ENV{NETCDF_C_PATH}/bin/nc-config --libs OUTPUT_VARIABLE SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0 OUTPUT_STRIP_TRAILING_WHITESPACE)
execute_process(COMMAND $ENV{PNETCDF_PATH}/bin/pnetcdf-config --libs OUTPUT_VARIABLE SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0 OUTPUT_STRIP_TRAILING_WHITESPACE)
string(APPEND SLIBS " ${SHELL_CMD_OUTPUT_BUILD_INTERNAL_IGNORE0}")
set(SUPPORTS_CXX "TRUE")
set(CXX_LINKER "FORTRAN")
set(CXX_LIBS "-lstdc++")
set(NETCDF_C_PATH "$ENV{NETCDF_C_PATH}")
set(NETCDF_FORTRAN_PATH "$ENV{NETCDF_FORTRAN_PATH}")
set(PNETCDF_PATH "$ENV{PNETCDF_PATH}")

lulu1599 avatar Mar 21 '23 07:03 lulu1599

My MPAS-G case also failed with this error on Perlmutter (pm-gpu) using the nvidiagpu compiler.

nvlink error : Multiple definition of '_mpas_timekeeping_21' in '../../mpas-framework/src/libocn.a:mpas_timekeeping.f90.o', first defined in 'CMakeFiles/e3sm.exe.dir/global/u2/y/youngsun/repos/github/E3SM/driver-mct/main/cime_comp_mod.F90.o'
nvlink fatal : merge_elf failed

FYI, I used the following create_newcase command line:

/global/homes/y/youngsun/repos/github/E3SM/cime/scripts/create_newcase --case "${CASEDIR}" --case-group mpaso-benchmark-2023 --res T62_oRRS18to6v3 --compset GMPAS-IAF --compiler nvidiagpu --machine pm-gpu --project m4259_g --output-root /global/cfs/cdirs/e3sm/youngsun/mpas/G-bench --walltime 02:00:00

grnydawn avatar Sep 06 '23 19:09 grnydawn

Sorry, this fell off my radar. I will take a look this week.

jgfouca avatar Sep 06 '23 19:09 jgfouca

I'm the NERSC support staff person that has this as a ticket. I'll offer to help if there is something needed from a facility perspective.

Larofeticus avatar Sep 06 '23 20:09 Larofeticus

This line in the description at the top:

"It appears the recommendation is to create a different library for MPAS-framework gpu kernels and link it separately."

Are you referring to the posting here? https://forums.developer.nvidia.com/t/nvlink-error-multiple-definition-errors-when-linking-to-the-same-library-twice/62892

It sounds like the build structure needs to be reorganized so that it does not duplicate the contents of the libraries.

cponder avatar Sep 07 '23 08:09 cponder

I'm wondering if a flag like -Wl,--allow-multiple-definition might get the linker to tolerate the duplication between the libraries. I'm not sure this is the right way to express it, but --allow-multiple-definition is a flag to "ld" that sounds like it targets the problem you're seeing. Also, if this problem shows up on Perlmutter but not on other systems, the default compiler/linker flags may be set differently there.

cponder avatar Sep 07 '23 14:09 cponder

@cponder, thanks for the info. I searched for the flag in the build log and found that "-Wl,--allow-multiple-definition" is already used in the linker command line when e3sm.exe is created. I think the linker flag is defined in nvidiagpu_pm-gpu.cmake in the cmake_macros folder.

grnydawn avatar Sep 07 '23 18:09 grnydawn

@philipwjones Any thoughts on how to resolve this?

sarats avatar Sep 11 '23 15:09 sarats

Hi all, apologies for delays on this. I am on perlmutter and I see the same error with case SMS.T62_oRRS18to6v3.GMPAS-IAF.pm-gpu_nvidiagpu. Working on a fix now.

jgfouca avatar Sep 11 '23 16:09 jgfouca

@cponder, it looks like -Wl,--allow-multiple-definition is already in the link line, so unfortunately that won't be the fix we need.

jgfouca avatar Sep 11 '23 16:09 jgfouca

@sarats - just back from vacation. As @amametjanov noted after this first came up, we had this issue a while ago but it went away - presumably as part of a compiler upgrade at some point. So we never really had to deal with it. I can't remember how the MPAS builds work in E3SM - I thought we were only building a single version of the framework (that's why mpas-framework is separate now) and then the MPAS components should be linking to that library. I could be wrong about that, but @jgfouca should know. We shouldn't need multiple instances of framework code.

philipwjones avatar Sep 11 '23 16:09 philipwjones

@philipwjones , I will try some different compiler versions because I am otherwise stumped.

The error is:

nvlink error   : Multiple definition of '_mpas_timekeeping_21' in '../../mpas-framework/src/libocn.a:mpas_timekeeping.f90.o', first defined in 'CMakeFiles/e3sm.exe.dir/pscratch/sd/a/acmetest/E3SM/driver-mct/main/cime_comp_mod.F90.o'

But CMakeFiles/e3sm.exe.dir/pscratch/sd/a/acmetest/E3SM/driver-mct/main/cime_comp_mod.F90.o doesn't define mpas_timekeeping:

% nm ./cmake/cpl/CMakeFiles/e3sm.exe.dir/pscratch/sd/a/acmetest/E3SM/driver-mct/main/cime_comp_mod.F90.o | grep -i mpas
%

jgfouca avatar Sep 11 '23 16:09 jgfouca

@jgfouca - I'm guessing it might get drawn into the cime_comp_mod via a nested include from the ocean model driver?

philipwjones avatar Sep 11 '23 16:09 philipwjones

OK, I have a fix. The problem seems to be the fact that I made the MPAS common library an OBJECT library, as opposed to a SHARED or STATIC library, and then had all the MPAS components link in all the common objects. I unfortunately have no recollection as to why I did things that way but it seems to work when I do things a more standard way, so we should do it the standard way unless we see a reason not to. PR coming.

jgfouca avatar Sep 11 '23 17:09 jgfouca
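
For readers landing here, the difference described in the last comment can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual E3SM build files or the change in the PR: the target names (mpas_common, ocn, ice, model.exe) and the driver/main source file names are hypothetical, and only the two framework file names are taken from the error messages above.

```cmake
# Hypothetical sketch; target names and driver sources are invented for illustration.
cmake_minimum_required(VERSION 3.12)
project(mpas_link_sketch Fortran)

# Framework sources shared by the MPAS components (names taken from the nvlink errors).
set(FRAMEWORK_SRCS mpas_vector_reconstruction.f90 mpas_timekeeping.f90)

# Pattern described as the cause: an OBJECT library whose objects are
# copied into every component archive, e.g.
#   add_library(mpas_common OBJECT ${FRAMEWORK_SRCS})
#   add_library(ocn STATIC ocn_driver.f90 $<TARGET_OBJECTS:mpas_common>)
#   add_library(ice STATIC ice_driver.f90 $<TARGET_OBJECTS:mpas_common>)
# With that layout both libocn.a and libice.a contain
# mpas_vector_reconstruction.f90.o, so its device kernels are defined twice.

# More standard pattern: the framework is one STATIC library and the
# components link against it instead of embedding its objects.
add_library(mpas_common STATIC ${FRAMEWORK_SRCS})

add_library(ocn STATIC ocn_driver.f90)
target_link_libraries(ocn PUBLIC mpas_common)

add_library(ice STATIC ice_driver.f90)
target_link_libraries(ice PUBLIC mpas_common)

add_executable(model.exe main.f90)
target_link_libraries(model.exe PRIVATE ocn ice)
```

In the second layout the framework objects live in a single archive, so the component archives no longer carry their own copies of the GPU kernels into the final link.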