Build errors in kokkos on pm-cpu

Open lxu16 opened this issue 1 year ago • 20 comments

I tried to build an E3SM-ChemUCI model on Perlmutter. I updated the machine and compiler files under $E3SM/cime_config/machines to match the v3atm branch below.

https://github.com/E3SM-Project/E3SM/tree/v3atm/eam/master_MAM5_wetaero/cime_config/machines

I hit the following error when building Kokkos. Has anybody seen similar errors when building E3SM on pm-cpu? Any suggestions are welcome.

/global/u1/l/lix011/E3SM/externals/kokkos/core/src/Kokkos_Tuners.hpp:261:31: error: no member named 'sub_values' in 'Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, void>>'
    size_t index         = in.sub_values.size() * fraction_to_traverse;
                           ~~ ^
/global/u1/l/lix011/E3SM/externals/kokkos/core/src/Kokkos_Tuners.hpp:276:18: note: in instantiation of member function 'Kokkos::Tools::Experimental::Impl::GetMultidimensionalPoint<Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, void>>, double, double>::build' requested here
  return helper::build(in, std::get<Is>(indices).value.double_value...);
                 ^
/global/u1/l/lix011/E3SM/externals/kokkos/core/src/Kokkos_Tuners.hpp:225:30: error: no member named 'root_values' in 'Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, void>>'
    size_t index = dimension.root_values.size() * fraction_to_traverse;
                   ~~~~~~~~~ ^
/global/u1/l/lix011/E3SM/externals/kokkos/core/src/Kokkos_Tuners.hpp:263:45: note: in instantiation of member function 'Kokkos::Tools::Experimental::Impl::DimensionValueExtractor<Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, void>>>::get' requested here
        DimensionValueExtractor<node_type>::get(in, fraction_to_traverse));
                                            ^
/global/u1/l/lix011/E3SM/externals/kokkos/core/src/Kokkos_Tuners.hpp:276:18: note: in instantiation of member function 'Kokkos::Tools::Experimental::Impl::GetMultidimensionalPoint<Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, Kokkos::Tools::Experimental::Impl::ValueHierarchyNode<long, void>>, double, double>::build' requested here
  return helper::build(in, std::get<Is>(indices).value.double_value...);
                 ^
6 warnings and 2 errors generated.
gmake[2]: *** [core/src/CMakeFiles/kokkoscore.dir/build.make:93: core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:956: core/src/CMakeFiles/kokkoscore.dir/all] Error 2
gmake: *** [Makefile:149: all] Error 2
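
For readers working through a long build log like this one, here is a minimal sketch that keeps only the compiler error lines and drops the notes and gmake lines. The function name `list_errors` and the log file name are hypothetical, and the pattern assumes the `file:line:column: error:` layout seen in the log above.

```shell
#!/usr/bin/env bash
# Sketch: print only the compiler error lines from a build log.
# Assumes errors are formatted as "path:line:column: error: message",
# as in the log above. list_errors is a hypothetical helper name.

list_errors() {
  local log_file="$1"
  grep -E '^[^[:space:]]+:[0-9]+:[0-9]+: error: ' "$log_file"
}

# Example (the log file name is an assumption):
#   list_errors kokkos_build.log
```

For the log above, this would leave just the two `no member named` lines from Kokkos_Tuners.hpp.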

lxu16 avatar Aug 08 '23 20:08 lxu16

I formatted the code block.

Why are you using this specific branch? Could you please retry with the master branch? You're using an outdated branch and the code of interest has all been merged into master now.

mahf708 avatar Aug 09 '23 00:08 mahf708

This specific branch was recommended for the chemistry work by @tangq, who is the developer of the branch.

lxu16 avatar Aug 09 '23 16:08 lxu16

@lxu16, I spoke with @ndkeen, who is visiting our lab today, about the background of your study. Noel understands why you need to run the simulations with the branch instead of master. My understanding is that Noel has some ideas that would work, such as using the configuration files on master or maint-2.0 with the branch, or GNU instead of Intel.

tangq avatar Aug 09 '23 21:08 tangq

@tangq, thanks for helping clarify the modeling background behind this work.

lxu16 avatar Aug 09 '23 21:08 lxu16

Is that error with the GNU or Intel compilers? I would think you should be able to use the GNU compilers with your branch, as this was working previously. The Intel compiler was only added to maint-2.0/master recently, as NERSC only installed it recently.

ndkeen avatar Aug 09 '23 21:08 ndkeen

I used the Intel compiler: cmake -DCMAKE_CXX_COMPILER=/opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx. I will try the GNU compiler to see how it goes.

lxu16 avatar Aug 09 '23 21:08 lxu16

@ndkeen Could you point me to sample machine files and compiler option files that I could adopt in order to use the GNU compiler? Thanks!

lxu16 avatar Aug 09 '23 22:08 lxu16

What I'm saying is that GNU should work as-is for that branch. It would be the default compiler.

ndkeen avatar Aug 09 '23 22:08 ndkeen

My current branch was taken from @tangq's E3SM-ChemUCI_amip branch on Dec. 1. The commit is 5c1a8629027306b6da6a631b821654ccd29c444b.

https://github.com/E3SM-Project/E3SM/tree/tangq/atm/chemUCI_amip/cime_config/machines

This branch does not yet include the Perlmutter machine files and compiler options.
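
One way to confirm whether a given checkout knows about a machine is to search its machines file for the entry. This is a rough sketch: `has_machine` is a hypothetical helper, and it assumes machine entries are written as `<machine MACH="name">`, as in E3SM's config_machines.xml. The file path in the example is also an assumption and may differ on older branches.

```shell
#!/usr/bin/env bash
# Sketch: report whether a machines XML file defines a given machine.
# Assumes entries look like <machine MACH="name">; has_machine is a
# hypothetical helper, not part of CIME.

has_machine() {
  local xml_file="$1" machine="$2"
  # -F treats the pattern as a fixed string, so names like pm-cpu match literally.
  grep -qF "<machine MACH=\"${machine}\"" "$xml_file"
}

# Example, run from the top of an E3SM checkout (path is an assumption):
#   has_machine cime_config/machines/config_machines.xml pm-cpu \
#     && echo "pm-cpu is defined" || echo "pm-cpu is missing"
```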

lxu16 avatar Aug 09 '23 22:08 lxu16

To clarify, which branch are you talking about?

Can you clarify?

mahf708 avatar Aug 09 '23 23:08 mahf708

When I check out tangq/atm/chemUCI_amip, the machine files are very old; they predate the addition of pm-cpu, or even perlmutter, as a machine. The branch still uses config_machines.xml, which is closer to maint-1.0.

ndkeen avatar Aug 09 '23 23:08 ndkeen

Overall, it may be much easier to see if you and/or Qi could put together a more recent branch ... even maint-2.0 doesn't currently work well with the intel compiler on pm-cpu without updating the scorpio modules, etc. --- so it's a hassle and much less performant :/

mahf708 avatar Aug 09 '23 23:08 mahf708

Sorry about the confusion. tangq/atm/chemUCI_amip is the branch I used, and I added some code on top of this version of the Chem-UCI branch. It dates back to Dec. 1, 2022.

lxu16 avatar Aug 09 '23 23:08 lxu16

Which branch should I try to checkout and add changes to?

ndkeen avatar Aug 09 '23 23:08 ndkeen

BTW, I downloaded the newest master. It compiles successfully on pm-cpu with the Intel compiler.

lxu16 avatar Aug 09 '23 23:08 lxu16

tangq/atm/chemUCI_amip. The commit I used is 5c1a8629027306b6da6a631b821654ccd29c444b, which I believe is very close to the version merged into E3SMv2.

lxu16 avatar Aug 09 '23 23:08 lxu16

I saw similar Kokkos errors after updating the compiler- and machine-related files. I wonder if anyone has succeeded in running the maint-2.1 branch (i.e., E3SMv2.1) with the Intel compiler on Perlmutter.

lxu16 avatar Jan 24 '24 18:01 lxu16

The maint-2.1 branch seems fine to me on pm-cpu. I tested with a few tests, including e3sm_production.

ndkeen avatar Jan 24 '24 19:01 ndkeen

@ndkeen Could you share the run script for the standard EAM compset test on maint-2.1? I want to see if it works for a freshly cloned maint-2.1. Thanks!

lxu16 avatar Jan 24 '24 19:01 lxu16

I think maint-2.1 will work on pm-cpu both before and after my recent PR, which makes some adjustments.

To run a test, for example:

cd cime/scripts
./create_test SMS_Ln5.ne4pg2_oQU480.F2010
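
Since the thread is about GNU versus Intel, it may help to run the same smoke test once per compiler. This sketch only prints the command lines so they can be reviewed first. It assumes `create_test` on your branch accepts `--machine` and `--compiler` (worth confirming with `./create_test --help`). `smoke_test_cmd` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: build the create_test command line for one test and one compiler.
# Assumes create_test accepts --machine and --compiler on your branch.
# smoke_test_cmd is a hypothetical helper; it prints, it does not run anything.

smoke_test_cmd() {
  local compiler="$1"                                  # e.g. gnu or intel
  local test_name="${2:-SMS_Ln5.ne4pg2_oQU480.F2010}"  # default: the test named above
  printf './create_test %s --machine pm-cpu --compiler %s\n' "$test_name" "$compiler"
}

# Print one command per compiler; run them by hand from cime/scripts.
for c in gnu intel; do
  smoke_test_cmd "$c"
done
```

Comparing the two runs shows whether a failure is tied to the compiler rather than to the branch.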

Here is the dir where you can find all of the tests I tried with maint-2.1: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/m21up

ndkeen avatar Jan 24 '24 19:01 ndkeen