Kokkos::Impl::ParallelReduce< HIP > requested too large team size
I'm experimenting with stand-alone Homme on Frontier with ROCm 5.7.1 and 128 vertical levels, and my runs are failing with the following output:
Kokkos::Impl::ParallelReduce< HIP > requested too large team size
The core dump points to this line:
https://github.com/E3SM-Project/E3SM/blob/fff7243869f58856906c50d276023237ccd8a140/components/homme/src/theta-l_kokkos/cxx/CaarFunctorImpl.hpp#L350
I added some debug output and found that m_policy_pre has a team_size() of 16 and an impl_vector_length() of 64, for a total of 16 x 64 = 1024 threads per team. That value is indeed too big for the 512-thread limit in the definition of m_policy_pre:
#ifndef NDEBUG
template<typename Tag>
using TeamPolicyType = Kokkos::TeamPolicy<ExecSpace,Kokkos::LaunchBounds<512,1>,Tag>;
#else
template<typename Tag>
using TeamPolicyType = Kokkos::TeamPolicy<ExecSpace,Tag>;
#endif
TeamPolicyType<TagPreExchange> m_policy_pre;
Notice the Kokkos::LaunchBounds<512,1>.
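To make the mismatch concrete, here is a minimal sketch in plain C++ (no Kokkos required). The helpers threads_per_team and fits_launch_bounds are hypothetical names, not Kokkos or Homme functions, and the sketch assumes the relevant limit is simply team_size * vector_length <= the first LaunchBounds argument; the actual check inside Kokkos may be more involved.

```cpp
// Hypothetical helpers for illustration only; not part of Kokkos or Homme.

// Threads launched per team: one thread per (team member, vector lane) pair.
constexpr int threads_per_team(int team_size, int vector_length) {
  return team_size * vector_length;
}

// Assumption: a policy is acceptable when its threads per team do not exceed
// the first argument of Kokkos::LaunchBounds<max_threads, min_blocks>.
constexpr bool fits_launch_bounds(int team_size, int vector_length,
                                  int max_threads) {
  return threads_per_team(team_size, vector_length) <= max_threads;
}

// Values from the debug output: team_size() == 16, impl_vector_length() == 64.
static_assert(threads_per_team(16, 64) == 1024, "16 x 64 threads per team");
static_assert(!fits_launch_bounds(16, 64, 512), "exceeds LaunchBounds<512,1>");
static_assert(fits_launch_bounds(16, 64, 1024), "would fit a 1024 bound");
```

With a vector length of 32 instead of 64, the same team size gives 512 threads, which fits the existing bound.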
I don't know why this is only showing up now. Maybe a newer version of Kokkos or ROCm checks these settings more carefully? Regardless, I think we want to allow m_policy_pre to have 1024 threads (4x4x64), so I think Kokkos::LaunchBounds<512,1> should not be used on AMD GPUs, where the wavefront (warp) size is 64 threads instead of 32.
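One possible direction, sketched below under stated assumptions, is to derive the bound from the warp/wavefront width rather than hard-coding 512. The LaunchBounds struct here is a stand-in so the example compiles without Kokkos, and TeamLaunchBounds, NvidiaBounds, and AmdBounds are hypothetical names; the sketch also assumes the 512 was originally chosen as 4x4x32.

```cpp
// Stand-in for Kokkos::LaunchBounds so this sketch compiles without Kokkos.
// The member name max_threads is made up for this example.
template <unsigned MaxThreads, unsigned MinBlocks>
struct LaunchBounds {
  static constexpr unsigned max_threads = MaxThreads;
  static constexpr unsigned min_blocks = MinBlocks;
};

// Hypothetical alias: a 4x4 team with one warp/wavefront of vector lanes each.
template <unsigned WarpSize>
using TeamLaunchBounds = LaunchBounds<4 * 4 * WarpSize, 1>;

using NvidiaBounds = TeamLaunchBounds<32>;  // 4x4x32 = 512
using AmdBounds = TeamLaunchBounds<64>;     // 4x4x64 = 1024

static_assert(NvidiaBounds::max_threads == 512, "32-wide warps give 512");
static_assert(AmdBounds::max_threads == 1024, "64-wide wavefronts give 1024");
```

In the real code, the choice between the two bounds would presumably be made at compile time, for example by guarding the alias with KOKKOS_ENABLE_HIP, or the launch bounds could simply be dropped for HIP builds.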