Stack frame hunt
While working on #3860, we had a discussion with @psychocoderHPC and checked the stack frames produced when using 8th order (4 neighbors) FDTD and the corresponding incident field. Besides the usual suspects (RNG init, png output), the FDTD kernel had a 336-byte stack frame and the incident field kernel a 240-byte stack frame, both with 0 bytes spill stores and 0 bytes spill loads. Looking into the implementation, we found that the constructor of AOFDTDWeights is not constexpr, and that operator[] has a suspicious check which may also prevent it from being constexpr. So altogether it is not clear what happens with these weights inside the FDTD kernel: are they recalculated each time, stored in registers (or worse), or some combination of those?
cc @steindev
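To illustrate the constexpr concern, here is a minimal sketch of a weights container whose construction and indexing can both be evaluated at compile time. `ConstexprWeights` and `makeWeights` are hypothetical names, not the actual AOFDTDWeights class, and the coefficients are the textbook 8th-order staggered-grid values used as placeholders; whether PIConGPU computes exactly these is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a finite-difference weights container;
// this is NOT the PIConGPU AOFDTDWeights class.
template<std::size_t T_numNeighbors>
struct ConstexprWeights
{
    float values[T_numNeighbors];

    // Plain indexing without a runtime range check, so the operator
    // remains usable inside constant expressions.
    constexpr float operator[](std::size_t idx) const
    {
        return values[idx];
    }
};

// Textbook coefficients of the 8th-order staggered-grid first derivative
// (4 neighbors). Placeholder values for illustration only.
constexpr ConstexprWeights<4> makeWeights()
{
    return {{1225.0f / 1024.0f, -245.0f / 3072.0f, 49.0f / 5120.0f, -5.0f / 7168.0f}};
}

// Because everything above is constexpr, the compiler can fold the weights
// into immediates instead of recomputing or storing them at runtime.
constexpr auto weights = makeWeights();
static_assert(weights[0] > 1.0f, "weights must be available at compile time");
```

If the `static_assert` compiles, the weights are guaranteed to be compile-time constants; that alone does not prove the kernel has no stack frame, so the ptxas output still needs to be checked.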
As investigated by @psychocoderHPC, it may be due to PML internals and unrelated to the AOFDTD implementation; we misattributed it because we forgot that FDTD and PML share the same kernel template. To be further investigated.
Edit: indeed it was the PML functor used, not the normal FDTD one
Commenting out this break, which is optional there (the function works either way), doesn't reduce the stack frame size of the kernel, but seems to largely reduce the register usage. Replacing it with return makes matters worse in that regard, and replacing the range-based for loop with a C-style one doesn't change anything.
After some more investigation, the effect also depends on the CUDA version used. E.g. with CUDA 11.0 and CUDA 11.4, different kernels have non-zero stack frames for the same setup.
Some more places with stack frames
///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:99 auto crossedBoundary = pmacc::DataSpace<simDim>::create(0);
.loc 116 99 44, function_name $L__info_string842, inlined_at 113 74 29
///home/rwidera/workspace/picongpu/include/pmacc/../pmacc/dimensions/DataSpace.hpp:140 tmp[i] = value;
.loc 117 140 17, function_name $L__info_string602, inlined_at 116 99 44
mov.u32 %r354, 0;
st.local.u32 [%rd2], %r354;
st.local.u32 [%rd2+4], %r354;
st.local.u32 [%rd2+8], %r354;
$L__tmp9619:
///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:102 if(offsetToTotalOrigin[d] < m_parameters.beginInternalCellsTotalAllBoundaries[d])
.loc 116 102 53, function_name $L__info_string842, inlined_at 113 74 29
setp.lt.s32 %p5, %r15, %r91;
@%p5 bra $L__BB33_7;
bra.uni $L__BB33_4;
$L__BB33_7:
.loc 116 0 53
mov.u32 %r354, -1;
$L__tmp9620:
///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:103 crossedBoundary[d] = -1;
.loc 116 103 29, function_name $L__info_string842, inlined_at 113 74 29
st.local.u32 [%rd2], %r354;
bra.uni $L__BB33_8;
$L__tmp9621:
$L__BB33_4:
///home/rwidera/workspace/picongpu/include/pmacc/../picongpu/particles/boundary/Thermal.hpp:104 else if(offsetToTotalOrigin[d] >= m_parameters.endInternalCellsTotalAllBoundaries[d])
.loc 116 104 59, function_name $L__info_string842, inlined_at 113 74 29
setp.lt.s32 %p6, %r15, %r94;
@%p6 bra $L__BB33_6;
bra.uni $L__BB33_5;
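The `st.local` instructions above come from zero-initializing `crossedBoundary` and then writing to it by index. One possible way to avoid an indexed local array is to compute each component as a value and build the result in one go. The sketch below assumes that: `Int3`, `crossedDirection`, and `crossedBoundary` are hypothetical stand-ins rather than pmacc types, and the `+1` for the upper boundary is an assumption since that branch is not shown in the dump. Whether the compiler then keeps everything in registers would still have to be verified in the ptxas output.

```cpp
#include <cassert>

// Hypothetical 3D integer triple standing in for pmacc::DataSpace<3>.
struct Int3
{
    int x, y, z;
};

// Returns -1 below the internal area, +1 at or beyond its end, 0 inside.
// The +1 case is an assumption; only the -1 branch is visible in the PTX dump.
constexpr int crossedDirection(int offset, int begin, int end)
{
    return offset < begin ? -1 : (offset >= end ? 1 : 0);
}

// Builds the result component by component, with no zero-initialized
// temporary and no runtime-indexed stores.
constexpr Int3 crossedBoundary(Int3 offset, Int3 begin, Int3 end)
{
    return Int3{
        crossedDirection(offset.x, begin.x, end.x),
        crossedDirection(offset.y, begin.y, end.y),
        crossedDirection(offset.z, begin.z, end.z)};
}
```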
With the current dev, I observed a stack frame in kernelMoveAndMark with the SPEC benchmark when using the particle shape PQS:
ptxas info : Compiling entry function '_ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvN
S_3VecIT0_T1_EET2_DpT3_' for 'sm_70'
ptxas info : Function properties for _ZN6alpaka16uniform_cuda_hip6detail20uniformCudaHipKernelINS_12AccGpuCudaRtISt17integral_constantImLm3EEjEES5_jN5cupla16cupla_cuda_async11CuplaKernelIN8picongpu26KernelMoveAndMarkParticlesILj256EN5pmacc20SuperCellDescriptionINSC_4math2CT6VectorIN4mpl_10integral_cIiLi8EEESJ_NSI_IiLi4EEEEENSG_INSI_IiLi2EEESM_SM_EESN_EEEEEEJNSC_12ParticlesBoxINSC_5FrameINSC_15ParticlesBufferINSC_19ParticleDescriptionINSC_4meta6StringIJLc101EEEESL_N5boost3mpl6v_itemINSA_9weightingENS10_INSA_8momentumENS10_INSA_8positionINSA_12position_picENSC_13pmacc_isAliasEEENSZ_7vector0INSH_2naEEELi0EEELi0EEELi0EEENS10_INSA_11chargeRatioINSA_20ChargeRatioElectronsES15_EENS10_INSA_9massRatioINSA_18MassRatioElectronsES15_EENS10_INSA_7currentINSA_13currentSolver3EmZINSA_9particles6shapes3PQSENS1K_8strategy16CachedSupercellsEEES15_EENS10_INSA_13interpolationINSA_28FieldToParticleInterpolationIS1O_NSA_30AssignedTrilinearInterpolationEEES15_EENS10_INSA_5shapeIS1O_S15_EENS10_INSA_14particlePusherINS1M_6pusher5BorisES15_EES19_Li0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSC_17HandleGuardRegionINSC_9particles8policies17ExchangeParticlesENS2C_9DoNothingEEES19_S19_EESL_N8mallocMC9AllocatorIS6_NS2H_16CreationPolicies7ScatterINSA_16DeviceHeapConfigENS2J_11ScatterConf27DefaultScatterHashingParamsEEENS2H_20DistributionPolicies4NoopENS2H_11OOMPolicies10ReturnNullENS2H_19ReservePoolPolicies9AlpakaBufIS6_EENS2H_17AlignmentPolicies6ShrinkINS2W_12ShrinkConfig19DefaultShrinkConfigEEEEELj3EE29OperatorCreatePairStaticArrayILj256EEENSU_ISX_SL_NS10_INSC_9multiMaskENS10_INSC_12localCellIdxES1C_Li0EEELi0EEES29_S2F_S19_NS10_INSC_12NextFramePtrINSH_3argILi1EEEEENS10_INSC_16PreviousFramePtrIS3B_EES19_Li0EEELi0EEEEEEENS2H_19AllocatorHandleImplIS31_EELj3EEENSC_7DataBoxINSC_10PitchedBoxINSE_6VectorIfLi3ENSE_16StandardAccessorENSE_17StandardNavigatorENSE_6detail17Vector_componentsIfLi3EEEEELj3EEEEES3W_jNSA_20PushParticlePerFrameIS22_SL_S1W_EENSC_11AreaMappingILj3ENSC_18MappingDescriptionILj3ESL_EEEEEEEvNS_
3VecIT0_T1_EET2_DpT3_
160 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 48 registers, 27664 bytes smem, 512 bytes cmem[0], 16 bytes cmem[2]
Here is some more information on why it is important to remove all stack frame usage: https://stackoverflow.com/a/7816434 It is not only about performance: stack frames also require additional global memory at runtime. By default, PIConGPU keeps only 300 MiB of device memory free. If we execute a kernel that uses stack frames, it can run out of memory at runtime.
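To get a feeling for the numbers, here is a rough back-of-the-envelope estimate. It assumes the driver reserves the per-thread stack frame for every thread that can be resident on the device, which is a simplification and not a documented guarantee. The device figures (80 SMs, 2048 resident threads per SM) are those of a V100, assumed here because the log above targets sm_70; `estimateLocalMemoryBytes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Rough estimate: stack frame per thread times the maximum number of
// threads that can be resident on the whole device at once.
constexpr std::uint64_t estimateLocalMemoryBytes(
    std::uint64_t stackFrameBytes,
    std::uint64_t numSMs,
    std::uint64_t maxThreadsPerSM)
{
    return stackFrameBytes * numSMs * maxThreadsPerSM;
}

// Assumed V100 figures: 80 SMs, 2048 resident threads per SM.
// 160 bytes (kernelMoveAndMark above): 26,214,400 bytes = 25 MiB.
static_assert(estimateLocalMemoryBytes(160, 80, 2048) == 25ull * 1024 * 1024, "25 MiB");
// 336 bytes (the FDTD/PML kernel): 55,050,240 bytes = 52.5 MiB.
static_assert(estimateLocalMemoryBytes(336, 80, 2048) == 55050240ull, "52.5 MiB");
```

Under these assumptions, a single kernel with a few hundred bytes of stack frame already takes a noticeable share of the 300 MiB that is left free.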
@sbastrakov @psychocoderHPC Any progress or plans for progress here?
There are still some kernels (e.g. boundary algorithms) using stack frames that we should fix. There is no fixed plan for when this will be done.
@psychocoderHPC could you write here the commands to get the stack frames and registers information? Both for me as I've forgotten, and to document if someone else will need it.
pic-build -f -c "-Dalpaka_CUDA_SHOW_REGISTER=ON -Dalpaka_CUDA_KEEP_FILES=ON -Dalpaka_CUDA_SHOW_CODELINES=ON" 2>&1 | tee reg.txt
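The resulting `reg.txt` is long, so it can help to filter it for non-zero stack frames. Below is a small sketch of a line parser, assuming the ptxas line format shown earlier in this thread (`160 bytes stack frame, 0 bytes spill stores, ...`); `parseStackFrameBytes` is a hypothetical helper, not part of PIConGPU or the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the stack frame size reported on a ptxas properties line,
// or -1 if the line does not contain such a report.
inline long parseStackFrameBytes(const std::string& line)
{
    const std::string marker = " bytes stack frame";
    const std::size_t markerPos = line.find(marker);
    if(markerPos == std::string::npos)
        return -1;

    // Walk backwards over the digits directly in front of the marker.
    std::size_t begin = markerPos;
    while(begin > 0 && line[begin - 1] >= '0' && line[begin - 1] <= '9')
        --begin;
    if(begin == markerPos)
        return -1; // marker present but no number in front of it

    return std::stol(line.substr(begin, markerPos - begin));
}
```

Reading `reg.txt` with `std::getline` and printing the preceding `Function properties for ...` line whenever this returns a value greater than zero lists the kernels that still need attention.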