Out-of-memory in multiple GPU mode, ROCm 4.3.1 on AMD MI 100 GPUs
Hi
I am able to run PIConGPU (dev branch) on our AMD MI 100 GPU cluster, but only in single-GPU mode.
As soon as I try to run the code in multi-GPU mode with more MPI tasks, the PIConGPU process is killed by the OS and
the SLURM scheduler reports Out-Of-Memory errors:
slurmstepd: error: Detected 74 oom-kill event(s) in step 36403717.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: lxbk1122: task 0: Out Of Memory
The out-of-memory failure always happens just after the program initialisation:
PIConGPU: 0.7.0-dev
Build-Type: Release
Third party:
OS: Linux-3.10.0-1160.31.1.el7.x86_64
arch: x86_64
CXX: Clang (13.0.0)
CMake: 3.20.5
Boost: 1.75.0
MPI:
standard: 3.1
flavor: OpenMPI (4.0.3)
PNGwriter: 0.7.0
openPMD: 0.14.3
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00229 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
Estimates are based on DensityRatio to BASE_DENSITY of each species
(see: density.param, speciesDefinition.param).
It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.0247974)
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 23592960
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 360000 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 360000 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0 keV and corresponding Debye length 0 m.
The grid has 0 cells per average Debye length
initialization time: 6sec 281msec = 6.281 sec
This is the GPU mapping I used to submit picongpu:
#SBATCH -J pog_1
#SBATCH -o /lustre/rz/dbertini/gpu/data/lwfa_002/pog_1_%j.out
#SBATCH -e /lustre/rz/dbertini/gpu/data/lwfa_002/pog_1_%j.err
#SBATCH -D /lustre/rz/dbertini/gpu/data/lwfa_002/
#SBATCH --partition gpu
#SBATCH --gres=gpu:8 # number of GPUs per node
#SBATCH -t 7-00:00:00
#SBATCH --nodes=1 # nb of nodes
#SBATCH --ntasks=8 # nb of MPI tasks
#SBATCH --cpus-per-task=4 #CPU core per MPI processes
#SBATCH --gpu-bind=closest
#SBATCH --mem=64G
and the picongpu options corresponding to this mapping are:
/lustre/rz/dbertini/gpu/data/lwfa_002/input/bin/picongpu -d 2 4 1 -g 192 2048 240 -s 4000 -m --windowMovePoint 0.9 --e_png.period 100 --e_png.axis yx --e_png.slicePoint 0.5 --e_png.folder pngElectronsYX --e_energyHistogram.period 100 --e_energyHistogram.binCount 1024 --e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000 --e_energyHistogram.filter all --e_phaseSpace.period 100 --e_phaseSpace.space y --e_phaseSpace.momentum py --e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0 --e_phaseSpace.filter all --e_macroParticlesCount.period 100 --openPMD.period 100 --openPMD.file simData --openPMD.ext bp --checkpoint.backend openPMD --checkpoint.period 100 --versionOnce | tee output
Something seems to be wrong in the definition of this mapping. Any ideas what could be wrong here?
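For what it is worth, a quick shell check of the decomposition (this only assumes that -g is split across devices according to -d, which is my understanding of the mapping):
# per-device subgrid for -d 2 4 1 and -g 192 2048 240
echo $(( 192 / 2 )) $(( 2048 / 4 )) $(( 240 / 1 ))   # -> 96 512 240 cells per GPU
So each dimension divides evenly and each GPU gets a 96 x 512 x 240 subgrid.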
Hello @denisbertini.
So PIConGPU seems to be starting and crashing soon after initialization. However, particles are generated and the Debye length check involves a kernel, so it is not literally the first memory allocation or kernel launch that fails. Could you try increasing the reserved memory size here?
This error at this point sounds strangely familiar, but I couldn't find it right away.
I should change the line
constexpr size_t reservedGpuMemorySize = 350 * 1024 * 1024;
but to what value?
This is the amount of memory PIConGPU leaves free on each GPU. For the sake of testing, please try a very small grid size in your .cfg file and leave e.g. 1 GB per GPU, i.e. 1024 * 1024 * 1024.
Note that you need to rebuild after changing that file, same as any other .param.
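A rough sketch of that workflow, assuming the usual pic-create project layout and the pic-build wrapper (adjust paths to your project):
# set reservedGpuMemorySize to 1024 * 1024 * 1024 in the project's copy of memory.param
$EDITOR include/picongpu/param/memory.param
# rebuild so the changed .param takes effect
pic-build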
Also, your #SBATCH --mem=64G appears way too low. As far as I can see, it applies to the whole node with its 8 GPUs. Normally with PIConGPU the allocated host memory should be at least as large as the combined memory of all GPUs used on a node. In case there is relatively little host memory on the system, could you also try using fewer GPUs to check whether this is the problem?
Now that I think of it, the 64 GB of requested host memory may well be causing this issue.
We always allocate all GPU memory except reservedGpuMemorySize, regardless of the actual simulation size. And the host generally needs at least the same amount of memory per GPU, so that the host-device buffers can exist. So with 8 MI 100 GPUs one needs 8 x 32 GB = 256 GB of host memory, I guess? Or use fewer GPUs to match the available host memory.
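For example, the request could look like this (just a sketch, assuming 32 GB MI 100 GPUs and the one-task-per-GPU mapping above; which form your SLURM installation prefers may differ):
#SBATCH --mem=256G            # whole node: 8 GPUs x 32 GB of host-side buffers
# or, per GPU:
# #SBATCH --mem-per-gpu=32G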
I vaguely remember an early issue on AMD where it looked like only 50% of the GPU memory could be allocated. Was it this one you remembered, @sbastrakov?
Ah yes, that one! So now there are two independent things to investigate: the allocated host memory size and the device one.
So:
- I changed reservedGpuMemorySize to 1G and rebuilt; the picongpu process still ran out of memory.
- Then I increased the available memory per node: host memory to 128G instead of 64G, and now it seems to work for 2 MPI processes with a relatively small grid size (48 96 48).
But I am still confused by the GPU mapping for PIConGPU. I adapted the template I found in picongpu/etc/spock-ornl, since the hardware setup is more or less the same and the script uses sbatch as a scheduler. I modified the .tpl file, though, in order to launch picongpu within a Singularity container, and it works fine for me now. (BTW, we could add this to your list of setup examples in picongpu/etc. It could help users who want to use Docker or Singularity to run picongpu.) So now I can run with different configurations, 1.cfg and 2.cfg, and I also tried 4.cfg, all on one node for the moment. My sbatch definition in my TBG template .tpl is the following:
#SBATCH --partition=!TBG_queue
#SBATCH --time=!TBG_wallTime
# Sets batch job's name
#SBATCH --job-name=!TBG_jobName
#SBATCH --nodes=!TBG_nodes # Nb of nodes
#SBATCH --ntasks=!TBG_tasks # Nb of MPI tasks
#SBATCH --gres=gpu:!TBG_tasks
# SBATCH --ntasks-per-node=!TBG_devicesPerNode
# #SBATCH --mincpus=!TBG_mpiTasksPerNode
# #SBATCH --cpus-per-task=!TBG_coresPerGPU # CPU Cores per MPI process
#SBATCH --mem=128G # Requested Total Job Memory / Node
# #SBATCH --mem-per-gpu=!TBG_memPerDevice
# #SBATCH --gpu-bind=closest
#SBATCH --mail-type=!TBG_mailSettings
#SBATCH --mail-user=!TBG_mailAddress
#SBATCH --chdir=!TBG_dstPath
#SBATCH -o pog_%j.out
#SBATCH -e pog_%j.err
You see that I added the allocation via --gres=gpu:n_gpus there and commented out some other options that I do not use for the GPU mapping definition.
Using such a definition I ran the following job on our cluster (the 4.cfg config is used):
JobName : lwfa_015 singularity
Submit : 2022-01-11 14:29:39 2022-01-11 14:29:41
Start : 2022-01-11 14:29:40 2022-01-11 14:29:41
End : Unknown Unknown
UserCPU : 00:00:00 00:00:00
TotalCPU : 00:00:00 00:00:00
JobID : 36453737 36453737.0
JobIDRaw : 36453737 36453737.0
JobName : lwfa_015 singularity
Partition : gpu
NTasks : 4
AllocCPUS : 64 4
Elapsed : 00:16:37 00:16:36
State : RUNNING RUNNING
ExitCode : 0:0 0:0
AveCPUFreq : 0
ReqCPUFreqMin : Unknown Unknown
ReqCPUFreqMax : Unknown Unknown
ReqCPUFreqGov : Unknown Unknown
ReqMem : 128Gn 128Gn
ConsumedEnergy : 0
AllocGRES : gpu:4 gpu:4
ReqGRES : gpu:0 gpu:0
ReqTRES : billing=4+
AllocTRES : billing=1+ cpu=4,gre+
TotalReqMem : 512 GB 512 GB
Here you see that the 4 GPUs are allocated and that there are indeed 4 tasks and 64 allocated CPUs.
My questions:
- Does this output sound correct to you?
- Should the options I commented out be added?
- Is my assumption of 1 MPI task per GPU device correct?
- To run on multiple nodes, should I now just increase the number of tasks? If yes, I cannot use the --gres option anymore... is such an option relevant for picongpu?
Another question: how do I define the number of macroparticles per cell in picongpu, and what is the default?
Another problem: trying to run with the standard 4.cfg definition file I got a crash in OpenMPI:
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:20216] *** An error occurred in MPI_Isend
[lxbk1122:20216] *** reported by process [4080535318,3]
[lxbk1122:20216] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
[lxbk1122:20216] *** MPI_ERR_OTHER: known error not in list
[lxbk1122:20216] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1122:20216] *** and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 36455560.0 ON lxbk1122 CANCELLED AT 2022-01-11T15:13:17 ***
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:20217] *** An error occurred in MPI_Isend
[lxbk1122:20217] *** reported by process [4080535318,1]
[lxbk1122:20217] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
Any idea what the problem with the transport is?
For OpenMPI I used the recommended options:
# setup openMPI
export PMIX_MCA_gds=^ds21
export OMPI_MCA_io=^ompio
export OMPI_MCA_mpi_leave_pinned=0
export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_rdma_pipeline_send_length=100000000
export OMPI_MCA_btl_openib_rdma_pipeline_frag_size=100000000
Forget about my last MPI noise, the proper openib settings just needed to be added!
Well, when I increase the grid size, I still get the OpenMPI crash:
Message size 1223060560 bigger than supported by selected transport. Max = 1073741824
[lxbk1122:24195] *** An error occurred in MPI_Isend
[lxbk1122:24195] *** reported by process [3365471779,2]
[lxbk1122:24195] *** on communicator MPI COMMUNICATOR 16 SPLIT_TYPE FROM 15
[lxbk1122:24195] *** MPI_ERR_OTHER: known error not in list
[lxbk1122:24195] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lxbk1122:24195] *** and potentially your MPI job)
Is there a way to overcome this MPI limit?
Thanks for the detailed description. Let me reply to your points separately.
I changed reservedGpuMemorySize to 1G and rebuilt. Still the picongpu process runs out of memory.
So I think reservedGpuMemorySize is now cleared of suspicion and can be reverted to our default value.
host memory to 128G instead of 64G and now it seems to work for 2 MPI processes with a relatively small grid size
Okay, so the host memory size may be the issue. As I mentioned above, the grid size should not matter for our memory allocation, since we try to take everything but reservedGpuMemorySize for any simulation. I wanted to try a small size first just to rule this out.
picongpu/etc/spock-ornl since the hardware setup is more or less the same and the script uses sbatch as a scheduler. I modified the .tpl file though in order to launch picongpu within a Singularity container. And it works fine for me now.
Makes sense.
(BTW, we could add this to your list of setup examples in picongpu/etc. It could help users who want to use Docker or Singularity to run picongpu.)
We have some docs about Docker here. But sure, the Singularity use case should be documented as well, and perhaps some general info about using PIConGPU with containers could be added too.
You see that I added the allocation via --gres=gpu:n_gpus there and commented out some other options that I do not use for the GPU mapping definition.
In my experience, SLURM can be configured differently on different systems, e.g. regarding which subset of its redundant set of variables is used. This is one of the reasons we try to isolate it in .tpl files. Of course, it comes at the price that the first user on a system has to figure it out and set it up.
Is my assumption of 1 MPI task per GPU device correct?
Correct, this is the only mode we support (or, to be more precise, we run 1 MPI process per device that is exposed as a GPU to the job). Your general output seems okay to me. I am no expert here and, again, the right subset of SLURM variables generally needs to be figured out for each system.
To run on multiple nodes, should I now just increase the number of tasks? If yes, I cannot use the --gres option anymore... is such an option relevant for picongpu?
Yes, increase the number of tasks; the number of nodes will be calculated from it by the contents of the .tpl file (see the sketch below). I am not sure what the problem with --gres is, as it specifies requirements per node. Again, what you have in the .tpl should already handle this properly, and if there is a problem we can adjust for it.
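Typically the node count is derived in the .tpl along these lines (a sketch following the TBG naming convention; the exact variable names in your template may differ):
# GPUs per node on the system, and the number of nodes rounded up from the total task count
.TBG_devicesPerNode=8
.TBG_nodes="$(( ( TBG_tasks + TBG_devicesPerNode - 1 ) / TBG_devicesPerNode ))"
The #SBATCH --nodes=!TBG_nodes line in your header then picks that value up.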
Could you also attach your current .tpl version so that we are on the same page?
Another question: how do I define the number of macroparticles per cell in picongpu, and what is the default?
It is set by the user. The naming and location are admittedly not obvious and can be improved. To use our LWFA example: when initializing a species, constructs like this are usually used. The second template parameter, in that case startPosition::Random2ppc, defines both the number of macroparticles per cell and how they are initially distributed inside a cell. We try to name this type accordingly, but it is merely an alias. It is defined in particle.param, for the LWFA example here. By changing numParticlesPerCell there you can control the ppc. There is also a doc page on macroparticle sampling here.
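As a cross-check against your first log above (a quick sketch; it assumes the LWFA default of 2 macroparticles per cell via Random2ppc):
# global grid 192 x 2048 x 240 split over -d 2 4 1 = 8 devices, 2 macroparticles per cell
echo $(( 192 * 2048 * 240 / 8 * 2 ))   # -> 23592960, matching "macro particles per device"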
I just modified the famous "system-dependent" SLURM variables again and now I am able to run picongpu with the full 8 GPUs on one node.
It seems to run well if the grid size is adjusted so as not to trigger the OpenMPI crash I mentioned above.
I also attach my current .tpl template:
virgo.tpl.txt
BTW, feel free to correct/change things in the template.
Could you also attach the .cfg file that triggers that OpenMPI error? Then we can figure out whether PIConGPU should be sending this amount of data at all.
Just as a stupid, but quick, thing to try: we normally only use export OMPI_MCA_io=^ompio and no other OpenMPI settings in our profiles. Does the issue persist in that case?
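In other words, a minimal profile would keep only this line and drop (or comment out) the openib-related exports listed above (a sketch to adapt to your own profile):
# setup OpenMPI: only exclude OMPIO, leave everything else at its defaults
export OMPI_MCA_io=^ompio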
The cfg that triggers the crash in OpenMPI is the following:
# Copyright 2013-2021 Axel Huebl, Rene Widera, Felix Schmitt, Franz Poeschel
#
# This file is part of PIConGPU.
#
# PIConGPU is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# PIConGPU is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with PIConGPU.
# If not, see <http://www.gnu.org/licenses/>.
#
##
## This configuration file is used by PIConGPU's TBG tool to create a
## batch script for PIConGPU runs. For a detailed description of PIConGPU
## configuration files including all available variables, see
##
## docs/TBG_macros.cfg
##
#################################
## Section: Required Variables ##
#################################
TBG_wallTime="2:00:00"
TBG_devices_x=2
TBG_devices_y=4
TBG_devices_z=1
#TBG_gridSize="192 2048 160"
TBG_gridSize="192 1024 160"
TBG_steps="4000"
# leave TBG_movingWindow empty to disable moving window
TBG_movingWindow="-m --windowMovePoint 0.9"
#################################
## Section: Optional Variables ##
#################################
# png image output (rough electron density and laser preview)
TBG_pngYX="--e_png.period 100 \
--e_png.axis yx --e_png.slicePoint 0.5 \
--e_png.folder pngElectronsYX"
# energy histogram (electrons, [keV])
TBG_e_histogram="--e_energyHistogram.period 100 \
--e_energyHistogram.binCount 1024 \
--e_energyHistogram.minEnergy 0 --e_energyHistogram.maxEnergy 1000 \
--e_energyHistogram.filter all"
# longitudinal phase space (electrons, [m_e c])
TBG_e_PSypy="--e_phaseSpace.period 100 \
--e_phaseSpace.space y --e_phaseSpace.momentum py \
--e_phaseSpace.min -1.0 --e_phaseSpace.max 1.0 \
--e_phaseSpace.filter all"
TBG_openPMD="--openPMD.period 100 \
--openPMD.file simData \
--openPMD.ext bp \
--checkpoint.backend openPMD \
--checkpoint.period 100 \
--checkpoint.restart.backend openPMD"
# macro particle counter (electrons, debug information for memory)
TBG_e_macroCount="--e_macroParticlesCount.period 100"
TBG_plugins="!TBG_pngYX \
!TBG_e_histogram \
!TBG_e_PSypy \
!TBG_e_macroCount \
!TBG_openPMD"
#################################
## Section: Program Parameters ##
#################################
TBG_deviceDist="!TBG_devices_x !TBG_devices_y !TBG_devices_z"
TBG_programParams="-d !TBG_deviceDist \
-g !TBG_gridSize \
-s !TBG_steps \
!TBG_movingWindow \
!TBG_plugins \
--versionOnce"
# TOTAL number of devices
TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))"
"$TBG_cfgPath"/submitAction.sh
I just commented out the grid size that generates the crash and reduced the Y dimension by a factor of 2 to overcome the OpenMPI limitation.
The output of the simulation using the modified 8.cfg seems to be OK:
Running program...
PIConGPU: 0.7.0-dev
Build-Type: Release
Third party:
OS: Linux-3.10.0-1160.31.1.el7.x86_64
arch: x86_64
CXX: Clang (13.0.0)
CMake: 3.20.5
Boost: 1.75.0
MPI:
standard: 3.1
flavor: OpenMPI (4.0.3)
PNGwriter: 0.7.0
openPMD: 0.14.3
PIConGPUVerbose PHYSICS(1) | Sliding Window is ON
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00229 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
Estimates are based on DensityRatio to BASE_DENSITY of each species
(see: density.param, speciesDefinition.param).
It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.0247974)
PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
PIConGPUVerbose PHYSICS(1) | macro particles per device: 7864320
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 117120 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 117120 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0 keV and corresponding Debye length 0 m.
The grid has 0 cells per average Debye length
initialization time: 4sec 332msec = 4.332 sec
0 % = 0 | time elapsed: 17sec 449msec | avg time per step: 0msec
5 % = 200 | time elapsed: 31sec 758msec | avg time per step: 14msec
10 % = 400 | time elapsed: 58sec 329msec | avg time per step: 13msec
15 % = 600 | time elapsed: 1min 26sec 395msec | avg time per step: 16msec
20 % = 800 | time elapsed: 1min 56sec 448msec | avg time per step: 19msec
25 % = 1000 | time elapsed: 2min 26sec 663msec | avg time per step: 28msec
30 % = 1200 | time elapsed: 2min 59sec 763msec | avg time per step: 23msec
35 % = 1400 | time elapsed: 3min 29sec 455msec | avg time per step: 25msec
40 % = 1600 | time elapsed: 4min 1sec 669msec | avg time per step: 25msec
45 % = 1800 | time elapsed: 4min 33sec 885msec | avg time per step: 27msec
50 % = 2000 | time elapsed: 5min 6sec 221msec | avg time per step: 33msec
55 % = 2200 | time elapsed: 5min 38sec 681msec | avg time per step: 29msec
60 % = 2400 | time elapsed: 6min 12sec 680msec | avg time per step: 28msec
65 % = 2600 | time elapsed: 6min 44sec 568msec | avg time per step: 26msec
70 % = 2800 | time elapsed: 7min 14sec 657msec | avg time per step: 22msec
75 % = 3000 | time elapsed: 7min 44sec 405msec | avg time per step: 28msec
80 % = 3200 | time elapsed: 8min 12sec 640msec | avg time per step: 22msec
85 % = 3400 | time elapsed: 8min 41sec 251msec | avg time per step: 21msec
90 % = 3600 | time elapsed: 9min 17sec 513msec | avg time per step: 21msec
95 % = 3800 | time elapsed: 9min 46sec 544msec | avg time per step: 21msec
100 % = 4000 | time elapsed: 10min 14sec 462msec | avg time per step: 20msec
calculation simulation time: 10min 26sec 649msec = 626.649 sec
full simulation time: 10min 31sec 765msec = 631.765 sec
Since I do not have any performance reference: is a wall time of 10 min 31 sec OK for such a simulation?
Thanks. I will now do some back-of-the-envelope estimates of how the communication should happen.
Regarding performance, I do not have a reference in my head for LWFA. You can try running our benchmark setup, for which we have an idea of how it should perform.
OK, thanks!
Do you also know if there is some documentation related to this LWFA example?
If you mean specifically for LWFA, there is only a small doc page here. In case there are some physics questions, my colleagues could help (I am a computer scientist).
OK, thanks a lot!
BTW, this is exactly the hardware setup we have, just with double the GPUs/RAM/cores. I was trying to use it to find out the optimal options for SLURM, but if you can help, I would appreciate it!
So for that .cfg file there is no way PIConGPU should be attempting to send a message of size 1223060560. It could be the result of some error in PIConGPU, an issue in OpenMPI, or a misreported issue. In all three cases it's weird that we never saw it before. To investigate further, you could rebuild PIConGPU in debug mode as described here, with 127 for PIC_VERBOSE and PMACC_VERBOSE, run, and attach stdout and stderr. Then we may be able to see the message sizes PIConGPU requested to send.
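A sketch of how such a rebuild could look (the exact wrapper flag and CMake option names are assumptions on my side, please cross-check with the linked debugging docs):
# reconfigure with a debug build and maximum PIConGPU/PMacc log verbosity, then rebuild
pic-build -c "-DCMAKE_BUILD_TYPE=Debug -DPIC_VERBOSE_LVL=127 -DPMACC_VERBOSE_LVL=127"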
BTW, this is exactly the hardware setup we have, just with double the GPUs/RAM/cores. I was trying to use it to find out the optimal options for SLURM, but if you can help, I would appreciate it!
Ideally, your system documentation or admins should provide the recommended ways of submitting jobs. We normally start from there when setting up PIConGPU on a new system, and then adjust or open support tickets when something does not work (depending, of course, on the IT infrastructure and workforce). In case there is none, the linked docs of a similar system are a good start. I think that, generally, once one has a working configuration that allows running jobs, it is most reasonable to first make sure MPI and all needed dependencies (openPMD API etc.) work fine. The .tpl file can be refined later as well.
Sure, and I think I will discover more things along the way.