3D model: issue with HPC on Segmentation Violation
Dear UWGeodynamics team,
We installed UWG on our HPC (a Cray XC40 with 36 cabinets) and we were able to run the tutorials, but we got the following error message when we tried to run a 3D model:
[97]PETSC ERROR: ------------------------------------------------------------------------
[97]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[97]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[97]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[97]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[97]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
[97]PETSC ERROR: to get more information on the crash.
[97]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[97]PETSC ERROR: Signal received
[97]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[97]PETSC ERROR: Petsc Release Version 3.16.1, Nov 01, 2021
[97]PETSC ERROR: 2.5D_north_150.py on a named nid00045 by x_aldaajt Tue Sep 6 06:48:35 2022
[364]PETSC ERROR: ------------------------------------------------------------------------
[364]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[364]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[364]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[364]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[364]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[364]PETSC ERROR: to get more information on the crash.
[364]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[364]PETSC ERROR: Signal received
[364]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[364]PETSC ERROR: Petsc Release Version 3.16.1, Nov 01, 2021
[364]PETSC ERROR: 2.5D_north_150.py on a named nid00053 by x_aldaajt Tue Sep 6 06:48:35 2022
[97]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[97]PETSC ERROR: #1 User provided function() at unknown file:0
[97]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
[364]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[364]PETSC ERROR: #1 User provided function() at unknown file:0
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 97
[96]PETSC ERROR: ------------------------------------------------------------------------
[96]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[96]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[96]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[96]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[96]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[96]PETSC ERROR: to get more information on the crash.
[96]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[96]PETSC ERROR: Signal received
[96]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[96]PETSC ERROR: Petsc Release Version 3.16.1, Nov 01, 2021
[96]PETSC ERROR: 2.5D_north_150.py on a named nid00045 by x_aldaajt Tue Sep 6 06:48:35 2022
[96]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[96]PETSC ERROR: #1 User provided function() at unknown file:0
[96]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 96
[364]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash
Any suggestions?
Hi @totaibi,
How big is the 3D model, i.e. the number of elements? Is the model in the repository?
How many CPUs are you using? 3D models generally require high CPU counts.
Finally, make sure you're not using the mumps solver for 3D models (it's a direct solve method and doesn't scale well in 3D).
It's encouraging that you could run the tutorials. Given this information, I'd suggest the CPU resources are too low for the 3D model you're trying to run.
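(As an aside, the solver method is normally a one-line setting in the model script. The sketch below is hedged: it assumes the `Model.solver.set_inner_method` interface used in the UWGeodynamics tutorials, and the geometry and penalty values are placeholders rather than the real 2.5D_north_150.py setup.)

```python
import UWGeodynamics as GEO

u = GEO.UnitRegistry

# Placeholder 3D geometry for illustration only.
Model = GEO.Model(elementRes=(64, 64, 32),
                  minCoord=(0. * u.km, 0. * u.km, -150. * u.km),
                  maxCoord=(400. * u.km, 400. * u.km, 10. * u.km),
                  gravity=(0., 0., -9.81 * u.m / u.s**2))

# ... materials, rheologies, boundary conditions, Model.init_model() ...

# Use an iterative inner solve (multigrid) rather than the direct "mumps"
# solver, which does not scale well for large 3D problems.
Model.solver.set_inner_method("mg")
Model.solver.set_penalty(1e6)   # penalty value is illustrative
```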
Hi @julesghub
The elementRes is (2048, 256, 512) and I'm using 1024 CPUs with the "mg" solver. Do you suggest another solver?
The model is not in the repository.
That's a huge model! I'd guess you're running out of memory, but it's hard to say without knowing the RAM available on the HPC.
Can you run some scaling tests to get an idea of the model's resource consumption? By "scaling tests" I mean not full model runs, just a few timesteps. I'd set up the tests by dividing the elementRes sizes by 8, 4, 2, etc. and see what works and what doesn't.
From that information, try to get an idea of how the model scales with CPU and memory resources.
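(A minimal sketch of one such scaling test, assuming the same UWGeodynamics Model API as above; the domain extents are placeholders and the real materials/BCs would come from the production script.)

```python
import UWGeodynamics as GEO

u = GEO.UnitRegistry

# Divide the production resolution by a factor (8, 4, 2, ...) and time only
# a couple of steps to see how resources scale.
factor = 4
full_res = (2048, 256, 512)                     # production elementRes
test_res = tuple(r // factor for r in full_res)

Model = GEO.Model(elementRes=test_res,
                  minCoord=(0. * u.km, 0. * u.km, -660. * u.km),
                  maxCoord=(4000. * u.km, 500. * u.km, 20. * u.km),
                  gravity=(0., 0., -9.81 * u.m / u.s**2))

# ... same materials, rheologies and boundary conditions as the full model ...

Model.run_for(nstep=2)   # a few timesteps is enough for a timing test
```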
I scaled down the resolution by a factor of two and did not get an error message, but the code runs very slowly even though I'm using 1024 CPUs.
This is the output I got after running the code for about an hour and a half:
loaded rc file /opt/venv/lib/python3.9/site-packages/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
Global element size: 1024x128x256
Local offset of rank 0: 0x0x0
Local range of rank 0: 32x32x32
In func WeightsCalculator_CalculateAll(): for swarm "LM1Q1B5D__swarm"
done 33% (10923 cells)...
done 67% (21846 cells)...
done 100% (32768 cells)...
WeightsCalculator_CalculateAll(): finished update of weights for swarm "LM1Q1B5D__swarm"
In SystemLinearEquations_NonLinearExecute
Non linear solver - iteration 0
Linear solver (HVQ5H720__system-execute)
Linear solver (HVQ5H720__system-execute), solution time 1.003619e+02 (secs)
Non linear solver - iteration 1
Linear solver (HVQ5H720__system-execute)
Linear solver (HVQ5H720__system-execute), solution time 1.047367e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 1 of 500 - Residual 0.00078699 - Tolerance = 0.01
Non linear solver - Residual 7.86986499e-04; Tolerance 1.0000e-02 - Converged - 2.084862e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Converged after 1 iterations.
Ok, that's promising that it works - so it looks like a resource issue for the model. This can often happen with 3D models.
As a general rule use this line:
Local range of rank 0: 32x32x32
as a guide to measure each scaling test.
These numbers are the local number of elements per CPU. 32^3 is generally heavy (without knowing the hardware or model parameters). Try adding more CPUs, or cutting the number of elements again, to get down to around 16^3. Is there an improvement?
In doing some scaling tests you should be able to find a sweet spot for the given model.
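(To make "local range" concrete, here is a quick back-of-the-envelope check, in plain Python and independent of Underworld, using the global element size from the run above.)

```python
# Elements per core for a few candidate decompositions: divide the global
# element size by the local range reported in the log to get the rank grid.
global_res = (1024, 128, 256)            # scaled-down global element size
for local in [(32, 32, 32), (32, 16, 16), (16, 16, 16)]:
    nx, ny, nz = (g // l for g, l in zip(global_res, local))
    cores = nx * ny * nz
    per_core = local[0] * local[1] * local[2]
    print(f"local range {local}: {cores} cores, {per_core} elements/core")

# local range (32, 32, 32): 1024 cores, 32768 elements/core
# local range (32, 16, 16): 4096 cores,  8192 elements/core
# local range (16, 16, 16): 8192 cores,  4096 elements/core
```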
I increased the CPUs to 4096 and got "Local range of rank 0: 32x16x16". It will be difficult to lower the resolution of our model beyond what we already have. Would this decomposition be workable?
Is there another way to improve the scaling, e.g. changing the solver?
Hi @julesghub,
After running the code (resolution: 1024, 128, 256) for a whole day on 8192 CPUs, I got the following output (without any errors or output files):
loaded rc file /opt/venv/lib/python3.9/site-packages/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
Global element size: 1024x128x256
Local offset of rank 0: 0x0x0
Local range of rank 0: 16x16x16
In func WeightsCalculator_CalculateAll(): for swarm "EGRM9JL7__swarm"
done 33% (1366 cells)...
done 67% (2731 cells)...
done 100% (4096 cells)...
WeightsCalculator_CalculateAll(): finished update of weights for swarm "EGRM9JL7__swarm"
In SystemLinearEquations_NonLinearExecute
Non linear solver - iteration 0
Linear solver (69JAIJ4T__system-execute)
Linear solver (69JAIJ4T__system-execute), solution time 1.016874e+02 (secs)
Non linear solver - iteration 1
Linear solver (69JAIJ4T__system-execute)
Linear solver (69JAIJ4T__system-execute), solution time 9.851009e+01 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 1 of 500 - Residual 0.00078884 - Tolerance = 0.01
Non linear solver - Residual 7.88838200e-04; Tolerance 1.0000e-02 - Converged - 2.020923e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Converged after 1 iterations.
Here is my slurm file:
#!/bin/bash
#SBATCH --job-name=NRS
#SBATCH --nodes=256
#SBATCH --ntasks=8192
#SBATCH --hint=nomultithread
#SBATCH --time=24:00:00
#SBATCH --account=k1606
#SBATCH --output=2.5D_150.out
#SBATCH --error=2.5D_150.err
module load singularity
module swap PrgEnv-cray PrgEnv-gnu
module load cray-mpich-abi/7.7.18
export SINGULARITYENV_APPEND_PATH=/opt/venv/bin:${PATH}
export SINGULARITYENV_PYTHONPATH=/opt/venv/lib/python3.9/site-packages:${PYTHONPATH}
export myRepository=/project/k1606/UWcode/singularity/UWGeodynamics
export containerImage=$PWD/uwgeodynamics_latest.sif
workload=2.5D_north_150.py
time -p srun -n ${SLURM_NTASKS} --hint=nomultithread --mpi=pmi2 singularity run $containerImage python ${workload}
Is there something wrong with my SLURM file? How can I make sure the code is actually running in parallel on the allocated CPUs?
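(One quick sanity check, sketched below under the assumption that underworld2's `uw.mpi` module is importable inside the container: print the communicator size from rank 0 at the top of the script. Note that the "Local range of rank 0: 16x16x16" line in the output above already indicates the mesh is being decomposed across ranks, so the run does appear to be parallel.)

```python
# Hypothetical check at the top of the model script: report how many MPI
# ranks the Python process sees. If this prints 1 while SLURM allocated
# thousands of tasks, the container/MPI integration is broken and every
# task is running the whole model serially.
import underworld as uw

if uw.mpi.rank == 0:
    print(f"Running on {uw.mpi.size} MPI ranks", flush=True)
```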
Hi @julesghub,
I am working with @totaibi to run this model.
The platform we are testing @totaibi's workflow on is a Cray XC40 with two-socket Intel Haswell nodes, 32 cores and 128 GB of memory per node.
I ran a scaling experiment on Tutorial 7: 3D Lithospheric Model with nsteps=1. This is a much smaller model than @totaibi's. The strong scaling didn't look good:
Num_procs | Payload per node | Wall time (s) | Average total BSSCR linear solve time (s)
---|---|---|---
32 | 32x32x16 | 5337.01 | 77.51
64 | 32x16x16 | 5140.11 | 89.81
128 | 16x16x16 | > 18000 | 574.11
256 | 16x16x8 | > 18000 | 1015.82

The last two rows did not finish due to the time limit of the SLURM job, but the trend is evident.
Is this anticipated behavior? Regards, Mohsin
Hi @mshaikh786, apologies for the huge delay in responding, and thanks for logging the test results.
Unfortunately I don't fully understand the results: was it strong scaling or weak scaling? If it was weak scaling, be aware that the discretisation resolution of the overall model changes between runs. As this model has a non-linear rheology, the solver behaviour will differ between the weak-scaled runs because the spatial resolution of those non-linearities differs.
Assuming you did strong scaling:
The original model has elementRes=(32, 32, 16), so I imagine decomposing this problem across anything more than 16 procs would incur a large communication overhead that outweighs the benefit of further decomposition.
For strong scaling of this model, try num_procs = 1, 4, 8, 12, 16.
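(A rough, hypothetical count of elements per core for this tutorial model, to make the overhead argument concrete.)

```python
# Tutorial 7 has elementRes = (32, 32, 16), i.e. 16384 elements in total.
total = 32 * 32 * 16
for cores in (1, 4, 8, 12, 16, 32, 64, 128, 256):
    print(f"{cores:4d} cores -> ~{total // cores:5d} elements/core")

# At 128 and 256 cores each rank holds only ~128 and ~64 elements, so halo
# exchange and solver communication dominate the runtime, which is
# consistent with the wall times in the table above.
```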
Dear all, if I understand correctly, what Julian is suggesting is that we run one simple 2D numerical model with element resolution (1024, 256) on, let's say, 512 CPUs and see how it behaves, because the number of CPUs used in the above model is far too high for the given resolution, which is too small.
@julesghub, I was able to run some scaling tests using the scripts in the https://github.com/underworldcode/scaling_scripts.git repository, which was used to run these tests on the Magnus supercomputer. Shaheen is a Cray XC40 like Magnus, with 128 GB of memory per node. Here is some background on the setup used for the scaling tests: the image I used was underworld2_latest converted to a Singularity image file, and I used Singularity 3.9.4 to run the tests. For the MPI, I am using PMI-2 as the process manager (srun --mpi=pmi2), as this is the only way we can work with MPICH-based containers on Shaheen. My understanding is that with PMI the container's native MPI is being used, not the host's MPICH from the cray-mpich modulefile. For some reason the runs kept segfaulting, so I reduced the number of cores per node to half, i.e. 16 per node. All the tests below were run in that configuration. Here are the results:
Strong scaling
Cores | Nodes | Global element size | Local range of rank 0 | Total Time (Runtime) sec |
---|---|---|---|---|
64 | 4 | 256x256x256 | 64x64x64 | 920.938 |
512 | 32 | 256x256x256 | 32x32x32 | 134.536 |
4096 | 256 | 256x256x256 | 16x16x16 | 92.543 |
Weak scaling
Cores | Nodes | Global element size | Local range of rank 0 | Total Time (Runtime) sec |
---|---|---|---|---|
64 | 4 | 256x256x256 | 64x64x64 | 917.01 |
512 | 32 | 512x512x512 | 64x64x64 | 1254.56 |
4096 | 256 | 1024x1024x1024 | 64x64x64 | Seg Fault |
The strong scaling looks OK for the problem size. I can try a larger problem size with a higher number of cores (to fulfil the memory requirement). I have two questions regarding weak scaling, though. First, the overhead is high when we go from 64 cores (4 nodes) to 512 cores (32 nodes). Second, the last job on 4096 cores keeps failing with the following error. I can't explain it, because the payload per node is the same as in the previous jobs. This may be related to communication timeouts, but I don't see a way to pinpoint that cause.
The error in the 4096-core weak-scaling job is as follows (it occurs for every rank, but I am only including it for one arbitrarily selected rank):
srun: error: nid05422: tasks 1312-1327: Killed
[1327]PETSC ERROR: ------------------------------------------------------------------------
[1327]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1327]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1327]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[1327]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[1327]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1327]PETSC ERROR: to get more information on the crash.
[1327]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1327]PETSC ERROR: Signal received
[1327]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1327]PETSC ERROR: Petsc Release Version 3.17.1, Apr 28, 2022
[1327]PETSC ERROR: timed_model.py on a named nid05422 by shaima0d Tue Oct 11 07:40:01 2022
[1327]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-petsc4py=1 --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --download-ctetgen --download-eigen --download-triangle --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[1327]PETSC ERROR: #1 User provided function() at unknown file:0
[1327]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Abort(59) on node 1327 (rank 1327 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1327
Happy to work with you and have a virtual meeting on this if it is needed.
Hi @mshaikh786, Sorry for only responding now - thanks again for the results.
The error message you report is a general one for memory corruption. Unfortunately I can't get a good idea of what's causing the issue from it.
I think you're right that a non-scalable overhead is causing the issue when the number of job nodes is > 256. Perhaps try a 16^3 local range on 256 nodes?
cheers, J