3D model: issue with HPC on Segmentation Violation
Dear UWGeodynamics team,
We installed UWG on our HPC (a Cray XC40 with 36 cabinets) and we were able to run the tutorials, but we got the following error message when we tried to run a 3D model:
[97]PETSC ERROR: ------------------------------------------------------------------------
[97]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[97]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[97]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[97]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[97]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
[97]PETSC ERROR: to get more information on the crash.
[97]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[97]PETSC ERROR: Signal received
[97]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[97]PETSC ERROR: Petsc Release Version 3.16.1, Nov 01, 2021
[97]PETSC ERROR: 2.5D_north_150.py on a named nid00045 by x_aldaajt Tue Sep 6 06:48:35 2022
[364]PETSC ERROR: ------------------------------------------------------------------------
[364]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[364]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[364]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[364]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[364]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[364]PETSC ERROR: to get more information on the crash.
[364]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[364]PETSC ERROR: Signal received
[364]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[364]PETSC ERROR: Petsc Release Version 3.16.1, Nov 01, 2021
[364]PETSC ERROR: 2.5D_north_150.py on a named nid00053 by x_aldaajt Tue Sep 6 06:48:35 2022
[97]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[97]PETSC ERROR: #1 User provided function() at unknown file:0
[97]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
[364]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[364]PETSC ERROR: #1 User provided function() at unknown file:0
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 97
[96]PETSC ERROR: ------------------------------------------------------------------------
[96]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[96]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[96]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[96]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[96]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[96]PETSC ERROR: to get more information on the crash.
[96]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[96]PETSC ERROR: Signal received
[96]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[96]PETSC ERROR: Petsc Release Version 3.16.1, Nov 01, 2021
[96]PETSC ERROR: 2.5D_north_150.py on a named nid00045 by x_aldaajt Tue Sep 6 06:48:35 2022
[96]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[96]PETSC ERROR: #1 User provided function() at unknown file:0
[96]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 96
[364]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash
Any suggestions?
Hi @totaibi,
How big is the 3D model, i.e. the number of elements? Is the model in the repository?
How many CPUs are you using? 3D models generally require high CPU counts.
Finally, make sure you're not using the mumps solver for 3D models (it's a direct solve method and doesn't scale well in 3D).
It's encouraging that you could run the tutorials. Given this information, I'd suggest the CPU resources are too low for the 3D model you're trying to run.
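(As an aside, the solver method is normally a one-line setting in the model script. The sketch below is hedged: it assumes the `Model.solver.set_inner_method` interface used in the UWGeodynamics tutorials, and the geometry and penalty values are placeholders rather than the real 2.5D_north_150.py setup.)

```python
import UWGeodynamics as GEO

u = GEO.UnitRegistry

# Placeholder 3D geometry for illustration only.
Model = GEO.Model(elementRes=(64, 64, 32),
                  minCoord=(0. * u.km, 0. * u.km, -150. * u.km),
                  maxCoord=(400. * u.km, 400. * u.km, 10. * u.km),
                  gravity=(0., 0., -9.81 * u.m / u.s**2))

# ... materials, rheologies, boundary conditions, Model.init_model() ...

# Use an iterative inner solve (multigrid) rather than the direct "mumps"
# solver, which does not scale well for large 3D problems.
Model.solver.set_inner_method("mg")
Model.solver.set_penalty(1e6)   # penalty value is illustrative
```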
Hi @julesghub
The elementRes is (2048, 256, 512) and I'm using 1024 CPUs with the "mg" solver. Do you suggest another solver?
The model is not in the repository.
That's a huge model! I'd guess you're running out of memory, but it's hard to say without knowing the RAM available on the HPC.
Can you run some scaling tests to get an idea of the model's resource consumption? By "scaling tests" I mean not full model runs, just a few timesteps. I'd set up the tests by dividing the elementRes sizes by 8, 4, 2, etc. and see what works and what doesn't.
From that information, try to get an idea of how the model scales with CPU and memory resources.
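(A minimal sketch of one such scaling test, assuming the same UWGeodynamics Model API as above; the domain extents are placeholders and the real materials/BCs would come from the production script.)

```python
import UWGeodynamics as GEO

u = GEO.UnitRegistry

# Divide the production resolution by a factor (8, 4, 2, ...) and time only
# a couple of steps to see how resources scale.
factor = 4
full_res = (2048, 256, 512)                     # production elementRes
test_res = tuple(r // factor for r in full_res)

Model = GEO.Model(elementRes=test_res,
                  minCoord=(0. * u.km, 0. * u.km, -660. * u.km),
                  maxCoord=(4000. * u.km, 500. * u.km, 20. * u.km),
                  gravity=(0., 0., -9.81 * u.m / u.s**2))

# ... same materials, rheologies and boundary conditions as the full model ...

Model.run_for(nstep=2)   # a few timesteps is enough for a timing test
```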
I scaled down the resolution by a factor of two and did not get an error message, but the code runs very slowly even though I'm using 1024 CPUs.
This is the output I got after running the code for about an hour and a half:
loaded rc file /opt/venv/lib/python3.9/site-packages/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
Global element size: 1024x128x256
Local offset of rank 0: 0x0x0
Local range of rank 0: 32x32x32
In func WeightsCalculator_CalculateAll(): for swarm "LM1Q1B5D__swarm"
done 33% (10923 cells)...
done 67% (21846 cells)...
done 100% (32768 cells)...
WeightsCalculator_CalculateAll(): finished update of weights for swarm "LM1Q1B5D__swarm"
In SystemLinearEquations_NonLinearExecute
Non linear solver - iteration 0
Linear solver (HVQ5H720__system-execute)
Linear solver (HVQ5H720__system-execute), solution time 1.003619e+02 (secs)
Non linear solver - iteration 1
Linear solver (HVQ5H720__system-execute)
Linear solver (HVQ5H720__system-execute), solution time 1.047367e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 1 of 500 - Residual 0.00078699 - Tolerance = 0.01
Non linear solver - Residual 7.86986499e-04; Tolerance 1.0000e-02 - Converged - 2.084862e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Converged after 1 iterations.
Ok, that's promising that it works - so it looks like a resource issue for the model. This can often happen with 3D models.
As a general rule use this line:
Local range of rank 0: 32x32x32
as a guide to measure each scaling test.
These numbers are the local number of elements per CPU. 32^3 is generally heavy (without knowing the hardware or model parameters). Try adding more CPUs, or cutting the number of elements again, to get down to around 16^3. Is there an improvement?
In doing some scaling tests you should be able to find a sweet spot for the given model.
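(To make "local range" concrete, here is a quick back-of-the-envelope check, in plain Python and independent of Underworld, using the global element size from the run above.)

```python
# Elements per core for a few candidate decompositions: divide the global
# element size by the local range reported in the log to get the rank grid.
global_res = (1024, 128, 256)            # scaled-down global element size
for local in [(32, 32, 32), (32, 16, 16), (16, 16, 16)]:
    nx, ny, nz = (g // l for g, l in zip(global_res, local))
    cores = nx * ny * nz
    per_core = local[0] * local[1] * local[2]
    print(f"local range {local}: {cores} cores, {per_core} elements/core")

# local range (32, 32, 32): 1024 cores, 32768 elements/core
# local range (32, 16, 16): 4096 cores,  8192 elements/core
# local range (16, 16, 16): 8192 cores,  4096 elements/core
```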
I increased the CPUs to 4096 and got "Local range of rank 0: 32x16x16". It will be difficult to lower the resolution of our model beyond what we already have. Would this decomposition be workable?
Is there another way to improve the scaling, e.g. changing the solver?
Hi @julesghub,
After running the code (resolution: 1024, 128, 256) for a whole day on 8192 CPUs, I got the following output (without any errors or output files):
loaded rc file /opt/venv/lib/python3.9/site-packages/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
Global element size: 1024x128x256
Local offset of rank 0: 0x0x0
Local range of rank 0: 16x16x16
In func WeightsCalculator_CalculateAll(): for swarm "EGRM9JL7__swarm"
done 33% (1366 cells)...
done 67% (2731 cells)...
done 100% (4096 cells)...
WeightsCalculator_CalculateAll(): finished update of weights for swarm "EGRM9JL7__swarm"
In SystemLinearEquations_NonLinearExecute
Non linear solver - iteration 0
Linear solver (69JAIJ4T__system-execute)
Linear solver (69JAIJ4T__system-execute), solution time 1.016874e+02 (secs)
Non linear solver - iteration 1
Linear solver (69JAIJ4T__system-execute)
Linear solver (69JAIJ4T__system-execute), solution time 9.851009e+01 (secs)
In func SystemLinearEquations_NonLinearExecute: Iteration 1 of 500 - Residual 0.00078884 - Tolerance = 0.01
Non linear solver - Residual 7.88838200e-04; Tolerance 1.0000e-02 - Converged - 2.020923e+02 (secs)
In func SystemLinearEquations_NonLinearExecute: Converged after 1 iterations.
Here is my slurm file:
#!/bin/bash
#SBATCH --job-name=NRS
#SBATCH --nodes=256
#SBATCH --ntasks=8192
#SBATCH --hint=nomultithread
#SBATCH --time=24:00:00
#SBATCH --account=k1606
#SBATCH --output=2.5D_150.out
#SBATCH --error=2.5D_150.err
module load singularity
module swap PrgEnv-cray PrgEnv-gnu
module load cray-mpich-abi/7.7.18
export SINGULARITYENV_APPEND_PATH=/opt/venv/bin:${PATH}
export SINGULARITYENV_PYTHONPATH=/opt/venv/lib/python3.9/site-packages:${PYTHONPATH}
export myRepository=/project/k1606/UWcode/singularity/UWGeodynamics
export containerImage=$PWD/uwgeodynamics_latest.sif
workload=2.5D_north_150.py
time -p srun -n ${SLURM_NTASKS} --hint=nomultithread --mpi=pmi2 singularity run $containerImage python ${workload}
Is there something wrong with my SLURM file? How can I make sure the code is actually running in parallel on the allocated CPUs?
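(One quick sanity check, sketched below under the assumption that underworld2's `uw.mpi` module is importable inside the container: print the communicator size from rank 0 at the top of the script. Note that the "Local range of rank 0: 16x16x16" line in the output above already indicates the mesh is being decomposed across ranks, so the run does appear to be parallel.)

```python
# Hypothetical check at the top of the model script: report how many MPI
# ranks the Python process sees. If this prints 1 while SLURM allocated
# thousands of tasks, the container/MPI integration is broken and every
# task is running the whole model serially.
import underworld as uw

if uw.mpi.rank == 0:
    print(f"Running on {uw.mpi.size} MPI ranks", flush=True)
```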
Hi @julesghub,
I am working with @totaibi to run this model.
The platform we are testing @totaibi's workflow on is a Cray XC40 with two-socket Intel Haswell nodes, 32 cores and 128 GB of memory per node.
I ran a scaling experiment on Tutorial 7: 3D Lithospheric Model with nsteps=1. This is a much smaller model than @totaibi's. The strong scaling didn't look good:
Num_procs | Payload per node | Wall time (s) | Average total BSSCR linear solve time (s)
---|---|---|---
32 | 32x32x16 | 5337.01 | 77.51
64 | 32x16x16 | 5140.11 | 89.81
128 | 16x16x16 | > 18000 | 574.11
256 | 16x16x8 | > 18000 | 1015.82

The last two rows did not finish due to the time limit of the SLURM job, but the trend is evident.
Is this anticipated behavior? Regards, Mohsin
Hi @mshaikh786, apologies for the huge delay in responding, and thanks for logging the test results.
Unfortunately I don't fully understand the results: was it strong scaling or weak scaling? If it was weak scaling, be aware that the discretisation resolution of the overall model changes between runs. As this model has a non-linear rheology, the solver behaviour will differ between the weak-scaled runs because the spatial resolution of those non-linearities differs.
Assuming you did strong scaling:
The original model has elementRes=(32, 32, 16), so I imagine decomposing this problem across anything more than 16 procs would incur a large communication overhead that outweighs the benefit of further decomposition.
For strong scaling of this model, try num_procs = 1, 4, 8, 12, 16.
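(A rough, hypothetical count of elements per core for this tutorial model, to make the overhead argument concrete.)

```python
# Tutorial 7 has elementRes = (32, 32, 16), i.e. 16384 elements in total.
total = 32 * 32 * 16
for cores in (1, 4, 8, 12, 16, 32, 64, 128, 256):
    print(f"{cores:4d} cores -> ~{total // cores:5d} elements/core")

# At 128 and 256 cores each rank holds only ~128 and ~64 elements, so halo
# exchange and solver communication dominate the runtime, which is
# consistent with the wall times in the table above.
```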
Dear all, if I understand correctly, what Julian is suggesting is that we run one simple 2D numerical model with element resolution (1024, 256) on, let's say, 512 CPUs and see how it behaves, because the number of CPUs used in the above model is far too high for the given resolution, which is too small.
@julesghub, I was able to run some scaling tests using the scripts in the https://github.com/underworldcode/scaling_scripts.git repository, which was used to run these tests on the Magnus supercomputer. Shaheen is a Cray XC40 like Magnus, with 128 GB of memory per node. Here is some background on the setup used for the scaling tests: the image I used was underworld2_latest converted to a Singularity image file, and I used Singularity 3.9.4 to run the tests. For the MPI, I am using PMI-2 as the process manager (srun --mpi=pmi2), as this is the only way we can work with MPICH-based containers on Shaheen. My understanding is that with PMI the container's native MPI is being used, not the host's MPICH from the cray-mpich modulefile. For some reason the runs kept segfaulting, so I reduced the number of cores per node to half, i.e. 16 per node. All the tests below were run in that configuration. Here are the results:
Strong scaling
Cores | Nodes | Global element size | Local range of rank 0 | Total Time (Runtime) sec |
---|---|---|---|---|
64 | 4 | 256x256x256 | 64x64x64 | 920.938 |
512 | 32 | 256x256x256 | 32x32x32 | 134.536 |
4096 | 256 | 256x256x256 | 16x16x16 | 92.543 |
Weak scaling
Cores | Nodes | Global element size | Local range of rank 0 | Total Time (Runtime) sec |
---|---|---|---|---|
64 | 4 | 256x256x256 | 64x64x64 | 917.01 |
512 | 32 | 512x512x512 | 64x64x64 | 1254.56 |
4096 | 256 | 1024x1024x1024 | 64x64x64 | Seg Fault |
The strong scaling looks OK for the problem size. I can try a larger problem size with a higher number of cores (to fulfil the memory requirement). I have two questions regarding weak scaling, though. First, the overhead is high when we go from 64 cores (4 nodes) to 512 cores (32 nodes). Second, the last job on 4096 cores keeps failing with the following error. I can't explain it, because the payload per node is the same as in the previous jobs. This may be related to communication timeouts, but I don't see a way to pinpoint that cause.
The error in the 4096-core weak-scaling job is as follows (it occurs for every rank, but I am only including it for one arbitrarily selected rank):
srun: error: nid05422: tasks 1312-1327: Killed
[1327]PETSC ERROR: ------------------------------------------------------------------------
[1327]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[1327]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1327]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[1327]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[1327]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[1327]PETSC ERROR: to get more information on the crash.
[1327]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1327]PETSC ERROR: Signal received
[1327]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1327]PETSC ERROR: Petsc Release Version 3.17.1, Apr 28, 2022
[1327]PETSC ERROR: timed_model.py on a named nid05422 by shaima0d Tue Oct 11 07:40:01 2022
[1327]PETSC ERROR: Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-petsc4py=1 --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --download-ctetgen --download-eigen --download-triangle --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
[1327]PETSC ERROR: #1 User provided function() at unknown file:0
[1327]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Abort(59) on node 1327 (rank 1327 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 1327
Happy to work with you and have a virtual meeting on this if it is needed.
Hi @mshaikh786, Sorry for only responding now - thanks again for the results.
The error message you report is a general one for memory corruption. Unfortunately I can't get a good idea of what's causing the issue from it.
I think you're right that a non-scalable overhead is causing the issue when the number of job nodes is > 256. Perhaps try a 16^3 local range on 256 nodes?
cheers, J