MPI index error
The MPI process fails systematically for large system sizes; an overflow in an MPI index is suspected.
Can you confirm whether this was fixed by removing the memory leak in the FlowAdaptor?
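A quick back-of-the-envelope check makes the overflow plausible (assuming the D3Q19 lattice, i.e. 19 distributions per site): an 800^3 domain has 800^3 = 5.12 x 10^8 sites, so 19 x 5.12 x 10^8 ≈ 9.7 x 10^9 distribution entries, well beyond INT_MAX = 2^31 - 1 ≈ 2.1 x 10^9. Any signed 32-bit index over the distributions would therefore wrap.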
Hi @JamesEMcClure,
Hope you are doing well.
Any updates on the bug? I profiled lbpm_greyscale_simulator on an NVIDIA A100 80GB GPU and caught the failure on systems of 800^3 voxels and larger.
Hardware:
- RAM: 100 GB
- GPU: NVIDIA A100 80GB

Software:
- NVIDIA HPC SDK 23.1 bundle (OpenMPI 3.1.5 + GPU-compatible mpicxx)
- GCC 12.2.0
- CUDA 12.0.r12.0 V12.0.76
The simulations with smaller sample sizes run successfully.
Log (since the A100 80GB has ample memory, the run used a single MPI rank):
mpirun -n 1 -bind-to core lbpm_greyscale_simulator input.db
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: gn27
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4125
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
********************************************************
Running Greyscale Single Phase Permeability Calculation
********************************************************
MPI rank=0 will use GPU ID 0 / 1
voxel length = 1.780000 micron
voxel length = 1.780000 micron
Input media: 810_800_800.raw
Relabeling 3 values
oldvalue=0, newvalue =2
oldvalue=1, newvalue =1
oldvalue=2, newvalue =0
Dimensions of segmented image: 800 x 800 x 810
Reading 8-bit input data
Read segmented data from 810_800_800.raw
Label=0, Count=103208847
Label=1, Count=23152925
Label=2, Count=392038228
Distributing subdomains across 1 processors
Process grid: 1 x 1 x 1
Subdomain size: 800 x 800 x 810
Size of transition region: 0
Media porosity = 0.243753
Initialized solid phase -- Converting to Signed Distance function
Domain set.
Create ScaLBL_Communicator
Set up memory efficient layout, 126361772 | 126361792 | 522281648
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[53824,1],0] (PID 4006275)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
Unhandled exception caught:
std::bad_array_new_length
Bytes used = 3078802112
Stack Trace:
0x00000000004c84d0: lbpm_greyscale_simulator StackTrace::backtrace()
0x00000000004b5352: lbpm_greyscale_simulator rethrow()
0x00000000004b5514: lbpm_greyscale_simulator
0x00001555546b9b96: libstdc++.so.6
0x00001555546b9c01: libstdc++.so.6
0x00001555546b9e43: libstdc++.so.6
0x00001555546ae5d1: libstdc++.so.6
0x00000000004ebd02: lbpm_greyscale_simulator ScaLBL_GreyscaleModel::Create()
0x000000000046d619: lbpm_greyscale_simulator main
0x0000155553d46555: libc.so.6 __libc_start_main
0x000000000046d3ee: lbpm_greyscale_simulator
--------------------------------------------------------------------------
Since this exception is thrown in the following cases:
- array length is negative
- total size of the new array would exceed implementation-defined maximum value
- the number of initializer-clauses exceeds the number of elements to initialize
I assume that the IntArray Map in the ScaLBL_GreyscaleModel::Create method exceeds the implementation-defined maximum size (see the sketch below).
Since I'm not sure about the above, I'm posting in this thread.
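To illustrate how a 32-bit index wrap can produce exactly this exception, here is a minimal sketch, not LBPM's actual code: `Np = 126361792` is taken from the "Set up memory efficient layout" line in the log above, and the factor 19 assumes a D3Q19 lattice. `19 * 126361792 = 2,400,874,048` exceeds `INT_MAX = 2,147,483,647`, so a signed 32-bit product wraps negative, and `new double[n]` with a negative length throws `std::bad_array_new_length`:

```cpp
#include <cstdio>
#include <new>

int main() {
    // Np from the log line "Set up memory efficient layout, ... | 126361792 | ..."
    int Np = 126361792;

    // 19 distributions per site (D3Q19). 19 * 126361792 = 2400874048 > INT_MAX,
    // so this signed multiplication overflows (formally UB; on typical
    // two's-complement hardware it wraps to -1894093248).
    int n = 19 * Np;
    std::printf("19 * Np as int    = %d\n", n);

    try {
        double *dist = new double[n]; // negative length -> std::bad_array_new_length
        delete[] dist;
    } catch (const std::bad_array_new_length &) {
        std::printf("caught std::bad_array_new_length\n");
    }

    // The same arithmetic in 64 bits gives the intended count:
    size_t n64 = 19ull * static_cast<size_t>(Np);
    std::printf("19 * Np as size_t = %zu\n", n64);
    return 0;
}
```

If this is the failure mode, a fix would presumably promote such size computations to `size_t`/64-bit before multiplying; whether the wrap happens in Map or in a distribution allocation inside ScaLBL_GreyscaleModel::Create is for the developers to confirm.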