MPI index error

Open thomaram opened this issue 4 years ago • 2 comments

The MPI process fails systematically for large system sizes. Overflow in MPI index suspected.

thomaram avatar Nov 25 '21 15:11 thomaram

Can you confirm if this is fixed based on removing the memory leak in the FlowAdaptor?

JamesEMcClure avatar Dec 21 '21 20:12 JamesEMcClure

Hi @JamesEMcClure,

Hope you are doing well.

Any updates on the bug? I tried to profile lbpm_greyscale_simulator on an Nvidia A100 80GB GPU, and it fails on systems of 800^3 voxels and larger.

Hardware:
RAM: 100 GB
GPU: Nvidia A100 80GB

Software:
NVIDIA HPC SDK 23.1 bundle (OpenMPI 3.1.5 + GPU-compatible mpicxx)
GCC 12.2.0
CUDA 12.0.r12.0 V12.0.76

The simulations with smaller sample sizes run successfully.

Log: since the A100 80GB has ample memory, the run used a single MPI process.

mpirun -n 1 -bind-to core lbpm_greyscale_simulator input.db
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            gn27
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4125

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
********************************************************
Running Greyscale Single Phase Permeability Calculation 
********************************************************
MPI rank=0 will use GPU ID 0 / 1 
voxel length = 1.780000 micron 
voxel length = 1.780000 micron 
Input media: 810_800_800.raw
Relabeling 3 values
oldvalue=0, newvalue =2 
oldvalue=1, newvalue =1 
oldvalue=2, newvalue =0 
Dimensions of segmented image: 800 x 800 x 810 
Reading 8-bit input data 
Read segmented data from 810_800_800.raw 
Label=0, Count=103208847 
Label=1, Count=23152925 
Label=2, Count=392038228 
Distributing subdomains across 1 processors 
Process grid: 1 x 1 x 1 
Subdomain size: 800 x 800 x 810 
Size of transition region: 0 
Media porosity = 0.243753 
Initialized solid phase -- Converting to Signed Distance function 
Domain set.
Create ScaLBL_Communicator 
Set up memory efficient layout, 126361772 | 126361792 | 522281648 
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[53824,1],0] (PID 4006275)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
Unhandled exception caught:
   std::bad_array_new_length
Bytes used = 3078802112
Stack Trace:
 
0x00000000004c84d0:  lbpm_greyscale_simulator           StackTrace::backtrace() 
0x00000000004b5352:  lbpm_greyscale_simulator                         rethrow() 
0x00000000004b5514:  lbpm_greyscale_simulator                                   
0x00001555546b9b96:        libstdc++.so.6                                   
0x00001555546b9c01:        libstdc++.so.6                                   
0x00001555546b9e43:        libstdc++.so.6                                   
0x00001555546ae5d1:        libstdc++.so.6                                   
0x00000000004ebd02:  lbpm_greyscale_simulator   ScaLBL_GreyscaleModel::Create() 
0x000000000046d619:  lbpm_greyscale_simulator                              main 
0x0000155553d46555:             libc.so.6                 __libc_start_main 
0x000000000046d3ee:  lbpm_greyscale_simulator
--------------------------------------------------------------------------

std::bad_array_new_length is thrown in the following cases:

  1. array length is negative
  2. total size of the new array would exceed implementation-defined maximum value
  3. the number of initializer-clauses exceeds the number of elements to initialize
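The first case can be reproduced in isolation. This is only a minimal sketch and not code from LBPM: the helper name throws_bad_array_new_length is made up for illustration. It shows that an array new-expression whose runtime length is negative (for example, a length that wrapped around in 32-bit arithmetic) throws std::bad_array_new_length instead of allocating.

```cpp
#include <cstddef>
#include <new>

// Hypothetical helper (not part of LBPM): tries to allocate an int array of
// length n and reports whether std::bad_array_new_length was thrown.
bool throws_bad_array_new_length(std::ptrdiff_t n) {
    try {
        int *p = new int[n];  // a negative runtime length is rejected here
        delete[] p;
        return false;
    } catch (const std::bad_array_new_length &) {
        return true;
    }
}
```

Calling throws_bad_array_new_length(-1) should return true, while a small positive length such as 4 should return false. Because std::bad_array_new_length derives from std::bad_alloc, a handler for std::bad_alloc would also catch it.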

I assume that the IntArray Map allocated in ScaLBL_GreyscaleModel::Create exceeds the implementation-defined maximum size.

I'm not sure about the above, so I'm posting in this thread.
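To check whether a 32-bit overflow is plausible for this run, the numbers from the log can be plugged into a small calculation. This is a sketch under assumptions, not LBPM code: the helper names are hypothetical, the one-voxel halo on each side is inferred from the log (802 x 802 x 812 = 522281648, the third number on the "memory efficient layout" line), and the factor of 19 assumes a D3Q19 lattice with one value per direction per site.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: number of sites in a subdomain with an assumed
// one-voxel halo on every face, computed in 64-bit arithmetic.
constexpr std::int64_t halo_sites(std::int64_t nx, std::int64_t ny, std::int64_t nz) {
    return (nx + 2) * (ny + 2) * (nz + 2);
}

// Hypothetical helper: true if value is representable as a signed 32-bit int.
constexpr bool fits_in_int32(std::int64_t value) {
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}

// Values taken from the log above.
constexpr std::int64_t N = halo_sites(800, 800, 810);  // 522281648
constexpr std::int64_t Np = 126361792;                 // padded active-site count

// The site count itself fits in a 32-bit int ...
static_assert(fits_in_int32(N), "site count fits");
// ... but a byte count for 8-byte values, or an index over 19 values per
// site (assumed D3Q19), does not.
static_assert(!fits_in_int32(N * 8), "N * sizeof(double) exceeds INT32_MAX");
static_assert(!fits_in_int32(Np * 19), "19 * Np exceeds INT32_MAX");
```

If any of these products is computed with int operands somewhere in the allocation path, the result could wrap to a negative value and trigger the exception shown in the stack trace. Whether that actually happens in ScaLBL_GreyscaleModel::Create would need to be confirmed against the source.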

OlhinAS avatar Mar 22 '24 09:03 OlhinAS