
test/mpi: Failures in DTPools tests on 32-bit CentOS

Open raffenet opened this issue 4 years ago • 14 comments

These specific failures surfaced during the addition of GPU support to the testsuite. There is a segfault during buffer initialization inside DTPools. The environment is 32-bit CentOS with the ch3/tcp configuration.

./pt2pt/sendrecv1 2 -type=MPI_INT:4+MPI_DOUBLE:8 -sendcnt=65530 -recvcnt=65530 -seed=301 -testsize=32 -sendmem=host -recvmem=host
./rma/lockall_dt_flushlocal 4 -type=MPI_INT -count=65530 -seed=1493 -testsize=16 -origmem=host -targetmem=host -resultmem=host
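
For reference, the -type string describes the DTPools datatype under test; roughly, MPI_INT:4+MPI_DOUBLE:8 corresponds to a struct of 4 ints followed by 8 doubles. A minimal sketch of such a type (illustrative only, not the actual DTPools code; the function name and displacements are hand-picked for the sketch):

#include <mpi.h>

/* Illustrative sketch: approximately what the DTPools type string
 * "MPI_INT:4+MPI_DOUBLE:8" describes -- a struct of 4 ints followed
 * by 8 doubles. */
static MPI_Datatype make_sketch_type(void)
{
    int blocklens[2] = {4, 8};
    MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};
    MPI_Aint displs[2] = {0, 4 * (MPI_Aint) sizeof(int)};
    MPI_Datatype newtype;

    MPI_Type_create_struct(2, blocklens, displs, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;
}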

raffenet avatar Jun 15 '20 13:06 raffenet

Is this failing without GPU support?

pavanbalaji avatar Jun 15 '20 13:06 pavanbalaji

Yes.

raffenet avatar Jun 15 '20 13:06 raffenet

I should clarify: the segfault during initialization occurs only in the sendrecv1 test. The RMA test crashes during an MPI_BARRIER.

not ok  - ./rma/lockall_dt_flushlocal 4
  ---
  Directory: ./rma
  File: lockall_dt_flushlocal
  Num-procs: 4
  Timeout: 600
  Date: "Sat Jun 13 20:58:04 2020"
  ...
## Test output (expected 'No Errors'):
## Fatal error in PMPI_Barrier: Unknown error class, error stack:
## PMPI_Barrier(253)..........................: MPI_Barrier(comm=comm=0x84000005) failed
## PMPI_Barrier(239)..........................: 
## MPIR_Barrier_impl(146).....................: 
## MPIR_Barrier_intra_smp(36).................: 
## MPIR_Barrier_impl(146).....................: 
## MPIR_Barrier_intra_dissemination(43).......: 
## MPIDI_CH3U_Complete_posted_with_error(1099): Process failed
## MPIR_Barrier_intra_smp(52).................: 
## MPIR_Bcast_impl(261).......................: 
## MPIR_Bcast_intra_binomial(178).............: Failure during collective
## 
## ===================================================================================
## =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
## =   PID 392756 RUNNING AT pmrs-centos64-240-07.cels.anl.gov
## =   EXIT CODE: 11
## =   CLEANING UP REMAINING PROCESSES
## =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
## ===================================================================================
## YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
## This typically refers to a problem with your application.
## Please see the FAQ page for debugging suggestions

raffenet avatar Jun 15 '20 14:06 raffenet

@raffenet Could you update the status of this issue? In particular, which test with which config option?

hzhou avatar Sep 12 '20 16:09 hzhou

There is nothing to update at this time. The issue description contains the tests and configuration options. Are you looking for something else?

raffenet avatar Sep 15 '20 20:09 raffenet

I ask because I don't recall seeing these failures in the nightly Jenkins runs.

hzhou avatar Sep 15 '20 21:09 hzhou

They are seed-specific. The particular seed value was only used during development; it did not make it into the merge.

raffenet avatar Sep 15 '20 22:09 raffenet

> They are seed-specific. The particular seed value was only used during development; it did not make it into the merge.

I see. Yaksa is not 32-bit clean at the moment, which may be related.

hzhou avatar Sep 15 '20 22:09 hzhou

@hzhou Yaksa is 32-bit clean. Are you seeing some error on 32-bit systems?

pavanbalaji avatar Nov 06 '20 06:11 pavanbalaji

> @hzhou Yaksa is 32-bit clean. Are you seeing some error on 32-bit systems?

Yes. The Jenkins nightly tests are currently failing on centos32 because of it. See https://jenkins-pmrs.cels.anl.gov/job/mpich-main-ch4-ofi/compiler=gnu,fabric_prov=sockets,jenkins_configure=default,label=centos32/lastCompletedBuild/testReport/

The last time I traced it, it appeared that in the generated pup functions, some of the assumptions about where elements start (alignment) are incorrect on 32-bit systems.

hzhou avatar Nov 06 '20 14:11 hzhou

I don't think this is a yaksa bug. The user of yaksa (MPICH) is required to make sure the data buffer that's passed in is correctly aligned for the datatype being used. If MPICH is not doing that (maybe because it copied something into a temporary buffer), then that's incorrect usage of yaksa.

Note that yaksa has that requirement for performance reasons.

pavanbalaji avatar Nov 06 '20 15:11 pavanbalaji

I only looked at one case a few months ago. The issue is with long double. On 32-bit systems the alignment of long double is 4, so it is correct for a displacement to fall on a 4-byte boundary rather than a 16-byte one. In yaksuri_seqi_pup_blkhindx_hvector_long_double.c, indexing with array_of_displs[j1] / sizeof(long double) then becomes incorrect, because the integer division truncates the fractional part.

In general, I think we should not assume that alignment is always a multiple of sizeof.
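
To make the round-off concrete, a tiny sketch (the sizes in the comments are typical 32-bit x86 values; the same truncation happens for any displacement that is not a multiple of sizeof(long double)):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* On 32-bit x86, long double is typically 12 bytes with 4-byte
     * alignment, so a 4-byte displacement is perfectly legal. */
    uintptr_t displ = 4;

    /* Converting the byte displacement to an element index truncates:
     * 4 / sizeof(long double) == 0, so the byte offset is silently
     * lost and element 0 is packed instead of the data at offset 4. */
    uintptr_t idx = displ / sizeof(long double);
    printf("displ=%lu bytes -> element index %lu\n",
           (unsigned long) displ, (unsigned long) idx);
    return 0;
}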

hzhou avatar Nov 06 '20 15:11 hzhou

That is the responsibility of the user (MPICH). For example, if the user passes a buffer to Yaksa as a collection of integers, that buffer must be aligned for integers. The code below would be incorrect, even though some platforms, such as x86, would let the application run correctly:

int *foo = (int *) malloc(100 * sizeof(int));  /* malloc returns int-aligned memory */
int *bar = (int *) ((char *) foo + 2);         /* bar is now misaligned for int */
yaksa_ipack(bar, ...);                         /* incorrect: bar violates int alignment */

Before calling yaksa_ipack, the user must ensure that bar is correctly aligned to integer boundaries. The same logic holds true for all datatypes (including long double).
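
A minimal sketch of the kind of check a caller could apply (illustrative only; yaksa itself does not provide such a helper):

#include <stddef.h>
#include <stdint.h>
#include <stdalign.h>

/* Illustrative helper (not part of the yaksa API): returns nonzero if
 * p satisfies alignment `align`. A caller could check, e.g.,
 * is_aligned(bar, alignof(int)) before handing bar to yaksa_ipack. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t) p % (uintptr_t) align) == 0;
}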

pavanbalaji avatar Nov 06 '20 15:11 pavanbalaji

I can look at getting gcc-9 installed in the Singularity image. That should allow us to run UBSan on it and get more info on possible alignment bugs.
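
For example, a misaligned access like the one above would be reported at runtime when built with UBSan (a sketch; the flags and the exact message wording may differ between gcc versions, and the file name is illustrative):

/* Sketch: build with e.g. `gcc-9 -m32 -g -fsanitize=undefined misalign.c`
 * and run; UBSan's alignment check reports a "load of misaligned
 * address" at the dereference below. */
#include <stdlib.h>

int main(void)
{
    int *foo = malloc(100 * sizeof(int));
    int *bar = (int *) ((char *) foo + 2);   /* misaligned by 2 bytes */
    int x = *bar;                            /* UBSan flags this load */
    free(foo);
    return x;
}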

raffenet avatar Nov 06 '20 16:11 raffenet

I can no longer reproduce this with main or 4.1.2, so I am going to close.

raffenet avatar Jul 06 '23 20:07 raffenet