mpich
test/mpi: Failures in DTPools tests on 32-bit CentOS
These specific failures surfaced during the addition of GPU support to the testsuite, on 32-bit CentOS with the ch3/tcp configuration. There is a segfault during buffer initialization inside DTPools.
./pt2pt/sendrecv1 2 -type=MPI_INT:4+MPI_DOUBLE:8 -sendcnt=65530 -recvcnt=65530 -seed=301 -testsize=32 -sendmem=host -recvmem=host
./rma/lockall_dt_flushlocal 4 -type=MPI_INT -count=65530 -seed=1493 -testsize=16 -origmem=host -targetmem=host -resultmem=host
Is this failing without GPU support?
Yes.
I should say the segfault during initialization is only for the `sendrecv1` test. The RMA test is crashing during an `MPI_Barrier`.
not ok - ./rma/lockall_dt_flushlocal 4
---
Directory: ./rma
File: lockall_dt_flushlocal
Num-procs: 4
Timeout: 600
Date: "Sat Jun 13 20:58:04 2020"
...
## Test output (expected 'No Errors'):
## Fatal error in PMPI_Barrier: Unknown error class, error stack:
## PMPI_Barrier(253)..........................: MPI_Barrier(comm=comm=0x84000005) failed
## PMPI_Barrier(239)..........................:
## MPIR_Barrier_impl(146).....................:
## MPIR_Barrier_intra_smp(36).................:
## MPIR_Barrier_impl(146).....................:
## MPIR_Barrier_intra_dissemination(43).......:
## MPIDI_CH3U_Complete_posted_with_error(1099): Process failed
## MPIR_Barrier_intra_smp(52).................:
## MPIR_Bcast_impl(261).......................:
## MPIR_Bcast_intra_binomial(178).............: Failure during collective
##
## ===================================================================================
## = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
## = PID 392756 RUNNING AT pmrs-centos64-240-07.cels.anl.gov
## = EXIT CODE: 11
## = CLEANING UP REMAINING PROCESSES
## = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
## ===================================================================================
## YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
## This typically refers to a problem with your application.
## Please see the FAQ page for debugging suggestions
@raffenet Could you update the status of this issue? In particular, which test with which config option?
There is nothing to update at this time. The issue description contains the tests and configuration options. Are you looking for something else?
I wonder, since I don't recall seeing these failures in the nightly Jenkins runs.
They are seed specific. The particular seed value was only used during development. It did not make it into the merge.
I see. Yaksa is not 32-bit clean at the moment, which may be related.
@hzhou Yaksa is 32-bit clean. Are you seeing some error on 32-bit systems?
Yes. Currently the Jenkins nightly tests are failing on centos32 because of it. See https://jenkins-pmrs.cels.anl.gov/job/mpich-main-ch4-ofi/compiler=gnu,fabric_prov=sockets,jenkins_configure=default,label=centos32/lastCompletedBuild/testReport/
Last time I traced it, the problem appeared in the generated pup functions: some of the assumptions about where elements start (alignment) are incorrect on 32-bit systems.
I don't think this is a yaksa bug. The user of yaksa (MPICH) is required to make sure the data buffer that's passed in is correctly aligned for the datatype being used. If MPICH is not doing that (maybe because it copied something into a temporary buffer), then that's incorrect usage of yaksa.
Note that yaksa has that requirement for performance reasons.
I only looked at one case a few months ago. The issue is with `long double`. On 32-bit systems, the alignment of `long double` is 4, so it is correct to have e.g. a displacement at a boundary of 4 rather than 16. In `yaksuri_seqi_pup_blkhindx_hvector_long_double.c`, indexing with `array_of_displs[j1] / sizeof(long double)` becomes incorrect because the integer division rounds off the fraction.
In general, I think we should not assume the alignment is always a multiple of `sizeof`.
That is the user's (MPICH) responsibility. For example, if the user passes a buffer to Yaksa as a collection of integers, then that buffer must be aligned for integers. The below code would be incorrect, even though some platforms like x86 would allow the application to run correctly:
```c
int *foo = (int *) malloc(100 * sizeof(int));
int *bar = (int *) ((char *) foo + 2);
yaksa_ipack(bar, ...);
```
Before calling `yaksa_ipack`, the user must ensure that `bar` is correctly aligned to integer boundaries. The same logic holds true for all datatypes (including `long double`).
I can look at getting gcc-9 installed in the Singularity image. That should allow us to run UBSan on it and get some more info on possible alignment bugs.
Can no longer reproduce this with `main` or 4.1.2, so I am going to close.