mca_btl_tcp_frag_send: writev error
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Open MPI 5.0.0 (the build summary below reports 5.0.0rc16)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
I built Open MPI from the source tarball on my platform, and the build succeeded. Open MPI configuration:
Version: 5.0.0rc16
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: mpif.h, use mpi, use mpi_f08
Build MPI Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)
Miscellaneous
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: no documentation available
hwloc: external
libevent: external
Open UCC: no
pmix: internal
PRRTE: internal
Threading Package: pthreads
Transports
Cisco usNIC: yes
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no (not found)
Open UCX: no
OpenFabrics OFI Libfabric: yes (pkg-config: default search paths)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes
Accelerators
CUDA support: no
ROCm support: no
OMPIO File Systems
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)
PVFS2/OrangeFS: no
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
- Operating system/version: Loongnix (similar to CentOS 8)
- Computer hardware: Loongson platform
- Network type: TCP
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

When I run the HPL 2.3 benchmark on a cluster of 16 compute nodes (32 cores each) with the command:

mpirun -hostfile iplist -mca btl_tcp_if_include enp4s0f0 -np 512 ./xhpl

all 16 nodes start running, but partway through the run the following error occurs:
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N      : 318000
NB     : 304
PMAP   : Row-major process mapping
P      : 16
Q      : 32
PFACT  : Left
NBMIN  : 2
NDIV   : 2
RFACT  : Left
BCAST  : 1ring
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed: ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
[localhost][[5844,1],459][btl_tcp_frag.c:121:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x1232ad918, 8) Bad address(3)
[localhost:00000] *** An error occurred in Socket closed
[localhost:00000] *** reported by process [382992385,459]
[localhost:00000] *** on a NULL communicator
[localhost:00000] *** Unknown error
[localhost:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[localhost:00000] ***    and MPI will try to terminate your MPI job as well)
An MPI communication peer process has unexpectedly disconnected. This usually indicates a failure in the peer process (e.g., a crash or otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably (it may even hang or crash), the root cause of this problem is the failure of the peer -- that is what you need to investigate. For example, there may be a core file that you can examine. More generally: such peer hangups are frequently caused by application bugs or other external events.
Local host: master
Local PID:  3405
Peer host:  localhost
[root@master linpack]#
Please help me resolve this issue; I am out of ideas. Thank you very much!
Sorry, this issue got missed.
The error message is trying to indicate the usual cause of this issue:
An MPI communication peer process has unexpectedly disconnected. This usually indicates a failure in the peer process (e.g., a crash or otherwise exiting without calling MPI_FINALIZE first).
I.e., it usually means that one or more of your MPI processes has crashed. Open MPI detected the issue when one of the surviving processes tried to contact one of the dead processes and noticed, "Oh, hey, that peer process is dead!".
I don't see any other obvious error messages, so there aren't really any clues here about why one (or more?) of your MPI processes died.
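If it would help to narrow things down, here is a minimal sketch of a ring-exchange test (the file name ring.c, the message size, and the iteration count are arbitrary placeholders, not anything HPL-specific) that pushes point-to-point traffic between every pair of neighboring ranks, and therefore through the tcp BTL whenever neighbors land on different nodes:

```c
/* ring.c - minimal MPI ring-exchange sketch (illustrative only).
 * Each rank sends a buffer to the next rank and receives from the
 * previous one, exercising point-to-point traffic between all
 * neighboring ranks independent of HPL. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;   /* ~8 MB of doubles per message (arbitrary) */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        sendbuf[i] = (double)rank;

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    /* Repeat a few times so every connection is established and reused. */
    for (int iter = 0; iter < 10; iter++) {
        MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, next, 0,
                     recvbuf, count, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("ring exchange completed on %d ranks\n", size);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc ring.c -o ring and launched with the same hostfile and interface selection you used for xhpl (mpirun -hostfile iplist -mca btl_tcp_if_include enp4s0f0 -np 512 ./ring), it should either complete quietly or hit a similar writev failure, which would tell us whether the problem is in basic TCP communication on this platform or something specific to the HPL run.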
Given that some time has elapsed since we replied here, were you able to resolve the issue?