
hydra: import libpmi

Open hzhou opened this issue 3 years ago • 1 comments

Pull Request Description

This is a split from https://github.com/pmodels/mpich/pull/5860

Import libpmi into hydra to prepare for refactoring hydra to use the pmi wire utilities.

  • Modify the build to embed pmi in hydra
  • Update the pmi utilities to prepare for additional functionality needed by hydra
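
For readers unfamiliar with what "wire utilities" covers: the PMI-1 wire protocol exchanges newline-terminated lines of space-separated `key=value` tokens, such as `cmd=get kvsname=kvs_0 key=foo`. The sketch below is only an illustration of parsing that shape. The names `wire_msg`, `wire_parse`, and `wire_get` are hypothetical and are not libpmi's API, and the sketch assumes values contain no spaces or escapes, which the real utilities have to handle.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 16
#define MAX_LEN 64

/* Hypothetical container for one parsed message; not a libpmi type. */
struct wire_msg {
    int num_tokens;
    char keys[MAX_TOKENS][MAX_LEN];
    char vals[MAX_TOKENS][MAX_LEN];
};

/* Parse "key=val key=val ...\n" into msg.
 * Returns 0 on success, -1 on malformed or oversized input. */
int wire_parse(const char *line, struct wire_msg *msg)
{
    const char *p = line;

    msg->num_tokens = 0;
    while (*p) {
        /* Skip separators between tokens. */
        while (*p == ' ' || *p == '\n')
            p++;
        if (*p == '\0')
            break;
        if (msg->num_tokens == MAX_TOKENS)
            return -1;

        /* Locate the '=' that ends the key. */
        const char *eq = p;
        while (*eq && *eq != '=' && *eq != ' ' && *eq != '\n')
            eq++;
        if (*eq != '=')
            return -1;          /* token without '=' */
        size_t klen = (size_t) (eq - p);
        if (klen == 0 || klen >= MAX_LEN)
            return -1;

        /* The value runs up to the next separator (assumption: no escapes). */
        const char *end = eq + 1;
        while (*end && *end != ' ' && *end != '\n')
            end++;
        size_t vlen = (size_t) (end - (eq + 1));
        if (vlen >= MAX_LEN)
            return -1;

        int n = msg->num_tokens++;
        memcpy(msg->keys[n], p, klen);
        msg->keys[n][klen] = '\0';
        memcpy(msg->vals[n], eq + 1, vlen);
        msg->vals[n][vlen] = '\0';

        p = end;
    }
    return 0;
}

/* Return the value stored for key, or NULL if the key is absent. */
const char *wire_get(const struct wire_msg *msg, const char *key)
{
    for (int i = 0; i < msg->num_tokens; i++) {
        if (strcmp(msg->keys[i], key) == 0)
            return msg->vals[i];
    }
    return NULL;
}
```

With these helpers, parsing `"cmd=get kvsname=kvs_0 key=foo\n"` yields three tokens, and `wire_get(&msg, "cmd")` returns `"get"`. Sharing one such parser between the client library and the process manager is the kind of reuse this import is meant to enable.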

[skip warnings]

Author Checklist

  • [x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form `module: short description`. Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou, Aug 09 '22 15:08

test:mpich/pmi

1 failure with ch4-ofi-pmi2:

summary_junit_xml.1276 - ./coll/p_red 5 MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE=0 MPIR_CVAR_IREDUCE_INTRA_ALGORITHM=tsp_tree MPIR_CVAR_IREDUCE_TREE_TYPE=kary MPIR_CVAR_IREDUCE_TREE_KVAL=3 MPIR_CVAR_IREDUCE_TREE_PIPELINE_CHUNK_SIZE=4096
Failing for the past 2 builds (since unstable build [#122](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-pmi/compiler=gnu,jenkins_configure=pmi2,label=centos64_review,netmod=ch4-ofi/122/))
[Took 10 sec.](https://jenkins-pmrs.cels.anl.gov/job/mpich-review-pmi/123/compiler=gnu,jenkins_configure=pmi2,label=centos64_review,netmod=ch4-ofi/testReport/(root)/summary_junit_xml/1276_____coll_p_red_5__MPIR_CVAR_IREDUCE_DEVICE_COLLECTIVE_0_MPIR_CVAR_IREDUCE_INTRA_ALGORITHM_tsp_tree_MPIR_CVAR_IREDUCE_TREE_TYPE_kary_MPIR_CVAR_IREDUCE_TREE_KVAL_3_MPIR_CVAR_IREDUCE_TREE_PIPELINE_CHUNK_SIZE_4096/history)
Error Message

not ok 1276 - ./coll/p_red 5

Stacktrace

not ok 1276 - ./coll/p_red 5
  ---
  Directory: ./coll
  File: p_red
  Num-procs: 5
  Timeout: 180
  Date: "Tue Aug  9 14:16:46 2022"
  ...
## Test output (expected 'No Errors'):
## libfabric:118090:1660072606::psm3:av:psmx3_epid_to_epaddr():234<warn> psm3_ep_connect returned error Operation timed out, remote epid=0x901758a03:fe80000000000000:98039b03000c57fe.Try setting FI_PSM3_CONN_TIMEOUT to a larger value (current: 10 seconds).
## 
## p_red:118090 terminated with signal 6 at PC=7fe1d60b0337 SP=7ffc5ac22c18.  Backtrace:
## /lib64/libc.so.6(gsignal+0x37)[0x7fe1d60b0337]
## /lib64/libc.so.6(abort+0x148)[0x7fe1d60b1a28]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x2354aea)[0x7fe1d879caea]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x2354c6a)[0x7fe1d879cc6a]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x2368968)[0x7fe1d87b0968]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3991b1)[0x7fe1d67e11b1]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3a33e5)[0x7fe1d67eb3e5]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x350003)[0x7fe1d6798003]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3504ff)[0x7fe1d67984ff]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x35011f)[0x7fe1d679811f]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3504ff)[0x7fe1d67984ff]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x351546)[0x7fe1d6799546]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x35220d)[0x7fe1d679a20d]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x45b4bc)[0x7fe1d68a34bc]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3f4431)[0x7fe1d683c431]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3fa244)[0x7fe1d6842244]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3fa348)[0x7fe1d6842348]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(+0x3fa54b)[0x7fe1d684254b]
## /var/lib/jenkins-slave/workspace/mpich-review-pmi/compiler/gnu/jenkins_configure/pmi2/label/centos64_review/netmod/ch4-ofi/_inst/lib/libmpi.so.0(PMPI_Wait+0x27e)[0x7fe1d668aeae]
## ./p_red[0x401aba]
## /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe1d609c505]
## ./p_red[0x401be0]

I suspect this is because MPIR_pmi_barrier does not serve correctly as a barrier under PMI-2.

hzhou, Aug 09 '22 18:08

test:mpich/pmi

compiler=gnu,jenkins_configure=pmi2,label=centos64_review,netmod=ch4-ofi

summary_junit_xml.2220 - ./spawn/spawn1 1 | 3 min 0 sec | 1
summary_junit_xml.2221 - ./spawn/spawn2 1 | 3 min 0 sec | 1
summary_junit_xml.2222 - ./spawn/spawninfo1 1 | 3 min 0 sec | 1
summary_junit_xml.2223 - ./spawn/spawnminfo1 1 | 3 min 0 sec | 1
summary_junit_xml.2224 - ./spawn/spawnintra 1 | 3 min 0 sec | 1
summary_junit_xml.2226 - ./spawn/spawnargv 1 | 3 min 0 sec | 1
summary_junit_xml.2227 - ./spawn/spawnmanyarg 1 | 3 min 0 sec | 1
summary_junit_xml.2239 - ./spawn/disconnect_reconnect3 3 | 3 min 0 sec | 1
summary_junit_xml.2348 - ./f77/spawn/spawnf 1 | 3 min 0 sec | 1
summary_junit_xml.2349 - ./f77/spawn/spawnargvf 1 | 3 min 0 sec | 1
summary_junit_xml.2350 - ./f77/spawn/spawnmultf 1 | 3 min 0 sec | 1
summary_junit_xml.2505 - ./f90/spawn/spawnf90 1 | 3 min 0 sec | 1
summary_junit_xml.2506 - ./f90/spawn/spawnargvf90 1 | 3 min 0 sec | 1
summary_junit_xml.2507 - ./f90/spawn/spawnmultf90 1 | 3 min 0 sec | 1
summary_junit_xml.2510 - ./f90/spawn/spawnmultf03 1 | 3 min 0 sec | 1

The PMI2_KVS_Fence in hydra doesn't work with dynamic processes (as expected).

hzhou, Aug 11 '22 03:08

test:mpich/pmi

hzhou, Aug 11 '22 22:08

test:mpich/pmi test:mpich/ch3/most test:mpich/ch4/most

All clear ✔️

hzhou, Aug 12 '22 04:08

test:mpich/pmi

hzhou, Aug 15 '22 16:08