bug/jenkins: timeouts for noncontiguous data transfer in am path

Open pavanbalaji opened this issue 5 years ago • 2 comments

OFI's handling of noncontiguous data movement currently uses IOVs instead of packing data for RMA operations (all configurations) and send/recv operations (in the AM-only configuration). This needs to be improved to use pack/unpack when the data density is low, that is, when the average size of the contiguous buffers is small.
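To make the "data density" criterion concrete, here is a minimal sketch of the kind of check it implies. The function name `should_pack` and the 64-byte threshold are hypothetical and not taken from MPICH; a real cutoff would have to be tuned per provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff in bytes; NOT an MPICH constant. */
#define PACK_DENSITY_THRESHOLD 64

/*
 * Returns true when the average contiguous block is small enough that
 * packing into one buffer is likely cheaper than issuing one IOV entry
 * per block. total_bytes is the payload size; num_contig_blocks is the
 * number of contiguous pieces the datatype flattens into.
 */
static bool should_pack(size_t total_bytes, size_t num_contig_blocks)
{
    assert(num_contig_blocks > 0);
    return total_bytes / num_contig_blocks < PACK_DENSITY_THRESHOLD;
}
```

For example, a transfer made of many isolated 4-byte elements would take the pack path under this assumed threshold, while a 1 MiB transfer split into four blocks would stay on the IOV path.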

pavanbalaji • Jun 26 '19 21:06

We are seeing consistent failures in the direct-nm configuration that look very similar to the am-only failures reported in this issue. The details are pasted here, and we are going to assume it is the same issue. We will need to open a new issue if investigation shows it to be unrelated.

Jenkins - mpich-review-ch4-ofi - #147 - gnu,direct-nm,centos64

summary_junit_xml.1059 - ./rma/lockall_dt_flush 4 -type=MPI_INT -count=65530 -seed=209 -testsize=16
 Error Details

not ok 1059 - ./rma/lockall_dt_flush 4

 Stack Trace

not ok 1059 - ./rma/lockall_dt_flush 4
  ---
  Directory: ./rma
  File: lockall_dt_flush
  Num-procs: 4
  Timeout: 180
  Date: "Tue Aug  6 11:29:40 2019"
  ...
## Test output (expected 'No Errors'):
## [[email protected]] APPLICATION TIMED OUT, TIMEOUT = 180s

hzhou • Aug 07 '19 20:08

For send-recv AM operations, the solution is to have an alternative LMT protocol where the sender sends data in multiple segments and the receiver calls MPIR_Typerep_unpack for each segment.
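The segment-by-segment idea can be sketched in plain C, with no MPI calls. Everything here is illustrative: `pack_segment`, `unpack_segment`, `segmented_transfer`, and `SEGMENT_ELEMS` are hypothetical names, and the strided `int` layout is only a stand-in for a general noncontiguous datatype. In MPICH the receiver-side step would be MPIR_Typerep_unpack rather than the loop shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment size, in elements. */
#define SEGMENT_ELEMS 4

/* Sender side: gather n strided ints, starting at logical element
 * `first`, into a contiguous segment buffer. */
static void pack_segment(const int *src, size_t stride, size_t first,
                         size_t n, int *seg)
{
    for (size_t i = 0; i < n; i++)
        seg[i] = src[(first + i) * stride];
}

/* Receiver side: scatter one contiguous segment into the strided
 * destination. Stand-in for a per-segment unpack call. */
static void unpack_segment(const int *seg, size_t n, int *dst,
                           size_t stride, size_t first)
{
    for (size_t i = 0; i < n; i++)
        dst[(first + i) * stride] = seg[i];
}

/* Moves `count` strided ints from src to dst one segment at a time and
 * returns the number of segments used. */
static size_t segmented_transfer(const int *src, size_t src_stride,
                                 int *dst, size_t dst_stride, size_t count)
{
    int seg[SEGMENT_ELEMS];
    size_t nseg = 0;

    assert(src_stride > 0 && dst_stride > 0);
    for (size_t first = 0; first < count; first += SEGMENT_ELEMS) {
        size_t left = count - first;
        size_t n = left < SEGMENT_ELEMS ? left : SEGMENT_ELEMS;

        pack_segment(src, src_stride, first, n, seg);
        /* In a real protocol the segment would travel as one AM message here. */
        unpack_segment(seg, n, dst, dst_stride, first);
        nseg++;
    }
    return nseg;
}
```

The point of the design is that each message carries contiguous bytes, so the transport never sees a long IOV list, and the receiver needs only one segment-sized staging buffer rather than a buffer for the whole message.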

It is much easier to implement this new lmt protocol once the refactoring PR #4323 gets merged.

hzhou • Mar 30 '20 17:03