wrf.exe aborts after upgrading to WRF v4.7.0.

Open nalssi89 opened this issue 6 months ago • 1 comments

Subject: MPI Error: MPI_Type_match_size Fails for 1-Byte Data in frame/collect_on_comm.c with Quilt I/O (WRF 4.7.0)

WRF Version: 4.7.0

Affected File: frame/collect_on_comm.c

Issue Description:

When using quilt I/O (e.g., nio_tasks_per_group > 0 and nio_groups > 0 in &namelist_quilt), a fatal MPI error occurs:

    Abort(873079564) on node 850 (rank 850 in comm 0): Fatal error in PMPI_Type_match_size: Invalid argument, error stack:
    PMPI_Type_match_size(199): MPI_Type_match_size(typeclass=1, size=1, datatype=0x7ffd4a2d7180) failed
    PMPI_Type_match_size(178): No MPI datatype available for typeclass MPI_TYPECLASS_REAL and size 1

This happens because the col_on_comm and dst_on_comm functions in frame/collect_on_comm.c first attempt to find an MPI datatype using MPI_Type_match_size(MPI_TYPECLASS_REAL, *typesize, &dtype).

When quilt I/O is active, typesize can be 1 (likely for MPI_CHAR, MPI_BYTE, or LOGICAL1 data being aggregated). The call MPI_Type_match_size(MPI_TYPECLASS_REAL, 1, ...) then correctly fails, as there is no standard 1-byte MPI real type, leading to the MPI abort.
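To make the failure mode concrete, here is a minimal stand-alone sketch in plain C (no MPI). It models a size-based lookup using the sizes of C's own floating-point and fixed-width integer types as a stand-in for what MPI_Type_match_size searches. The enum and function names are hypothetical, and a real MPI library consults its own datatype tables rather than this logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for MPI_TYPECLASS_REAL and MPI_TYPECLASS_INTEGER. */
enum typeclass { TC_REAL, TC_INTEGER };

/* Returns 1 if a C type of the given class is exactly `size` bytes, else 0.
   This only models the lookup; it is not how an MPI library implements it. */
int has_type_of_size(enum typeclass tc, size_t size)
{
    assert(size > 0);
    if (tc == TC_REAL) {
        return size == sizeof(float) || size == sizeof(double)
            || size == sizeof(long double);
    }
    return size == sizeof(int8_t) || size == sizeof(int16_t)
        || size == sizeof(int32_t) || size == sizeof(int64_t);
}
```

In this model, a 1-byte request finds a match in the integer class but none in the real class, which mirrors the "No MPI datatype available for typeclass MPI_TYPECLASS_REAL and size 1" message above.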

Successful Workaround:

The issue was resolved locally by modifying the logic in col_on_comm and dst_on_comm to attempt MPI_Type_match_size(MPI_TYPECLASS_INTEGER, *typesize, &dtype) before MPI_TYPECLASS_REAL, particularly when *typesize == 1.

For example, changing the order of checks for 1-byte data:

    // In col_on_comm and dst_on_comm
    if (*typesize == 1) {
      // For 1-byte data, try MPI_TYPECLASS_INTEGER first (covers MPI_CHAR, MPI_BYTE, etc.)
      ierr = MPI_Type_match_size(MPI_TYPECLASS_INTEGER, *typesize, &dtype);
      if (MPI_SUCCESS != ierr) {
        // Handle error if a 1-byte integer type is also not found
        fprintf(stderr,"%s %d FATAL ERROR: unhandled typesize = %d (tried as 1-byte INTEGER)!!\n", __FILE__,__LINE__,*typesize) ;
        MPI_Abort(MPI_COMM_WORLD,1) ;
      }
    } else {
      // Original logic for other typesizes
      ierr = MPI_Type_match_size (MPI_TYPECLASS_REAL, *typesize, &dtype);
      if (MPI_SUCCESS != ierr) {
        ierr = MPI_Type_match_size (MPI_TYPECLASS_INTEGER, *typesize, &dtype);
        if (MPI_SUCCESS != ierr) {
          fprintf(stderr,"%s %d FATAL ERROR: unhandled typesize = %d!!\n", __FILE__,__LINE__,*typesize) ;
          MPI_Abort(MPI_COMM_WORLD,1) ;
        }
      }
    }

With this (or a similar) modification where MPI_TYPECLASS_INTEGER is prioritized for 1-byte data, the model runs successfully with quilt I/O enabled.

Suggestion:

Please review the data type handling logic in frame/collect_on_comm.c for scenarios involving 1-byte data types, especially when quilt I/O is used. Prioritizing MPI_TYPECLASS_INTEGER for 1-byte data seems to be a viable solution.

Thank you.

nalssi89, May 13 '25 08:05

@nalssi89 If you'd like, please try out the solution in the linked PR. Further investigation showed that (1) MPI_Type_match_size fails critically for any size that has no match in the requested typeclass, and (2) critical failure should not be the expected behavior for a simple lookup.

As this segment of code is a newer addition meant to handle a critical error on large domains, I'm wary of changing the logic too much. I think a potentially cleaner solution may be to just check for MPI_TYPECLASS_INTEGER since the gather/scatter just operates on void * and what we care about is minimizing the number of displacement offsets.
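One way to read the point about displacement offsets: MPI gather/scatter calls such as MPI_Gatherv take their counts and displacements as int, so expressing them in bytes can overflow on large domains, while expressing them in units of a wider element keeps the values smaller. The sketch below is plain C with a hypothetical helper name and illustrative numbers, not code from frame/collect_on_comm.c. It only checks whether a byte length fits in an int when counted in elements of a given size.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns 1 if `nbytes` bytes, counted in whole elements of `typesize`
   bytes, fits in an int; returns 0 otherwise.
   Hypothetical helper for illustration only. */
int count_fits_in_int(unsigned long long nbytes, size_t typesize)
{
    assert(typesize > 0);
    return nbytes / typesize <= (unsigned long long)INT_MAX;
}
```

On a platform where int is 32 bits, a 3,000,000,000-byte buffer does not fit when counted in bytes but does fit when counted as 4-byte elements. Under this reading, only the element size affects the result, not whether the matched datatype is classed as integer or real.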

As a stopgap, however, circumventing the critical error and letting the existing code catch errors instead is a decent first pass.

islas, May 30 '25 19:05