mpich icon indicating copy to clipboard operation
mpich copied to clipboard

datatype: allow MPI_Type_size on unspported types

Open hzhou opened this issue 8 months ago • 13 comments

Pull Request Description

Builitin MPI types may not be supported, for example, when the compiler does not support the type. A common case is for Fortran types such as MPI_REAL when MPICH is configured without Fortran binding. Since there is no standard way for users check whether a builtin type is supported, MPI_Type_size can be used as a check. MPI_Type_size will return 0 if a builtin type is not supported. For MPI_Type_size to work as a check, it should not return error when the datatype is not supported. Thus, we need a custom error_check, rather than the default MPIR_ERRTEST_DATATYPE which will return error on unsupported types.

Fixes #7341

[skip warnings]

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [x] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.

hzhou avatar Mar 17 '25 15:03 hzhou

test:mpich/ch3/tcp

hzhou avatar Mar 17 '25 16:03 hzhou

https://github.com/pmodels/mpich/issues/7341#issuecomment-2726386689

Why 0 and not MPI_UNDEFINED?

jeffhammond avatar Mar 17 '25 20:03 jeffhammond

#7341 (comment)

Why 0 and not MPI_UNDEFINED?

As @dalcinl mentioned, it is an MPICH backward-compatible behavior.

For me, a non-existent datatype's size is 0 kinda makes sense. On the other hand, telling me a datatype's size is -32766 will leave me puzzled until I being reminded this is MPICH and the value happens to be MPI_UNDEFINED. That is, MPI_UNDEFINED is not intuitive unless one is looking for it.

hzhou avatar Mar 17 '25 21:03 hzhou

Is MPICH implementing the MPI 5.0 standard in the MPI ABI or will it make arbitrary changes?

We discussed this semantic during a virtual meeting where you were present. I am surprised that no objections were raised in the past six months since this change was made, largely in response to your objections to my original proposal for Fortran.

jeffhammond avatar Mar 18 '25 09:03 jeffhammond

For me, a non-existent datatype's size is 0 kinda makes sense. On the other hand, telling me a datatype's size is -32766 will leave me puzzled until I being reminded this is MPICH and the value happens to be MPI_UNDEFINED. That is, MPI_UNDEFINED is not intuitive unless one is looking for it.

MPI_UNDEFINED is -32766 in the MPI 5.0 ABI.

jeffhammond avatar Mar 18 '25 09:03 jeffhammond

Is MPICH implementing the MPI 5.0 standard in the MPI ABI or will it make arbitrary changes?

We discussed this semantic during a virtual meeting where you were present. I am surprised that no objections were raised in the past six months since this change was made, largely in response to your objections to my original proposal for Fortran.

Now I expressed my objection, so you should no longer be surprised. :) I admit I am guilty of not paying attention. We can return MPI_UNDEFINED in MPI_Type_size.

That said, I hope we still can discuss based on merit and logic.

As a software developer, it is easy to "accidentally" assign a value of type size to an unsigned integer or a char, because one would expect a type size fit in a byte. Thus, a sentinel value of -32766 can survive for a while and leading to mysterious behaviors. More people may do that than who would peruse the MPI standard.

In general, I don't think having a "universal" sentinel (MPI_UNDEFINED) is a good idea. But that ship sailed long before my time.

hzhou avatar Mar 18 '25 13:03 hzhou

Actually, I think we need to change it. The C standard library uses size_t (unsigned) for size parameters, which means that if a user fails to check the size, bad things can happen. In most cases, the segfault will be quick, but relying on that seems silly.

% gcc -Wall -Wextra -Wpedantic size.c && ./a.out || cat size.c 
size=-32766
foo: size=18446744073709289488
zsh: segmentation fault  ./a.out
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int MPI_Type_size(int handle, int * size)
{
    (void)handle;
    *size = -32766;
    return 0;
}

void foo(size_t size)
{
    printf("foo: size=%zu\n", size);
}

int main(void)
{
    int size;
    MPI_Type_size(0,&size);
    printf("size=%d\n", size);
    foo( size * sizeof(double) );
    double *buf = malloc( size * sizeof(double) );
    memset(buf, 0xFF, size * sizeof(double) );
    printf("buf=%p, buf[0]=%f\n", (void*)buf, buf[0]);
    free(buf);
    return 0;
}

jeffhammond avatar Mar 18 '25 13:03 jeffhammond

@dalcinl If we simply return error on unsupported builtin types, will you be able to deal with it (via error return or exception) in mpi4py?

hzhou avatar Mar 18 '25 15:03 hzhou

If we simply return error on unsupported builtin types

I'm not putting my Python agenda in front to discuss/propose things that MPI has to improve or change. Whatever is useful for Python, would most likely be useful for Julia, C++, Java, or even Fortran bindings. It looks really silly to have something like MPI_REAL2 while not having a robust way to check whether the thing can actually be used.

Relying on MPI error handling does not play well with MPI_ERRORS_ARE_FATAL, no matter what the language binding is. Moreover, MPI_ERRORS_ARE_FATAL is the error handler that would trigger by default in this case. Asking the MPI users to go on a get/set/restore error handler on COMM_WORLD/COMM_SELF just to query for availability of a datatype is IMHO an unreasonable burden.

All that being said, I'm currently relying on error handling and Python exceptions to run mpi4py tests.

At this point, I leaning towards relying on MPI_Abi_get/set_fortran_info(). It is a string-based interface, thus not ideal, but at least it should be reliable. The only minor nit is that, once again, you have to rely in error handling to know if the Fortran ABI has been already set. I need to think more about it, I may find additional oddities that make it impractical.

PS: The definitive solution is to have an API like MPI_Type_supported().

dalcinl avatar Mar 18 '25 19:03 dalcinl

@hzhou I would say let's put this PR on hold for a little. This is the kind of troubles that happen when people design and discuss APIs without the feedback from a prototype implementation. Maybe we can can come up with an alternative to relying on special-casing MPI_Type_size.

dalcinl avatar Mar 18 '25 19:03 dalcinl

If we simply return error on unsupported builtin types

I'm not putting my Python agenda in front to discuss/propose things that MPI has to improve or change. Whatever is useful for Python, would most likely be useful for Julia, C++, Java, or even Fortran bindings. It looks really silly to have something like MPI_REAL2 while not having a robust way to check whether the thing can actually be used.

MPI_REAL2 can be queried from MPI_Abi_get_fortran_info already. My mistake seems to have been not adding the unsized types there.

PS: The definitive solution is to have an API like MPI_Type_supported().

If you are reading the standard to know MPI_Type_supported exists, you already know how how to implement this function yourself. https://github.com/mpi-forum/mpi-issues/issues/977#issuecomment-2736409475

I don't know why people think a new function is a solution to the problem of users not reading the standard.

jeffhammond avatar Mar 19 '25 12:03 jeffhammond

mpi-forum/mpi-issues#977 (comment)

Your code relies on MPI_Type_size() returning MPI_UNDEFINED. And that's precisely what people are objecting to, that is, the re-purposing of MPI_Type_size with the return of a bogus value to detect unsupported datatypes.

I don't know why people think a new function is a solution to the problem of users not reading the standard.

I have no idea what you are talking about. We all agree that we need a robust way (meaning it does not involve error handling) to query for unsupported datatypes. A sentinel output from MPI_Type_size was proposed as such mechanism, originally MPI_UNDEFINED, later 0 (zero). But as it usually happen with re-purposing APIs for unrelated tasks, there is no agreement on what the output sentinel should be, some people do not like 0 (zero), some others do not like MPI_UNDEFINED. An obvious way of of the conundrum is to add a new API specific for the problem to solve.

dalcinl avatar Mar 19 '25 14:03 dalcinl

If you are reading the standard to know MPI_Type_supported exists, you already know how how to implement this function yourself.

This reasoning is flawed. People who use MPI_Type_size most likely will not be aware that the MPI builtin type may not be available. They believe they know what MPI_Type_size does. People don't read the standard when they think they know what they are doing. On the other hand, people who are aware that some of the builtin types may not be supported -- likely learned by an error message or, worse, a lesson from crashes and bugs -- will look for a way to check the types before use. They will search the manual when they know they don't know for their purpose.

PS: When people search the manual for solution, functions like MPI_Type_supported will only take a minute for them to discover. On the other hand, finding that one can use MPI_Type_size and then check its return against MPI_UNDEFINED is more difficult.

hzhou avatar Mar 19 '25 14:03 hzhou

Update: now mpich internally use internal datatypes and maintain a table that maps builtin types such as MPI_INTEGER to an internal type. When the type is not supported at configure time and haven't been updated during runtime, it by default maps to MPI_DATATYPE_NULL. So MPI_Type_size(MPI_INTEGER) when Fortran is disabled will have the exact same behavior as MPI_Type_size(MPI_DATATYPE_NULL). Unfortunately, the latter behavior is to return an error (and causing fatal abort if unhandled).

Semantically, I think it makes sense to have the same behavior as MPI_DATATYPE_NULL when the type is unsupported. Whether that should result an error or 0 or MPI_UNDEFINED remain to be discussed.

Anyway, I am going to close this PR. If needed, open a separate issue.

hzhou avatar Oct 16 '25 19:10 hzhou