ompi pack external32 long double conversion (extended 80 / quad 128)

trafficstars

[updated the top-level github description to match the commit message]:

On architectures that store long doubles as 80 bit extended precisions or as 64 bit "float64"s, we need conversions to 128 bit quad precision to satisfy MPI_Pack_external/Unpack_external. I added a couple more arguments to pFunction to know what architecture the 'to' and 'from' buffers are. Previously we had architecture info 'local' and 'remote' but I don't know how to correlate local/remote with to/from without adding more arguments as I did.

With the incresed information about the context, the conversion function can now convert the long double as needed.

I'm using code Lisandro Dalcin contributed for the floating point conversions in f80_to_f128, f64_to_f128, f128_to_f80, and f128_to_f64. These conversion functions require the data to be in local endianness, but one of the sides in pack/unpack is always local so operations can be done in an order that allows the long double conversion to see the data in local endianness.

I also added a path to use __float128 for the conversion for #ifdef HAVE___FLOAT128 as that ought to be the more reliable method than rolling our own bitwise conversions.

The reason for all the arch.h changes is the former code was inconsistent as to how bits were labeled within a byte, and had masks like LONGISxx that didn't match the bits they were supposed to contain.

May 10 '21 07:05 markalle

Fixes https://github.com/open-mpi/ompi/issues/8918 not well tested yet.

So far just tested on Mac.

Needs tested somewhere bigendian, and somewhere that __float128 is defined since I only just added that path and didn't test it at all yet. A testcase is available at https://gist.github.com/markalle/ad7e69f026471e2baa8e842c938d8048

May 10 '21 07:05 markalle

@markalle I'll insist with a point I made before. I believe my code is overall correct for BE arches, but what I'm not sure about is the helper union layout for long double on big endian either 32bit or 64bits. After making sure of the union layout, the rest of the conversion code should work correctly.

May 10 '21 07:05 dalcinl

I wasn't saying your code is wrong for big-endian machines, I'm saying it only converts data that's in the local endianness, it can't be used directly on big-endian data on a little-endian machine, but that's okay because each conversion we do involves either going to or from the local endianness so I've ordered the endianness-conversion and the f80/f128 conversions so it only has to operate on local endianness data.

I'm undecided how much to invest in performance. I think with all the steps involved and multiple copies already taking place, and using memcpy() instead of dereference it's going to be insanely slow. But it's definitely possible to be packing on a 3-byte boundary for example, so I think it has to use memcpy() or at least have that as an option anyway.

I still haven't studied the code carefully enough to be confident that the big-endian side is right. I ran two QEMU setups (ppc / mips) and both were storing long double as a 64-bit double, so I didn't actually get to see how a big-endian with 80-bit extended precision lays out its data. My guess though is that the current code has the padding in the wrong place, although I'm only guessing. Current code for big-endian 80-bit extended:

    unsigned sign  :  1;
    unsigned exp   : 15;
    unsigned pad   : 16;
    unsigned frac1 : 32;
    unsigned frac0 : 32;

I would have expected the pad at the bottom, so sign,exp,frac1,frac0 would all be adjacent. Is there a reason you initially coded it as above? Without a test machine I'm not 100% confident, but I'm inclined to move the padding and make the data adjacent

May 10 '21 20:05 markalle

I didn't go too far with performance, but those memcpy() were awfully expensive in at least some cases, so I put an alignment check and conditionally avoid the memcpy now, and repushed.

And I'm making a guess about what an extended double precision looks like natively on a big endian machine and moving the padding vs what @dalcinl had. I'm not positive though, just suspicious that it ought to be this way.

May 10 '21 23:05 markalle

Is there a reason you initially coded it as above?

I followed the layout in ieee754.h from glibc. Look for union ieee854_long_double in that header [link]

However, I found it suspicious, too. Look at GCC's software-fp implementation [link] They have the pad at the beginning! That cannot be right on i386, right?

The truth is, I don't really know whether there exists any big-endian architecture where long double is extended precision (float80) instead of just double or quad precision. If there is none, then all our care about float80 big-endian is pointless. I would say go ahead with any union layout you want, leave a comment in the code, and let users complain if they get the wrong thing.

May 11 '21 05:05 dalcinl

Thanks, that's a good argument for that format then. So I just repushed to put the padding back where that header had it.

May 11 '21 13:05 markalle

@dalcinl are you happy with this PR? Should we merge it?

May 13 '21 19:05 gpaulsen

@bosilca is out on vacation this week 🍹 ; he probably needs to review this before we merge.

May 13 '21 21:05 jsquyres

@markalle I believe we missed a few important points in our original discussions.

I have a Raspberry Pi at home (armv7l). In there, long double is the same as double. So we need to handle this case and implement float64 <-> float128 conversion routines to handle that case.
I managed to use a qemu VM to install Debian8 on PowerPC 32bit. In there, long double uses 16 bytes, but it is a different format than Intel's float80. For a quick explanation see first section 1.1 here. This format is still used in POWER7/POWER8 and ppc64le Linux builds.

So, in short, a long double <-> quad precision implementation should be done the following way:

If __float128 is available, use it and move on. After handling byte order, you are done.
Otherwise, we look at the LDBL_MANT_DIG and LDBL_MAX_EXP macro from float.h. a) if LDBL_MANT_DIG == 53 and LDBL_MAX_EXP == 1024, then long double is float64 and we need new conversion routines to handle f64 <-> f128. Incidentally, note that this is the case on Windows with Microsoft compilers, even if running on Intel architectures. b) if LDBL_MANT_DIG == 64 and LDBL_MAX_EXP == 16384, then long double should be Intel's float80 extended precision. You already have all the needed bits. c) if LDBL_MANT_DIG == 112 and LDBL_MAX_EXP == 16384, then long double is actually IEEE quad precision float128. So you just need to care about byte order. Note: POWER9 has hardware support for float128, so we may see long double becoming quad precision soon. It seems Fedora has plans for it. d) if LDBL_MANT_DIG == 106 and LDBL_MAX_EXP == 1024 this is IBM extended precision format (relevant for POWER7 and POWER8). I do not know the details of this format (other than it is built out of two double values), but conversion routines should not be hard. We could add support in a subsequent PR. e) Otherwise, ERROR! We do not know the fp format, and errors should never pass silently.

Sorry for not figuring this out before. Fortunately, your none of work so far goes to waste. We just need more. I could contribute the f64 <-> f128 routines. But you will have to want a three to four days for them.

@gpaulsen @jsquyres At this point, I think this is not ready for merge, IMHO we can do better.

May 14 '21 10:05 dalcinl

FWIW and IIRC, the existing conversion subroutine was used between x86_64 and sparcv9

May 14 '21 10:05 ggouaillardet

@markalle I've updated my gist with float64 <-> float128 conversion routines. Please note I also changed the union names used in the implementation.

May 14 '21 13:05 dalcinl

Repushed. I used the new code conversions where @dalcinl added f64 as well as f80, but renamed the functions to f80_to_f128() etc.

The usage from the OMPI level is in the macro that defines a pFunction for LONG_DOUBLE conversion. If it's not converting LONG_DOUBLE or the long double format in both the to/from architectures is the same it does memcpy, then if it has to convert, it does

endianness to local
long double in "from_arch" format to f128, if the from_arch doesn't already have its long double == f128
f128 to long double in "to_arch" format, if the to_arch isn't asking for long double == f128
endianness to to_arch

May 25 '21 20:05 markalle

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/053f2148801eacbef27f8da18a5bd33d

May 25 '21 20:05 ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0bf5c2ff6dbfecebf2f574543c22194a

May 25 '21 20:05 ibm-ompi

Update: okay, testing shows not quite ready still. I think the detection is still a slight weak spot

May 25 '21 23:05 markalle

Repushed based on ppc testing and my comment above. That path worked fine in the default mode (where __float128 is available) but I also tested artificially turning off HAVE__FLOAT128 just to make it test another path, and in that mode it didn't have a conversion for MANTISSA=106 and EXP=10. There's an argument for erroring out in that case if we don't have a conversion for the detected long double format, but at least for now I decided to allow it since the detection is new and I'd hate to produce a false failure where we error out on a conversion when we didn't have to.

And besides, in the regular mode where I don't artificially push it through an alternate path the code was fine using __float128.

May 26 '21 00:05 markalle

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e86c3001047ea2c0d1440f392563b96f

May 26 '21 00:05 ibm-ompi

opal_copy_functions_heterogeneous.c:522:35: error: __float128 is not supported on this target
                  f128_buf_to += sizeof(__float128);
opal_copy_functions_heterogeneous.c:580:37: error: __float128 is not supported on this target
                  f128_buf_from += sizeof(__float128);

May 26 '21 03:05 gpaulsen

So the systems I've tested on so far are

x86_64 mac (which doesn't have __float128, uses f80_to_f128)
ppc9 linux (uses __float128 conversion path)
ppc9 linux (with __float128 artificially disabled leaving it with no conversion)
x86_64 linux (uses __float128 conversion path)
x86_64 linux (with __float128 artificially disabled so it used f80_to_f128)

It's not a bad sampling, but does leave a bit of a hole for big endian.

May 26 '21 18:05 markalle

@bosilca Can you please take a look as well?

Jun 17 '21 13:06 gpaulsen

For tracing through the changes, I think main pieces are:

long double detection: based on LDMANTDIG and LDEXPSIZE in arch
added from_arch and to_arch arguments to pFunction
walk the array of doubles through multiple conversions until we get from one arch&&endianness to the other
the bulk of the code is the bottom-level f64_to_f128() etc functions

The bottom level f64_to_f128() etc all only operate in the local endianness, but the order of the conversion steps is set up so even if the input/output is the other endianness, it's been converted to local endianness at the time the f64_to_f128() etc is happening.

Jul 27 '21 17:07 markalle

This has been sitting out here for a while. @bosilca does this look good?

Sep 13 '21 13:09 awlauria

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/a6708b712cfa36425b4b47b3ad776e14

Sep 13 '21 20:09 ibm-ompi

bot:ibm:pgi:retest

Sep 13 '21 21:09 jjhursey

bot:ibm:pgi:retest

Sep 14 '21 14:09 jjhursey

@bosilca Can you please take a look? Do we need this for v5.0.0?

Sep 24 '21 15:09 gpaulsen

@bosilca @jsquyres ^^^ ?

Oct 14 '21 19:10 gpaulsen

I'm trying to refresh my memory if there were any concerns left in this PR.

One test I ran that I didn't mention in my comments above is for big-endian, I ran a mips machine in QEMU (where the local long double is 64-bit big endian) and used the above code standalone at the level of ldbl_to_f128() and f128_to_ldbl(). I never did find a big endian machine using 80-bit floats to test on, but at least for the 64-bit big endian long doubles, both the detection and both directions of conversion ran correctly. (That path was using our own bitwise code, since that architecture doesn't have "__float128" provided by the system for us to use).

Performance-wise I remember not being super excited by the path that casts to the system's "__float128" for conversion, but I still think that's the most guaranteed-correct code path we could possibly use and so we just have to go with it when it's available, and that's what this PR does in its HAVE___FLOAT128 path.

@dalcinl : any more thoughts on this, it's been sitting untouched or a while now @bosilca / @jsquyres any chance this can go in?

Nov 09 '21 23:11 markalle

I never did find a big endian machine using 80-bit floats to test on

@markalle float80 is an Intel format, and Intel processors are little-endian. So I doubt you will ever find a machine to test float80 on bit-endian.

Have you ever tested your conversion code from this PR on s390x? Travis-CI supports it (arch: [s390x] in .travis.yml).

Nov 10 '21 07:11 dalcinl

I found an ubuntu image for s390x and ran it under QEMU anyway, and it worked there. The OMPI code looks at LDBL_MANT_DIG and LDBL_MAX_EXP and sets OPAL_ARCH_LDMANTDIGIS113 and OPAL_ARCH_LDEXPSIZEIS15 from those, which puts us in a no-conversion path since that looks like the architecture is already using float128, which I think is right.

Stepping through an example native long double inside the s390x qemu:

value printed as a long double: 31415.92653589793238462643
bin 01000000 00001101 11101010 11011111
    10110100 11000101 11010011 10010000
    11000010 00100010 11010100 01101000
    01010110 10001001 00010001 10000101
* sign bit at 0
* exponent starts at bit 1: 1000000 00001101 = 16397, then bias by -16383 = 14
* mantissa = binary fraction with implied leading 1: 1.11101010...
  = 1.9174759848570515371476094866943358629856789306146171355095855994533850415184605964213915285654366016387939453125
* val = 1.917... * 2^14
  = 31415.92653589793238462643382999999...

So it looks like that architecture is using the ieee float128 already.

Nov 18 '21 00:11 markalle

ompi ompi copied to clipboard

pack external32 long double conversion (extended 80 / quad 128)

ompi
ompi copied to clipboard