
Lesser precision float (real(16))

Open damianmoz opened this issue 1 year ago • 39 comments

The 16-bit floating point type, what IEEE 754 calls BINARY16, will become de rigueur sooner rather than later. Note that I am not talking about BFLOAT16 (a.k.a. BF16), which is not an IEEE 754 format.

Some modern hardware provides full arithmetic support for such types. Some hardware just provides support to manipulate (but not do 16-bit floating point arithmetic with) such data types. And there is some hardware which falls in between, i.e. they support it only in SIMD instructions.

Consideration should be given, even if just in planning, that Chapel will need to support such types in the foreseeable future.

Those with a vested interest in such a type can definitely better add to the conversation than I can.

damianmoz avatar May 21 '24 11:05 damianmoz

Just noting here that smaller-precision floats are valuable for GPU programming (especially for AI/ML), as well.

e-kayrakli avatar May 21 '24 23:05 e-kayrakli

Yes. I was trying to think of something insightful about GPUs and AI/ML to add to my original post and I very quickly got outside my technological comfort zone. So I will leave that to yourself Engin and others far more competent than I.

damianmoz avatar May 22 '24 01:05 damianmoz

Noting that this came up on Discord this week as being of interest to the team developing ChAI: https://discord.com/channels/1301640932709629963/1347356485176266813/1347356485176266813

bradcray avatar Mar 07 '25 16:03 bradcray

Anything important? I do not use Discord out of privacy concerns. It wants to know my birthday. A bit scary.

damianmoz avatar Mar 07 '25 18:03 damianmoz

The use case was trying to ingest a (binary?) data file on 16-bit floats for an AI workflow.

It wants to know my birthday. A bit scary.

You can always lie to it. :D

bradcray avatar Mar 07 '25 19:03 bradcray

Birthday lies need to be remembered for subsequent verification. Sigh.

That said, it would be nice if the technical discussion about real(16) could try and stay in the one place.

Are people in ChAI trying to use BINARY16 or bfloat16 or some other format? Are they asking for arithmetic support or just the ability to store into and from memory? There will likely be more about this format in IEEE 754 in 2029. There is also an IEEE committee working on an 8-bit floating point format at the moment.

damianmoz avatar Mar 07 '25 22:03 damianmoz

Are people in ChAI trying to use BINARY16 or bfloat16 or some other format? Are they asking for arithmetic support or just the ability to store into and from memory?

I don't know the answer on format. They're trying to read the values in from a file and convert them to real(32) so that they can compute with the values in Chapel. Here are some code sketches I gave them as ideas for doing the conversion:

testit1b.h:

#include <stdint.h>

static inline float int16ToFloat32(const int16_t* p) {
  return (float)(*((_Float16*)((void*)p)));
}

testit1b.chpl:

use CTypes;

require "testit1b.h";

extern proc int16ToFloat32(const ref p: int(16)): real(32);

var piAsInt16 = 16968: int(16);
var piAsReal32 = int16ToFloat32(piAsInt16);
writeln(piAsReal32);

var piArr16: [1..10] int(16) = 16968;
var piArr32 = int16ToFloat32(piArr16);
writeln(piArr32);

That said, it would be nice if the technical discussion about real(16) could try and stay in the one place.

Sure, but I don't think it's realistic or reasonable to expect anyone who has a question about a topic to search all forums to make sure the topic hasn't been discussed somewhere else first… By analogy, if they'd asked their question there first, I wouldn't expect you to mention this feature request there rather than here.

Birthday lies need to be remembered for subsequent verification. Sigh.

I don't think it's used for verification so much as NSFW access and maybe overly cheery birthday greetings. At least, it's never asked me about it again.

bradcray avatar Mar 07 '25 22:03 bradcray

I will go chase it down on Discord and add it to my Discourse and GitHub. I was trying to avoid Gitter and Stack Overflow.

It seems like they are reading 16-bit integers rather than a binary format used to represent sign, exponent and significand.

damianmoz avatar Mar 07 '25 22:03 damianmoz

It seems like they are reading 16-bit integers rather than a binary format used to represent sign, exponent and significand.

I don't believe that's the case. The reason for int(16) in my sample was that I was suggesting that since they couldn't read 16-bit floats directly into Chapel, and if the data was in binary format, that they could potentially read it into 16-bit ints, then use C's type looseness to convert to C float / Chapel real(32).

bradcray avatar Mar 07 '25 23:03 bradcray

Given 16 bits of unsigned integral data that represent what IEEE 754 calls a BINARY16 floating point number, sometimes called float16, the following converts that bit string into a real(32). Please feed it lots of test data that can be verified and let me know the results. This needs more testing.

// return the decoded IEEE 754 real(16) encoding stored within a uint(16)

proc copyReal16BitsToReal32(t : uint(16)) : real(32)
{
    type U = uint(32); // abbreviation
    type R = real(32); // ...ditto...

    param p16 = 11, b16 =  15:U, w16 = 16; // precision, bias, width of real(16)
    param p32 = 24, b32 = 127:U, w32 = 32; // precision, bias, width of real(32)

    param _L =  1:U << (p16 - 1); // the underflow threshold of real(16)
    param _N =  1:U << (w16 - 1); // mask of the negative field of real(16)
    param _E = 31:U << (p16 - 1); // mask of the exponent field of real(16)

    // For a real(16) number which is neither zero nor subnormal, its cast to
    // a real(32) involves copying the sign or most significant bit of the
    // real(16) to that of the real(32), copying the remaining bits of the
    // real(16) to those of the real(32) while ensuring that their implicit
    // binary points are aligned and the excess bits within the real(32) are
    // zero filled, and then subsequently uplifting the real(32)'s own biased
    // exponent field by the difference of a real(32)'s exponent bias and a
    // real(16)'s. With a copy of the real(16)'s encoding within a real(32)'s
    // encoding, both copying tasks can be done in-place using left shifts.

    param sign_only_shift = w32 - w16; // the net of their bit widths
    param magnitude_shift = p32 - p16; // the net of their precisions

    // The uplift is added in-place within the biased exponent field of the
    // IEEE 754 encoding, so any uplift must be aligned to match that field.
    // As this uplift is a function of whether the real(16) is non-finite or
    // normal, there are just two uplift cases. The reference case is the
    // uplift for a non-finite real(16), calculated at compile-time. The
    // alternate case is for a normal real(16), exactly half the reference
    // uplift, i.e. its right shifted variant, which is derived as needed.

    param non_finite_uplift = (b32 - b16) << p32; // appropriately aligned!

    inline proc magnitude_encoding(_a : U)
    {
        param least_positive_nonzero_real16 = 0x1.0p-24:R;

        return if _a < _L then
            // a subnormal or a zero:
            // Such real(16) numbers are an integral multiple of the least
            // positive non-zero real(16) number where its encoding _a is
            // that multiplier. The magnitude is recomputed using real(32)
            // arithmetic and its real(32) encoding is retrieved from that.
            (_a:R * least_positive_nonzero_real16).transmute(U)
        else
            // a normal or non-finite:
            // Mapping a real(16) magnitude stored inside a real(32)'s
            // encoding is the left shift of the real(16) encoding and
            // the addition to the biased exponent field of a multiple
            // of the difference of a real(32)'s bias and a real(16)'s,
            // a function of whether _a is normal or non-finite
            (_a << magnitude_shift) + (non_finite_uplift >> (_a < _E):U);
    }

    const _t = t:uint(32);
    const _n = _t & _N;  // negative bit field
    const _a = _t & ~_N; // magnitude (exponent and significand) bit fields

    return ((_n << sign_only_shift) | magnitude_encoding(_a)).transmute(R);
}
private proc unitTest // a few simple test cases - not enough - test more!!
{
    const smallestReal16 = 0b0000010000000000:uint(16);
    const minusbitsReal16 = 0b1000000000000000:uint(16);
    const halfbitsReal16 = minusbitsReal16 | ((14:uint(16)) << 10);
    const tiniest = 0x1.0p-24;
    
    writef("%u\n", smallestReal16);
    writef("%xu\n", smallestReal16);
    writeln(copyReal16BitsToReal32(smallestReal16 | minusbitsReal16));
    writeln(-0x1.0p-14);
    writeln(copyReal16BitsToReal32(halfbitsReal16));
    writeln(-0x1.0p-1);
    writeln(copyReal16BitsToReal32(halfbitsReal16 | (1:uint(16) << 9)));
    writeln(-0x1.8p-1); 
    writeln(copyReal16BitsToReal32(0:uint(16)));
    writeln(copyReal16BitsToReal32(0:uint(16)));
    const t0 = copyReal16BitsToReal32(1:uint(16));
    writeln(t0, "\n", tiniest);
    const t1 = copyReal16BitsToReal32(0x8001:uint(16));
    writeln(t1, "\n", -tiniest);
}   
config const UT = 9;
    
if UT != 0 then unitTest;

Get back to me with bugs or if my quick comments are less than adequate. Hope this helps. There might be a library routine for this stuff. However, the above is pure Chapel.

Love that transmute feature!!

damianmoz avatar Mar 08 '25 02:03 damianmoz

I only just read Brad's C program properly. I did not realise that there was a C compiler that supported _Float16 that could be in the Chapel mix. If that works for you, it would be faster than my ideas.

Sorry, I thought that you already had the byte stream of the ML results available in Chapel and wanted to do things solely in Chapel.

damianmoz avatar Mar 09 '25 11:03 damianmoz

static inline float int16ToFloat32(const int16_t* p) {
  return (float)(*((_Float16*)((void*)p)));
}

This might well work, but it's type punning, which C's strict aliasing rules don't allow. IMO the best way to write this pattern is with memcpy to a local variable of the new type:

static inline float int16ToFloat32(const int16_t* p) {
  _Float16 tmp;
  memcpy(&tmp, p, 2);
  return (float) tmp;
}

That avoids the type-punning warnings (and potential undefined behavior).

mppf avatar Mar 10 '25 15:03 mppf

While Chapel's optimizer handles memcpy(), I thought the preferred practice these days is

static inline float int16ToFloat32(const int16_t* p) {
  union { _Float16 f; int16_t i; } u = { .i = *p };

  return (float) u.f;
}

But yes, pointer-based type punning is a no-no these days (reading a union member other than the one last written is allowed in C, though not in C++).

damianmoz avatar Mar 10 '25 16:03 damianmoz

Given 16 bits of unsigned integral data that represent what IEEE 754 calls a BINARY16 floating point number, sometimes called float16, the following converts that bit string into a real(32). Please feed it lots of test data that can be verified and let me know the results. This needs more testing.

// return the decoded IEEE 754 real(16) encoding stored within a uint(16)

proc copyReal16BitsToReal32(t : uint(16)) : real(32)
{
    type U = uint(32); // abbreviation
    type R = real(32); // ...ditto...

param p16 = 11, b16 =  15:U, w16 = 16; // precision, bias, width of real(16)
param p32 = 24, b32 = 127:U, w32 = 32; // precision, bias, width of real(32)

param _E = 31:U << (p16 - 1); // mask of the exponent field of real(16)
param _N =  1:U << (w16 - 1); // mask of the negative field of real(16)
param _L =  1:U << (p16 - 1); // the underflow threshold of real(16)

inline proc exponent_adjustment(_a : U)
{
    // For a normal real(16), casting it to real(32) requires its biased
    // exponent field grow by the bias difference, i.e. the net_bias. For a
    // non-finite real(16), the field grows by double the bias difference
    // of the normal case.  Rather than multiplying the net_bias by 2 (or
    // left alone) to get the growth for that non-finite (or normal) case,
    // it will be left (or null) shifted. Determining the non-finite-ness
    // could be done by the (_a & _E) == _E paradigm, but the alternative
    // (_a + _L) >> (w16 - 1) is faster. The net_bias will be compile-time
    // aligned in the biased exponent field. Faster again.

    param aligned_net_bias = (b32 - b16) << (p16 - 1);
    const non_finite_datum = (_a + _L) >> (w16 - 1);

    return aligned_net_bias << non_finite_datum;
}
inline proc real32_encoding(t16 : U)
{
    // In realigning a real(16) encoding to a real(32), the left shift for
    // * its negative is the net of their bit widths, and
    // * its magnitude is the net of their precisions.

    param sign_only_shift = w32 - w16;
    param magnitude_shift = p32 - p16;

    param least_positive_nonzero_real16 = 0x1.0p-24:R;

    const _n = t16 & _N;  // negative bit field
    const _a = t16 & ~_N; // magnitude (exponent and significand) bit fields

    return (_n << sign_only_shift) |
    (
        // map the encoding of the real(16) magnitude to that of a real(32)
        if _a < _L then
            // a subnormal or a zero:
            // Such real(16) numbers are an integral multiple of the least
            // positive non-zero real(16) number where its encoding _a is
            // that multiplier. The magnitude is recomputed using real(32)
            // arithmetic and its real(32) encoding is retrieved from that.
            (_a:R * least_positive_nonzero_real16).transmute(U)
        else
            // a normal or non-finite:
            // Mapping a real(16) magnitude to its real(32) equivalent
            // requires (a) enhancing the biased exponent field in _a by
            // the difference of a real(32)'s bias and a real(16)'s when
            // _a is normal, or twice that when _a is non-finite, and
            // (b) aligning that encoding to match a real(32) so that
            // the real(16) significand's high bits match the real(32)'s
            (_a + exponent_adjustment(_a)) << magnitude_shift
    );
}
return real32_encoding(t).transmute(R);

}

private proc unitTest // a few simple test cases - not enough - test more!!
{
    const smallestReal16 = 0b0000010000000000:uint(16);
    const minusbitsReal16 = 0b1000000000000000:uint(16);
    const halfbitsReal16 = minusbitsReal16 | ((14:uint(16)) << 10);
    const tiniest = 0x1.0p-24;

    writef("%u\n", smallestReal16);
    writef("%xu\n", smallestReal16);
    writeln(copyReal16BitsToReal32(smallestReal16 | minusbitsReal16));
    writeln(-0x1.0p-14);
    writeln(copyReal16BitsToReal32(halfbitsReal16));
    writeln(-0x1.0p-1);
    writeln(copyReal16BitsToReal32(halfbitsReal16 | (1:uint(16) << 9)));
    writeln(-0x1.8p-1);
    writeln(copyReal16BitsToReal32(0:uint(16)));
    writeln(copyReal16BitsToReal32(0:uint(16)));
    const t0 = copyReal16BitsToReal32(1:uint(16));
    writeln(t0, "\n", tiniest);
    const t1 = copyReal16BitsToReal32(0x8001:uint(16));
    writeln(t1, "\n", -tiniest);
}
config const UT = 9;

if UT != 0 then unitTest;

Get back to me with bugs or if my quick comments are less than adequate. Hope this helps. There might be a library routine for this stuff. However, the above is pure Chapel.

Love that transmute feature!!

Thank you so much for this implementation! I used it to add support for float16 model loading in the ChAI library.

Iainmon avatar Mar 17 '25 20:03 Iainmon

Glad I could be of help. The comments in my code could be improved.

damianmoz avatar Mar 17 '25 22:03 damianmoz

I am attaching some internal documentation on variable names and programming paradigms that occur in routines of ours like the above. Note that internally, some names which might have been camel-cased in the past are not. We have been having problems with the readability of long multi-word camel-cased names and have reverted to underscores to see how they go. So things are in a state of flux there. Apologies for any confusion.

TM-25-001-D01.pdf

damianmoz avatar Mar 19 '25 01:03 damianmoz

Birthday lies need to be remembered for subsequent verification. Sigh.

That said, it would be nice if the technical discussion about real(16) could try and stay in the one place.

Are people in ChAI trying to use BINARY16 or bfloat16 or some other format? Are they asking for arithmetic support or just the ability to store into and from memory? There will likely be more about this format in IEEE 754 in 2029. There is also an IEEE committee working on an 8-bit floating point format at the moment.

For now, reading and writing brain floats ('real(16)') was a stopping point for us. But now that we have it (thanks by the way!), the next steps are to support 16-bit tensors and operations on them. I can imagine a world where operations on bare real(16) values work just like those on other types, making it very easy to write real(16) tensor operations. On the other hand, if Chapel doesn't plan on adding real(16)s, then extending a tensor type's eltType to support the junior float type would be challenging and introduce many special cases.

Either way, this feature would make the whole ChAI UX work better with modern AI tutorials and projects. (And it would probably speed up ChAI benchmarks, since narrower elements mean more values per SIMD vector.)

Adding numerical type features seems to be a long and careful process. Knowing what challenges would come with real(16) might help us find some workarounds. Can you enlighten me in this area?

Iainmon avatar Mar 26 '25 16:03 Iainmon

On the other hand, if Chapel doesn't plan on adding real(16)s

I don't think this is our mindset. I'd describe it as a feature that we've heard interest in, but haven't been able to prioritize to date. And for some time, I think we were worried about whether it was standard enough in the back-ends we support that we could rely on it being available on all platforms (perhaps incorrectly). I also believe that work in recent years that has focused on built-in numeric types has been performed with the thought that it's likely that we'll add new narrower/wider types over time.

Adding numerical type features seems to be a long and careful process. Knowing what challenges would come with real(16) might help us find some workarounds. Can you enlighten me in this area?

I'm not aware of significant challenges offhand to get basic language support—I think it's mostly a matter of plumbing an additional width through the compiler and code generator, and that a lot of it would be fairly straightforward for a Chapel developer, and probably not too difficult for a community contributor by following some breadcrumbs of where real types and widths are defined in the compiler today. The most likely sticking points in the compiler are probably around things like function resolution and implicit conversions—though those are some of the aforementioned areas where I think we've been trying to be mindful of adding additional bitwidths over time. I'm thinking of recent work by @mppf, for example, who may be able to say whether I'm mischaracterizing things, missing key challenges, or oversimplifying the effort required.

Another aspect of the effort will be expanding the standard library to support them, but that could (potentially) be done piecemeal as the need arises, and is probably conceptually simple, just non-trivial in size (since it's probably proportional to the number of library routines accepting floating point values). As an example, a new sin() overload would need to be written, or the current implementation refactored to take a real(?w) and to dispatch to the proper routine depending on the value of w.

bradcray avatar Mar 26 '25 17:03 bradcray

I expect that we won't need any resolution changes to add real(16) (or real(128)).

mppf avatar Mar 26 '25 17:03 mppf

For anyone reading who feels inspired to give this a try, the approach I'd take if I were diving into it (as someone who doesn't work in the compiler very much anymore) would be to look for occurrences of dtReal and FLOAT_SIZE_[32|64] in the compiler and expand those code paths to support 16-bit values as well. compiler/AST/symbol.cpp and compiler/AST/type.cpp (and their respective .h files) are the two key places that establish these types. In the runtime, include/chpltypes.h maps the Chapel-oriented _real32 and _real64 aliases used by the compiler when generating code with the C back-end (which is where I'd start because it's more familiar to me and also easier for me to debug the code that the compiler is generating). LLVM's interface may be more straightforward, though, since it's probably just a matter of handling dtReal values elsewhere in the compiler.

bradcray avatar Mar 26 '25 17:03 bradcray

@Iainmon : After writing this up this morning, I spent the day wondering if there were surprises I wasn't anticipating, and had a few minutes at the end of the day to get started on it, and then a spare hour at home during which I managed to get the following program compiling:

var pi16: real(16) = 3.14159265;

writeln(pi16);

using this branch https://github.com/bradcray/chapel/tree/float16 with both the C and LLVM back-ends.

bradcray avatar Mar 27 '25 04:03 bradcray

@damianmoz / @Iainmon : I've got my branch passing the existing testing system now, but haven't added any tests of real(16)/imag(16)/complex(32) as of yet. It feels a bit daunting and I was curious whether either of you had thoughts about what should be tested to have confidence that things are working properly.

bradcray avatar Mar 28 '25 03:03 bradcray

What git command do I need to use to download this custom version of the compiler?

It looks like copy and convert and I/O are implemented. I assume no arithmetic. In what directory/files are the current tests?

damianmoz avatar Mar 28 '25 03:03 damianmoz

I haven't added any real(16)/imag(16)/complex(32) tests yet. That's what I was looking for guidance on. :)

To grab a copy of the branch from an existing, clean git repo, you could do:

$ cd $CHPL_HOME
$ git checkout -b bradc-float-16
$ git pull https://github.com/bradcray/chapel/tree/float16

Or, from a brand-new directory, you could probably do:

$ git clone https://github.com/bradcray/chapel/tree/float16

but that would get you a complete new copy of the whole tree and its history.

bradcray avatar Mar 28 '25 03:03 bradcray

  • define/initialise as param
  • define/initialise as const
  • define as const and var
  • split init
  • copy into var
  • pass as parameter to a routine
  • return as function result or as an out argument
  • transmute to/from uint(16)
  • read and write
  • define as config const and test
  • to/from real(32) and real(64) and likewise complex(?w)
  • copy to/from array elements
  • define as array elements, e.g. const t = [ 1.0, 2.0 ]:real(16);

Is that enough for now? I need to avoid anything with arithmetic.

damianmoz avatar Mar 28 '25 03:03 damianmoz

If you can wait until the weekend and rely on my installation going flawlessly, I can put together some simple tests like the above.

damianmoz avatar Mar 28 '25 04:03 damianmoz

I'm in no rush, and that'd be great!

bradcray avatar Mar 28 '25 04:03 bradcray

Have you made it open to the world? I get

fatal: repository 'https://github.com/bradcray/chapel/tree/float16/' not found

damianmoz avatar Mar 28 '25 04:03 damianmoz

Hmm, yes, but my ability to enter the right git commands is impaired. How about:

$ git checkout -b bradcray-float16 main
$ git pull https://github.com/bradcray/chapel.git float16

bradcray avatar Mar 28 '25 04:03 bradcray

The first command won't work until it knows to what repository it needs to go.

damianmoz avatar Mar 28 '25 04:03 damianmoz