f16<->double
Your code to convert between f16 and f64 is a bit opaque to me:
  double sexp_half_to_double(unsigned short x) {
    unsigned int e = (x&0x7C00)>>10,
                 m = (x&0x03FF)<<13,
                 v = float_as_int((float)m)>>23;
    return int_as_float((x&0x8000)<<16
                        | (e!=0)*((e+112)<<23|m)
                        | ((e==0)&(m!=0))*((v-37)<<23|((m<<(150-v))&0x007FE000)));
  }
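For comparison, here is a hypothetical, more explicit decoder (my own sketch using ldexp, not the implementation above) that unpacks the three binary16 fields directly and handles the same cases, just more slowly:

```c
#include <math.h>

/* Hypothetical reference decoder: unpack the binary16 fields explicitly.
   A readable cross-check, not the implementation quoted above. */
double half_to_double_ref(unsigned short x) {
  int sign = (x >> 15) & 1;
  int e    = (x >> 10) & 0x1F;    /* 5-bit biased exponent field */
  int m    = x & 0x3FF;           /* 10-bit mantissa field */
  double v;
  if (e == 0)                     /* zero or subnormal: m * 2^-24 */
    v = ldexp((double)m, -24);
  else if (e == 31)               /* infinity (m == 0) or NaN */
    v = (m == 0) ? INFINITY : NAN;
  else                            /* normal: (1024 + m) * 2^(e - 25) */
    v = ldexp(1024.0 + m, e - 25);
  return sign ? -v : v;
}
```

In this form the encodings discussed below are easy to read off: 0x7C00 (31744) is +inf.0, 0x8000 (32768) is -0.0, and anything with exponent field 31 and a nonzero mantissa is a NaN.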
  unsigned short sexp_double_to_half(double x) {
    unsigned int b = float_as_int(x)+0x00001000,
                 e = (b&0x7F800000)>>23,
                 m = b&0x007FFFFF;
    return (b&0x80000000)>>16
         | (e>112)*((((e-112)<<10)&0x7C00)|m>>13)
         | ((e<113)&(e>101))*((((0x007FF000+m)>>(125-e))+1)>>1)
         | (e>143)*0x7FFF;
  }
It appears that it may be much faster than my code (which uses flscalbn, flilogb, and many tests and jumps).
I have test code for my f16->double and double->f16 conversions, where the f16 representation is an unsigned 16-bit int.
So two questions:
- Can you give a reference for your code?
- Are the C routines sexp_double_to_half and sexp_half_to_double exported to the Scheme level, or are they just used internally to support f16vector-ref and f16vector-set?
Thanks.
Brad
Source: https://arxiv.org/abs/2112.08926
It was actually a placeholder until I could derive something myself, but I may just leave it as is.
There is also some interesting discussion in: https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion. Note that one answer just converts the three float fields (sign, exponent, mantissa) separately, capping the exponent and rounding or truncating the mantissa depending on the direction, then recombining the fields:
Half to float:

  float f = ((h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);

Float to half:

  uint32_t x = *((uint32_t*)&f);
  uint16_t h = ((x>>16)&0x8000)|((((x&0x7f800000)-0x38000000)>>13)&0x7c00)|((x>>13)&0x03ff);
which is in the same spirit as the implementation above but a little too simple: it doesn't handle denormalized values, infinities, etc.
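As a concrete illustration of what goes wrong (my wrapper, using memcpy for the bit reinterpretation), the simple float-to-half formula collapses the infinity exponent back into the normal range, so +inf.0 comes out with the code for 1.0:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* The simple float->half formula from the Stack Overflow answer,
   wrapped in a function so it can be tested. */
uint16_t naive_float_to_half(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof x);   /* reinterpret bits without aliasing issues */
  return ((x>>16)&0x8000)
       | ((((x&0x7f800000)-0x38000000)>>13)&0x7c00)
       | ((x>>13)&0x03ff);
}
/* For INFINITY (exponent field 0xFF), the rebias overflows the 5-bit
   target field and the &0x7c00 mask discards the overflow, leaving
   0x3C00 -- the encoding of 1.0 -- rather than the half infinity 0x7C00. */
```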
Thanks. I used Gambit's C interface to test the code.
It seems that, for doubles exactly halfway between two representable f16 values, the code always rounds away from zero when converting to f16, instead of rounding to even.
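For what it's worth, the usual round-to-nearest-even variant of the +0x1000 trick biases by 0x0FFF plus the parity of the future low mantissa bit. Sketched below for the normal path only (my assumption about where the rounding happens, not a drop-in patch for the code above):

```c
#include <stdint.h>
#include <string.h>

/* Normal-range float->half with round-to-nearest-even on the 13 bits
   that are dropped; no subnormal, overflow, or NaN handling. */
uint16_t float_to_half_normal_rne(float f) {
  uint32_t b;
  memcpy(&b, &f, sizeof b);
  /* 0x0FFF rounds anything strictly above the halfway point up; the
     extra parity bit pushes exact ties toward the even 10-bit mantissa. */
  b += 0x00000FFF + ((b >> 13) & 1);
  return (uint16_t)(((b & 0x80000000) >> 16)
                  | ((((b & 0x7F800000) >> 23) - 112) << 10)
                  | ((b & 0x007FE000) >> 13));
}
```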
It also gets the encoding of infinities and NaNs wrong; it should give
> (double->f16 +inf.0)
31744
> (double->f16 +nan.0)
32767
(the second coding isn't unique, but the mantissa needs to be nonzero) while the code returns
> (double->f16 +inf.0)
32767
> (double->f16 +nan.0)
32768
> (double->f16 -0.0)
32768
> (f16->double 32768)
-0.
That is, 32767 is an encoding of +nan.0, while 32768 is the encoding of -0.0.
I've put a file with my test data at http://www.math.purdue.edu/~lucier/f16-data.txt .
It contains three lists.
The first is a list of doubles to be converted to f16 codes.
The second is a list of the codes associated with those doubles.
And the third is a list of reconstructed doubles from those codes.
If you negate all the data in the first list, the codes should just increase by 32768 (except perhaps for +nan.0).
I suspect that because of the implicit double->float conversion in sexp_double_to_half, the code might be affected by double rounding, too.
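That suspicion is easy to confirm: the intermediate float can absorb exactly the bit that breaks a half-precision tie. A minimal example (assuming IEEE single and double formats; the function name is mine):

```c
/* 1 + 2^-11 is exactly halfway between the halves 1.0 and 1 + 2^-10.
   Adding 2^-40 should break the tie upward, but 2^-40 is less than half
   a float ulp at this magnitude, so double->float rounds it away and a
   subsequent float->half conversion sees the bare tie again. */
int tie_breaker_lost(void) {
  double x = 1.0 + 0x1p-11 + 0x1p-40;
  float f = (float)x;
  return f == (float)(1.0 + 0x1p-11);   /* 1: the 2^-40 term is gone */
}
```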
The rounding I'm not too concerned about, but I'll fix the nan handling.
OK, here are some examples where the finiteness of your results and my results differ.
In each row, the last of the three numbers is the input, the first is your result (double->f16->double), the second is my result.
65520.0 should overflow to +inf.0 in this format. The second group has the sign changed with flcopysign from SRFI 144 (since the sign doesn't make a difference in the printed output of +nan.0).
finiteness_difference 131008.0 +inf.0 +inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference 65536.0 +inf.0 65520.0
finiteness_difference -131008.0 -inf.0 -inf.0
finiteness_difference -131008.0 +nan.0 +nan.0
finiteness_difference -65536.0 -inf.0 -65520.0
Sorry, you were asking how to test this and I didn't reply. The code is exposed in the C API, but from Scheme you have to go indirectly via uniform vectors, for example to test the round trip on +inf.0:
$ chibi-scheme -msrfi.160.mini -p'(f16vector-ref #f16(+inf.0) 0)'
131008.0
I figured that out, but it seemed that you could test only double -> f16 code -> double round trips, and if you found a problem you couldn’t tell whether the problem is in the coding or the decoding. That’s why I ended up testing the C code you’re using with Gambit’s C interface.
SRFI 231's sample implementation has functions for coding and decoding (as parameterized Scheme macros); you might find them of interest.