conversion
Just found this
julia> Float8(0.015)
Float8(0.0625)
julia> Float8(0.02)
Float8(0.015625)
although 0.015 <= 0.02 that's not true after conversion. Problem only occurs for subnormals. It's not the conversion back to Float32 (triggered by show):
julia> bitstring(Float8(0.015))
"00000100"
julia> bitstring(Float8(0.02))
"00000001"
@JeffreySarnoff do you have an idea? There's something wrong with the rounding for subnormals I assume...
https://github.com/milankl/Float8s.jl/blob/9345556ac3d11f61bef2c70f49947113b8949066/src/float8.jl#L164-L193
I'll look at it tonight.
still looking, meanwhile
Base.Float32(x::Float8) = Float8toFloat32[reinterpret(UInt8,x) + 0x01]
const Float8toFloat32 = Float32[0.0, 0.015625, 0.03125, 0.046875, 0.0625, 0.078125, 0.09375, 0.109375, 0.125, 0.140625, 0.15625, 0.171875, 0.1875, 0.203125, 0.21875, 0.234375, 0.25, 0.265625, 0.28125, 0.296875, 0.3125, 0.328125, 0.34375, 0.359375, 0.375, 0.390625, 0.40625, 0.421875, 0.4375, 0.453125, 0.46875, 0.484375, 0.5, 0.53125, 0.5625, 0.59375, 0.625, 0.65625, 0.6875, 0.71875, 0.75, 0.78125, 0.8125, 0.84375, 0.875, 0.90625, 0.9375, 0.96875, 1.0, 1.0625, 1.125, 1.1875, 1.25, 1.3125, 1.375, 1.4375, 1.5, 1.5625, 1.625, 1.6875, 1.75, 1.8125, 1.875, 1.9375, 2.0, 2.125, 2.25, 2.375, 2.5, 2.625, 2.75, 2.875, 3.0, 3.125, 3.25, 3.375, 3.5, 3.625, 3.75, 3.875, 4.0, 4.25, 4.5, 4.75, 5.0, 5.25, 5.5, 5.75, 6.0, 6.25, 6.5, 6.75, 7.0, 7.25, 7.5, 7.75, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0, 15.5, Inf32, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, -0.0, -0.015625, -0.03125, -0.046875, -0.0625, -0.078125, -0.09375, -0.109375, -0.125, -0.140625, -0.15625, -0.171875, -0.1875, -0.203125, -0.21875, -0.234375, -0.25, -0.265625, -0.28125, -0.296875, -0.3125, -0.328125, -0.34375, -0.359375, -0.375, -0.390625, -0.40625, -0.421875, -0.4375, -0.453125, -0.46875, -0.484375, -0.5, -0.53125, -0.5625, -0.59375, -0.625, -0.65625, -0.6875, -0.71875, -0.75, -0.78125, -0.8125, -0.84375, -0.875, -0.90625, -0.9375, -0.96875, -1.0, -1.0625, -1.125, -1.1875, -1.25, -1.3125, -1.375, -1.4375, -1.5, -1.5625, -1.625, -1.6875, -1.75, -1.8125, -1.875, -1.9375, -2.0, -2.125, -2.25, -2.375, -2.5, -2.625, -2.75, -2.875, -3.0, -3.125, -3.25, -3.375, -3.5, -3.625, -3.75, -3.875, -4.0, -4.25, -4.5, -4.75, -5.0, -5.25, -5.5, -5.75, -6.0, -6.25, -6.5, -6.75, -7.0, -7.25, -7.5, -7.75, -8.0, -8.5, -9.0, -9.5, -10.0, -10.5, -11.0, -11.5, -12.0, -12.5, -13.0, -13.5, -14.0, -14.5, -15.0, -15.5, -Inf32, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN];
Yeah great, I always forget how small 8bit actually is. Will check how much faster that is!
Yep, a lookup table made it about 2x faster. I've added an @inbounds which gave another 2x speed-up.
Base.Float32(x::Float8) = @inbounds Float8toFloat32[reinterpret(UInt8,x) + 0x01]
The table method does not appear to round properly.
I put this together. The only function to call directly
is Float8(x::Float32) which accepts a Float32 value
and yields the corresponding (round nearest) Float8.
# the constants
# ---------------
#=
One table is split into subsections of 16 entries each.
This keeps the scan time minimal, even after accounting
for the conditionals that select the proper subsection.
These are Float8 values offset by 1/2 way to the next value,
this lets `findfirst` return an appropriately rounded value.
The `F8offsetN` tuples are used with normal values,
values that are finite and are not subnormal. Values
that round to zero are handled before these are used.
=#
const F8offset1 = (Float32[
0.2578125, 0.2734375, 0.2890625, 0.3046875, 0.3203125, 0.3359375,
0.3515625, 0.3671875, 0.3828125, 0.3984375, 0.4140625, 0.4296875,
0.4453125, 0.4609375, 0.4765625, 0.4921875]...,)
const F8offset2 = (Float32[
0.515625, 0.546875, 0.578125, 0.609375, 0.640625, 0.671875,
0.703125, 0.734375, 0.765625, 0.796875, 0.828125, 0.859375,
0.890625, 0.921875, 0.953125, 0.984375]...,);
const F8offset3 = (Float32[
1.03125, 1.09375, 1.15625, 1.21875, 1.28125, 1.34375, 1.40625,
1.46875, 1.53125, 1.59375, 1.65625, 1.71875, 1.78125, 1.84375,
1.90625, 1.96875]...,);
const F8offset4 = (Float32[
2.0625, 2.1875, 2.3125, 2.4375, 2.5625, 2.6875, 2.8125,
2.9375, 3.0625, 3.1875, 3.3125, 3.4375, 3.5625, 3.6875,
3.8125, 3.9375]...,);
const F8offset5 = (Float32[
4.125, 4.375, 4.625, 4.875, 5.125, 5.375, 5.625, 5.875, 6.125,
6.375, 6.625, 6.875, 7.125, 7.375, 7.625, 7.875]...,);
const F8offset6 = (Float32[
8.25, 8.75, 9.25, 9.75, 10.25, 10.75, 11.25, 11.75, 12.25, 12.75,
13.25, 13.75, 14.25, 14.75, 15.25, 15.75]...,);
#=
There is one table used with subnormal values.
It is derived from the actual values of each
subnormal quantity, shifted up halfway to the
next subnormal. This lets scanning also round.
An initial (anchor) value is prepended, that value
is half of the smallest subnormal.
A corresponding table of UInt8 values also is used.
=#
const F8offset_subnormal = (
0.0078125f0, 0.0234375f0, 0.0390625f0, 0.0546875f0, 0.0703125f0,
0.0859375f0, 0.1015625f0, 0.1171875f0, 0.1328125f0, 0.1484375f0,
0.1640625f0, 0.1796825f0, 0.1953125f0, 0.2109375f0, 0.2265625f0,
0.2421875f0)
const U8subnormal = (collect(UInt8.(0:15))...,)
# some named constants to clarify the source text
const roundsto_floatmax8 = 15.25f0
const roundsto_zero8 = 0.0078125f0
const roundsto_subnormal = 0.2421875f0
const floatmaxplus8 = 15.75f0 # floatmax(Float8) + floatmin(Float8)
const UNaN8 = 0x78
const UInf8 = 0x70
const UFloatmax8 = 0x6f
const UFloatmin8 = 0x10
const UZero8 = 0x00
# the functions
# ----------------
function Float8(x::Float32)
s, absx = signbit(x), abs(x)
ui =toUInt8(x)
return reinterpret(Float8, ui)
end
function toUInt8(x::Float32)
s, absx = signbit(x), abs(x)
isnan(absx) && return s ? UNaN8|0x80 : UNaN8
if absx >= roundsto_floatmax8
if absx > floatmaxplus8
return s ? UInf8|0x80 : UInf8
else
return s ? UFloatmax8|0x80 : UFloatmax8
end
end
if absx < roundsto_zero8
return s ? UZero8|0x80 : UZero8
elseif absx < roundsto_subnormal
return subnormal8(s, absx)
end
absx = min(15.5f0, max(0.25f0, absx))
return normal8(s, absx)
end
@inline function subnormal8(s::Bool, absx::Float32)
idx = findfirst(a->absx <= a, F8offset_subnormal)
return s ? U8subnormal[idx] | 0x80 : U8subnormal[idx]
end
function normal8(s::Bool, absx::Float32)
if absx <= 0.4921875f0
idx = UInt8(15+findfirst(a->absx <= a, F8offset1))
return s ? idx|0x80 : idx
elseif absx <= 0.984375f0
idx = UInt8(15+16+findfirst(a->absx <= a, F8offset2))
return s ? idx|0x80 : idx
elseif absx <= 1.96875f0
idx = UInt8(15+32+findfirst(a->absx <= a, F8offset3))
return s ? idx|0x80 : idx
elseif absx <= 3.9375f0
idx = UInt8(15+32+16+findfirst(a->absx <= a, F8offset4))
return s ? idx|0x80 : idx
elseif absx <= 7.875f0
idx = UInt8(15+64+findfirst(a->absx <= a, F8offset5))
return s ? idx|0x80 : idx
elseif absx <= 15.75f0
idx = UInt8(15+64+16+findfirst(a->absx <= a, F8offset6))
return s ? idx|0x80 : idx
else
throw(DomainError(absx,"not normal for Float8"))
end
end
Amazing Jeffrey, I'll implement that in the next days, thanks so much!
note correction in toUInt8
absx = min(15.5f0, max(0.25f0, absx))
This is yet faster. It uses the fact that the original discrimination ensures any input to normal8 will be found in firstof16lte and splits the search in normal8.
function normal8(s::Bool, absx::Float32)
if absx <= 1.96875f0
if absx <= 0.4921875f0
idx = UInt8(15+firstof16lte(absx, F8offset1))
return s ? idx|0x80 : idx
elseif absx <= 0.984375f0
idx = UInt8(15+16+firstof16lte(absx, F8offset2))
return s ? idx|0x80 : idx
else
idx = UInt8(15+32+firstof16lte(absx, F8offset3))
return s ? idx|0x80 : idx
end
else
if absx <= 3.9375f0
idx = UInt8(15+32+16+firstof16lte(absx, F8offset4))
return s ? idx|0x80 : idx
elseif absx <= 7.875f0
idx = UInt8(15+64+firstof16lte(absx, F8offset5))
return s ? idx|0x80 : idx
else
idx = UInt8(15+64+16+firstof16lte(absx, F8offset6))
return s ? idx|0x80 : idx
end
end
end
function firstof16lte(needle, haystack)
for idx = 1:16
if needle <= haystack[idx]
return idx
end
end
error("should not be reached")
end
Great, yeah, master now contains your version, as mine contained a bug. However, I get considerably slower timings
julia> A = rand(Float32,300,300) .+ 1f0; # no subnormals
julia> @btime Float8.($A); # Jeffrey's version
2.815 ms (2 allocations: 88.02 KiB)
julia> @btime Float8_old.($A); # Milan's version (only correct for normals)
1.652 ms (2 allocations: 88.02 KiB)
julia> @btime Float16.($A);
1.065 ms (2 allocations: 175.89 KiB)
Any idea where this comes from?
You could use a hybrid, yours instead of my normal8 iff isnormal, keeping mine for issubnormal