DimensionalData.jl icon indicating copy to clipboard operation
DimensionalData.jl copied to clipboard

Integer overflow on mean

Open DanDeepPhase opened this issue 1 year ago • 4 comments

Taking the elementwise mean of a vector of Integer DimArrays overflows integer bounds

Base:

julia> M= fill(UInt16(32000), 2)
2-element Vector{UInt16}:
 0x7d00
 0x7d00

julia> mean([M, M])               # sum < maxval
2-element Vector{Float64}:
 32000.0
 32000.0

julia> mean([M, M, M])          # sum > maxval
2-element Vector{Float64}:
 32000.0
 32000.0

Whether the sum is greater or less than the integer limits, the answer is correct

Equivalent math with DimArrays:

julia> D = DimArray(M, X(1:2))
╭──────────────────────────────╮
│ 2-element DimArray{UInt16,1} │
├──────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Int64} 1:2 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 1  0x7d00
 2  0x7d00

julia> mean([D, D])          # sum < maxval
╭───────────────────────────────╮
│ 2-element DimArray{Float64,1} │
├───────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Int64} 1:2 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 1  32000.0
 2  32000.0

julia> mean([D, D, D])          # sum > maxval
╭───────────────────────────────╮
│ 2-element DimArray{Float64,1} │
├───────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Int64} 1:2 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 1  10154.7
 2  10154.7

A different answer for mean of two 32000s and three 32000s where the maxval on the int is 65535.

Guessing at this, but is the order of operations: Base: convert to float |> sum |> divide by N DimensionalData: sum |> convert to float |> divide by N

taking the mean of a DimArray by itself does not have any overflow issues.

julia> D3 = DimArray(fill(UInt16(32000),3), X(1:3))
╭──────────────────────────────╮
│ 3-element DimArray{UInt16,1} │
├──────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
  ↓ X Sampled{Int64} 1:3 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 1  0x7d00
 2  0x7d00
 3  0x7d00

julia> mean(D3)
32000.0

DanDeepPhase avatar Nov 15 '24 21:11 DanDeepPhase

You order of ops does look like what is happening. But we don't actually do any of the math here.

So this is the difference between a Base specialisation on Vector and the fallback for AbstractArray ? Or something like that.

May be worth a base Julia issue. The sum reduction needs to have init=fill!(similar(A, Float64), 0.0) to force the conversion to Float64. Maybe they don't because that will fail on some array types like StaticArrays.jl.

rafaqz avatar Nov 16 '24 09:11 rafaqz

I followed it into Statistics https://github.com/JuliaStats/Statistics.jl/blob/d49c2bf4f81e1efb4980a35fe39c815ef8396297/src/Statistics.jl#L179-L198

So I think the issue comes out of a difference the results of the type conversion. I've removed the identity, sum and convert:

julia> M= fill(UInt16(32000), 2);
julia> D = DimArray(M, X(1:2));

julia> promote_type(typeof(M/1), typeof(M))
Vector{Float64} (alias for Array{Float64, 1})

julia> promote_type(typeof(D/1), typeof(D))
DimVector{T, Tuple{X{Sampled{Int64, UnitRange{Int64}, ForwardOrdered, Regular{Int64}, Points, NoMetadata}}}, Tuple{}, A, Na, NoMetadata} where {T, A<:AbstractVector{T}, Na} (alias for DimArray{T, 1, Tuple{X{DimensionalData.Dimensions.Lookups.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.Lookups.ForwardOrdered, DimensionalData.Dimensions.Lookups.Regular{Int64}, DimensionalData.Dimensions.Lookups.Points, DimensionalData.Dimensions.Lookups.NoMetadata}}}, Tuple{}, A, Na, 
DimensionalData.Dimensions.Lookups.NoMetadata} where {T, A<:AbstractArray{T, 1}, Na})

# Note:
# M   isa Array{UInt16,1}
# M/1 isa Array{Float64,1}
# D   isa DimArray{UInt16,1}
# D/1 isa DimArray{Float64,1}

I interpret this as

promote_type(Vector{Float64}, Vector{Int64}) => Vector{Float64}
promote_type(DimVector{Float64}, DimVector{Int64}) => DimVector{Any}

instead of DimVector{Float64}, so maybe the following convert doesn't do anything?

DanDeepPhase avatar Nov 18 '24 22:11 DanDeepPhase

Ok interesting, yes Any won't help. Maybe we are missing a promote_type definition.

We could add one that just forwards all the type parameters to promote_type ?

rafaqz avatar Nov 18 '24 23:11 rafaqz

This needs a lot of work so type promotion generally work on dimensions.

rafaqz avatar Jan 30 '25 18:01 rafaqz