DimensionalData.jl
DimensionalData.jl copied to clipboard
Integer overflow on mean
Taking the elementwise mean of a vector of Integer DimArrays overflows integer bounds
Base:
julia> M= fill(UInt16(32000), 2)
2-element Vector{UInt16}:
0x7d00
0x7d00
julia> mean([M, M]) # sum < maxval
2-element Vector{Float64}:
32000.0
32000.0
julia> mean([M, M, M]) # sum > maxval
2-element Vector{Float64}:
32000.0
32000.0
Whether the sum is greater or less than the integer limits, the answer is correct
Equivalent math with DimArrays:
julia> D = DimArray(M, X(1:2))
╭──────────────────────────────╮
│ 2-element DimArray{UInt16,1} │
├──────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
↓ X Sampled{Int64} 1:2 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
1 0x7d00
2 0x7d00
julia> mean([D, D]) # sum < maxval
╭───────────────────────────────╮
│ 2-element DimArray{Float64,1} │
├───────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
↓ X Sampled{Int64} 1:2 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
1 32000.0
2 32000.0
julia> mean([D, D, D]) # sum > maxval
╭───────────────────────────────╮
│ 2-element DimArray{Float64,1} │
├───────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
↓ X Sampled{Int64} 1:2 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
1 10154.7
2 10154.7
A different answer for mean of two 32000s and three 32000s where the maxval on the int is 65535.
Guessing at this, but is the order of operations: Base: convert to float |> sum |> divide by N DimensionalData: sum |> convert to float |> divide by N
taking the mean of a DimArray by itself does not have any overflow issues.
julia> D3 = DimArray(fill(UInt16(32000),3), X(1:3))
╭──────────────────────────────╮
│ 3-element DimArray{UInt16,1} │
├──────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────── dims ┐
↓ X Sampled{Int64} 1:3 ForwardOrdered Regular Points
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
1 0x7d00
2 0x7d00
3 0x7d00
julia> mean(D3)
32000.0
You order of ops does look like what is happening. But we don't actually do any of the math here.
So this is the difference between a Base specialisation on Vector and the fallback for AbstractArray ? Or something like that.
May be worth a base Julia issue. The sum reduction needs to have init=fill!(similar(A, Float64), 0.0) to force the conversion to Float64. Maybe they don't because that will fail on some array types like StaticArrays.jl.
I followed it into Statistics https://github.com/JuliaStats/Statistics.jl/blob/d49c2bf4f81e1efb4980a35fe39c815ef8396297/src/Statistics.jl#L179-L198
So I think the issue comes out of a difference the results of the type conversion. I've removed the identity, sum and convert:
julia> M= fill(UInt16(32000), 2);
julia> D = DimArray(M, X(1:2));
julia> promote_type(typeof(M/1), typeof(M))
Vector{Float64} (alias for Array{Float64, 1})
julia> promote_type(typeof(D/1), typeof(D))
DimVector{T, Tuple{X{Sampled{Int64, UnitRange{Int64}, ForwardOrdered, Regular{Int64}, Points, NoMetadata}}}, Tuple{}, A, Na, NoMetadata} where {T, A<:AbstractVector{T}, Na} (alias for DimArray{T, 1, Tuple{X{DimensionalData.Dimensions.Lookups.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.Lookups.ForwardOrdered, DimensionalData.Dimensions.Lookups.Regular{Int64}, DimensionalData.Dimensions.Lookups.Points, DimensionalData.Dimensions.Lookups.NoMetadata}}}, Tuple{}, A, Na,
DimensionalData.Dimensions.Lookups.NoMetadata} where {T, A<:AbstractArray{T, 1}, Na})
# Note:
# M isa Array{UInt16,1}
# M/1 isa Array{Float64,1}
# D isa DimArray{UInt16,1}
# D/1 isa DimArray{Float64,1}
I interpret this as
promote_type(Vector{Float64}, Vector{Int64}) => Vector{Float64}
promote_type(DimVector{Float64}, DimVector{Int64}) => DimVector{Any}
instead of DimVector{Float64}, so maybe the following convert doesn't do anything?
Ok interesting, yes Any won't help. Maybe we are missing a promote_type definition.
We could add one that just forwards all the type parameters to promote_type ?
This needs a lot of work so type promotion generally work on dimensions.