colwise!(dist, res, A, B) does not consider eltype(res)
Here's a test case (Julia 1.11.2, Distances.jl 0.10.12):
X = reshape(Int16[0, 0], 2, 1)
Y = reshape(Int16[1000, 1000], 2, 1)
julia> colwise!(SqEuclidean(), Matrix{Float64}(undef, 1, 1), X, Y)
1×1 Matrix{Float64}:
-31616.0
julia> pairwise!(SqEuclidean(), Matrix{Float64}(undef, 1, 1), X, Y, dims=2)
1×1 Matrix{Float64}:
33920.0
I think it's rather the other way around: it's pure luck (and due to a specific implementation) that pairwise! works here:
julia> SqEuclidean()(vec(X), vec(Y))
-31616
But we may transfer the same trick from pairwise! to colwise!.
Maybe the core method for the distance API should be dist(T, x, y) with dist(x, y) = dist(result_type(x, y), x, y)
But what would that help? For your Int16 example, you'd get result_type(x, y) = Int16, and there you get your overflow. To prevent it, you would need to know the target type in the middle of the computation, which is as good as knowing it upfront, where you can change the eltype of your input to begin with.
That will help with colwise!()-like methods, where the eltype of the result is known in advance (but currently it is ignored): colwise!() will call dist(T, x, y) directly.
In other situations the current behaviour (overflow) will be preserved.
To prevent the overflow, you would need to promote your input (as you can see from your OP), which you can do as the user/package author. I don't think we want to silently promote eltypes here.
These matrices might be large, so promoting Int16 to Float64 just to tell colwise!() how to calculate the distance (which it is currently doing element-wise) is a bit of an overkill.
Also at the moment pairwise!() and colwise!() behaviors are different.
I agree that probably there's no "best" automatic behavior of how to convert, but ideally Distances.jl can provide an API that allows controlling it, so that the users don't have to implement workarounds.
Also at the moment
pairwise!()andcolwise!()behaviors are different.
That's why I said the fact that pairwise! works is rather luck. It's due to this performance optimization, which doesn't make sense for colwise!:
https://github.com/JuliaStats/Distances.jl/blob/35c6d0dfec5b9b93481f006dbf8c1d0496ecf5f6/src/metrics.jl#L616-L617
So, what you're asking for is not about controlling the target type, but a preprocessing of the data, which you can currently do like:
X = reshape(Int16[0, 0], 2, 1)
Y = reshape(Int16[1000, 1000], 2, 1)
using Distances
map((x, y) -> sqeuclidean(float(x), float(y)), eachcol(X), eachcol(Y))
map!((x, y) -> sqeuclidean(float(x), float(y)), Matrix{Float64}(undef, 1, 1), eachcol(X), eachcol(Y))
Note that the pairwise! computation is also incorrect, for the same reason. It doesn't preprocess the data, so calculates with Int16, for which Int16(1000)^2 is already incorrect.
Thank you for the detailed explanations!
Implementing the workaround in the user script is probably the optimal strategy now, but look at it from the perspective of the other packages, in particular Clustering.jl.
- The input data may be of different eltypes. I think we agree that implementing the logic to properly handle all eltype and metric permutations in Clustering.jl is not a good idea.
- Also, the input data could be very large, and the users might be already struggling with the memory usage. So just internally converting the data into an appropriate eltype is not an option. Conversion could be done in the user script. But right now -- if the user is not doing a conversion -- the resulting distances could be incorrect, and the user will have no visibility to that.
It was my assumption that Distances.jl -- in addition to defining the metric types -- also implements the efficient distances calculation for the large datasets, so it could be used as the plug-in solutions to these problems.
AFAIU, the immediate solution is to check whether the distances are negative if SemiMetric is used and warn the user. That could be done on the Clustering.jl side, but it could also be done in Distances.jl since the same problem may manifest itself in other contexts.