FreqTables.jl
FreqTables.jl copied to clipboard
Add pairwise
It works well overall but there are a few issues:
- When vectors have different types,
promote_typejoin
is used (viabroadcast
) to choose element type.promote_type
would be more appropriate but there is no mechanism to do this in Base. - When skipping missing values, inference isn't able to realize that the result cannot be
missing
. This problem is fixed if I use a positional argument rather than a keyword argument forskipmissing
but it's far from ideal for users. - Since
eachrow(df::DataFrame)
returns aDataFrameRows
objects which is also a table,pairwise(cor, eachrow(df))
still computes correlation between columns rather than between rows. And anywayDataFrameRow
is not accepted bycor
since it's not anAbstractVector
. One needs something like(Vector(r) for r in eachrow(df))
to work around this limitation. - Since Tables.jl objects can be of any type, we must have single method that performs dispatch internally by calling
Tables.istable
. EvenAbstractVector
inputs can be either vectors of vectors or row-oriented or column-oriented tables. This means the method that returns a named array and the method that returns a plain matrix have to live in the same package (which has to depend on NamedArrays).
The package should probably be renamed if we want to include this as it's clearly not a frequency table.
I like it very much. The only question is if it would be hard/needed to make it more general and allow not only pairwise but arbitrary dimensional cross-product (not sure if it is needed though).
OK, thanks. It shouldn't be hard to support more types when we don't skip missing values, but it's more tricky when skipping them. Not sure it's worth adding that complexity right now give the intended use cases. EDIT: sorry I hadn't understood your point (I thought it was about allowing any iterators and not just vectors). Yes I imagine it would be possible to accept more than two arguments, but then it wouldn't be pairwise anymore, would it? I guess it would be outer
. :-)
Maybe just anticipate the use for more than 2 arguments and rename it, without implementing it yet.
But I find it much more user-friendly to tell people "to compute pairwise correlation, use pairwise(cor, df)
" than "use outer(cor, df)
". Also pairwise
is already the name used in Distances for this operation and I don't see it moving to outer
.
I like calling it pairwise
, because it does some specific things which make sense in the statistical/data context and not so much as generic map on all tuples (e.g. dealing with cor
and missing
).
I'm fine adding pairwise
to DataAPI.jl if that'd be helpful.
@dkarrasch I don't know whether you've seen this. Ideally this would cohabit nicely with Distances.jl, but the new support for passing vectors of vectors there (https://github.com/JuliaStats/Distances.jl/pull/188 and https://github.com/JuliaStats/Distances.jl/pull/194) will create ambiguities. Not sure what to do about it.
No, I haven't seen this. The issue is that we relax types in those PRs, right? Is there any way you could depend on Distances here, and then define your own metric types that have fields corresponding to the keyword arguments you use here, and finally extend Distances.pairwise
?
The problem is that Tables.jl objects don't have a particular type. Any object can be a table, including notably vectors or iterators of names tuples. As long as Distances only defines methods for matrices or vectors, the conflict with this PR is quite limited: only Tables.jl objects that are also vectors are problematic (including Vector{<: NamedTuple}}
, but I can live with that). But if Distances accepts any iterator (https://github.com/JuliaStats/Distances.jl/pull/194) then the method from this PR will never get called when the first argument is a metric supported by Distances, since Distances methods are more specific. So even depending on Distances here wouldn't fix the dispatch problem.
The only solution AFAICT is to have the most general function check whether an input is a table in the Tables.jl sense, and adapt its behavior depending on that. But @KristofferC didn't like Distances depend on Tables.jl (https://github.com/JuliaStats/Distances.jl/pull/123), and there's the additional problem that we also want to return a NamedArray
(or a similar type) in that case, which adds another dependency.
We could maybe work around this problem by having Distances check that the eltype is <: Union{Number, Missing}
and if not bail out to a function that could live in DataAPI and that this PR would provide when the package is loaded (might be FreqTables or another more general package).
At any rate these issues are tricky. But I think the stats ecosystem would really benefit from having a general pairwise
method that works with any function, with Distances providing optimized support for the metrics it defines.
Maybe for the time being we might require FreqTables.pairwise
and Distances.pairwise
to be written if both packages are loaded (which is somewhat likely but not super likely)? Essentially - making the pairwise
function to be owned by the package.
@dkarrasch Regarding your comment at https://github.com/JuliaStats/Distances.jl/pull/188#pullrequestreview-550171749 (better keep the discussion in a single place):
@nalimilan I wonder if it helps FreqTables.jl if we made these pairwise methods fully generic for iterators, and removed the type constraints on a and b. Then the Distances.jl pairwise method would act metric::PreMetric.
Making pairwise
fully generic for iterators is what https://github.com/JuliaStats/Distances.jl/pull/194 does, right? I don't think that would help for the reasons I gave above: Distances would still define the most specific method, so the one in this PR wouldn't have a chance of catching e.g. pairwise(Euclidean(), DataFrame(...))
.
After discussing this on Slack with @bkamins and @quinnj I think the way forward is to require passing iterators over rows or columns of Tables.jl objects explicitly, via Tables.AbstractRows
and Tables.AbstractColumns
, or Tables.jl objects wrapped in explicit types. Then we can have fallback methods taking any iterator in StatsBase and Distances, with a common empty function definition in a package like DataAPI or a future StatsAPI. See https://github.com/JuliaStats/StatsBase.jl/pull/627 for the StatsBase part.