FreqTables.jl icon indicating copy to clipboard operation
FreqTables.jl copied to clipboard

Add pairwise

Open nalimilan opened this issue 4 years ago • 12 comments

It works well overall but there are a few issues:

  • When vectors have different types, promote_typejoin is used (via broadcast) to choose element type. promote_type would be more appropriate but there is no mechanism to do this in Base.
  • When skipping missing values, inference isn't able to realize that the result cannot be missing. This problem is fixed if I use a positional argument rather than a keyword argument for skipmissing but it's far from ideal for users.
  • Since eachrow(df::DataFrame) returns a DataFrameRows objects which is also a table, pairwise(cor, eachrow(df)) still computes correlation between columns rather than between rows. And anyway DataFrameRow is not accepted by cor since it's not an AbstractVector. One needs something like (Vector(r) for r in eachrow(df)) to work around this limitation.
  • Since Tables.jl objects can be of any type, we must have single method that performs dispatch internally by calling Tables.istable. Even AbstractVector inputs can be either vectors of vectors or row-oriented or column-oriented tables. This means the method that returns a named array and the method that returns a plain matrix have to live in the same package (which has to depend on NamedArrays).

The package should probably be renamed if we want to include this as it's clearly not a frequency table.

nalimilan avatar Nov 12 '20 21:11 nalimilan

I like it very much. The only question is if it would be hard/needed to make it more general and allow not only pairwise but arbitrary dimensional cross-product (not sure if it is needed though).

bkamins avatar Nov 14 '20 18:11 bkamins

OK, thanks. It shouldn't be hard to support more types when we don't skip missing values, but it's more tricky when skipping them. Not sure it's worth adding that complexity right now give the intended use cases. EDIT: sorry I hadn't understood your point (I thought it was about allowing any iterators and not just vectors). Yes I imagine it would be possible to accept more than two arguments, but then it wouldn't be pairwise anymore, would it? I guess it would be outer. :-)

nalimilan avatar Nov 14 '20 18:11 nalimilan

Maybe just anticipate the use for more than 2 arguments and rename it, without implementing it yet.

mschauer avatar Nov 19 '20 10:11 mschauer

But I find it much more user-friendly to tell people "to compute pairwise correlation, use pairwise(cor, df)" than "use outer(cor, df)". Also pairwise is already the name used in Distances for this operation and I don't see it moving to outer.

nalimilan avatar Nov 19 '20 10:11 nalimilan

I like calling it pairwise, because it does some specific things which make sense in the statistical/data context and not so much as generic map on all tuples (e.g. dealing with cor and missing).

mschauer avatar Nov 19 '20 11:11 mschauer

I'm fine adding pairwise to DataAPI.jl if that'd be helpful.

quinnj avatar Nov 20 '20 06:11 quinnj

@dkarrasch I don't know whether you've seen this. Ideally this would cohabit nicely with Distances.jl, but the new support for passing vectors of vectors there (https://github.com/JuliaStats/Distances.jl/pull/188 and https://github.com/JuliaStats/Distances.jl/pull/194) will create ambiguities. Not sure what to do about it.

nalimilan avatar Dec 08 '20 08:12 nalimilan

No, I haven't seen this. The issue is that we relax types in those PRs, right? Is there any way you could depend on Distances here, and then define your own metric types that have fields corresponding to the keyword arguments you use here, and finally extend Distances.pairwise?

dkarrasch avatar Dec 08 '20 09:12 dkarrasch

The problem is that Tables.jl objects don't have a particular type. Any object can be a table, including notably vectors or iterators of names tuples. As long as Distances only defines methods for matrices or vectors, the conflict with this PR is quite limited: only Tables.jl objects that are also vectors are problematic (including Vector{<: NamedTuple}}, but I can live with that). But if Distances accepts any iterator (https://github.com/JuliaStats/Distances.jl/pull/194) then the method from this PR will never get called when the first argument is a metric supported by Distances, since Distances methods are more specific. So even depending on Distances here wouldn't fix the dispatch problem.

The only solution AFAICT is to have the most general function check whether an input is a table in the Tables.jl sense, and adapt its behavior depending on that. But @KristofferC didn't like Distances depend on Tables.jl (https://github.com/JuliaStats/Distances.jl/pull/123), and there's the additional problem that we also want to return a NamedArray (or a similar type) in that case, which adds another dependency.

We could maybe work around this problem by having Distances check that the eltype is <: Union{Number, Missing} and if not bail out to a function that could live in DataAPI and that this PR would provide when the package is loaded (might be FreqTables or another more general package).

At any rate these issues are tricky. But I think the stats ecosystem would really benefit from having a general pairwise method that works with any function, with Distances providing optimized support for the metrics it defines.

nalimilan avatar Dec 10 '20 13:12 nalimilan

Maybe for the time being we might require FreqTables.pairwise and Distances.pairwise to be written if both packages are loaded (which is somewhat likely but not super likely)? Essentially - making the pairwise function to be owned by the package.

bkamins avatar Dec 10 '20 15:12 bkamins

@dkarrasch Regarding your comment at https://github.com/JuliaStats/Distances.jl/pull/188#pullrequestreview-550171749 (better keep the discussion in a single place):

@nalimilan I wonder if it helps FreqTables.jl if we made these pairwise methods fully generic for iterators, and removed the type constraints on a and b. Then the Distances.jl pairwise method would act metric::PreMetric.

Making pairwise fully generic for iterators is what https://github.com/JuliaStats/Distances.jl/pull/194 does, right? I don't think that would help for the reasons I gave above: Distances would still define the most specific method, so the one in this PR wouldn't have a chance of catching e.g. pairwise(Euclidean(), DataFrame(...)).

nalimilan avatar Dec 11 '20 16:12 nalimilan

After discussing this on Slack with @bkamins and @quinnj I think the way forward is to require passing iterators over rows or columns of Tables.jl objects explicitly, via Tables.AbstractRows and Tables.AbstractColumns, or Tables.jl objects wrapped in explicit types. Then we can have fallback methods taking any iterator in StatsBase and Distances, with a common empty function definition in a package like DataAPI or a future StatsAPI. See https://github.com/JuliaStats/StatsBase.jl/pull/627 for the StatsBase part.

nalimilan avatar Jan 01 '21 16:01 nalimilan