This PR implements the SplittablesBase.jl interface function halve on RowIterator and NamedTupleIterator. This lets us use parallel reductions built on top of SplittablesBase.jl, such as those in Transducers.jl, ThreadsX.jl, and FLoops.jl:

using Tables
table = Tables.rows((key = 1:1000, value = randn(1000)))

using FLoops
using UnPack: @unpack

@floop for row in table
    @unpack key, value = row
    @reduce() do (kmax; key), (vmax; value)
        if vmax < value
            vmax = value
            kmax = key
        end
    end
end
@show kmax vmax

A tricky part of this PR is that, since SplittablesTesting.test_ordered uses isequal to compare items (rows), I needed to relax isequal to ignore the storage type of columns. The difference is that

isequal(
    first(Tables.rows((a = view([0], 1:1),))),
    first(Tables.rows((a = [0],))),
)

is false before this PR and true after it. I think it makes sense for ColumnsRow values to be compared as if they were lowered to NamedTuples. This is also consistent with the fact that equality on arrays ignores the array type:

julia> [0] == view([0], 1:1)
true
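To make "compared as if they are lowered to NamedTuples" concrete, here is a minimal sketch. The helper name lower is hypothetical and is not part of Tables.jl or this PR; it only uses Tables.columnnames and Tables.getcolumn, and it assumes the row's column names come back as Symbols.

```julia
using Tables

# Hypothetical helper: copy a row's values into a plain NamedTuple,
# discarding any information about how the columns are stored.
function lower(row)
    names = Tuple(Tables.columnnames(row))
    values = Tuple(Tables.getcolumn(row, name) for name in names)
    return NamedTuple{names}(values)
end

row_from_view = first(Tables.rows((a = view([0], 1:1),)))
row_from_vector = first(Tables.rows((a = [0],)))

# Both rows lower to (a = 0,), so the lowered values compare equal
# regardless of the storage type of the underlying column.
lowered_equal = isequal(lower(row_from_view), lower(row_from_vector))
```

With this PR, isequal on the rows themselves is meant to agree with lowered_equal.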

What do you think?

Aug 10 '20 00:08 tkf

Codecov Report

Merging #187 into master will increase coverage by 0.09%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
+ Coverage   96.71%   96.80%   +0.09%     
==========================================
  Files           6        6              
  Lines         456      469      +13     
==========================================
+ Hits          441      454      +13     
  Misses         15       15

Impacted Files        Coverage Δ
src/Tables.jl         92.92% <ø> (ø)
src/fallbacks.jl      97.69% <100.00%> (+0.21%)
src/namedtuples.jl    98.27% <100.00%> (+0.06%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4a875bf...ecab1ac.

Aug 10 '20 00:08 codecov[bot]

Can you share a bit more on the motivation here? Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages. Questions that pop up in my mind:

What are some examples of what you could do w/ the implementation here?
Why only implemented for NamedTupleIterator and not rows/columns more generally?
What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?

Aug 11 '20 04:08 quinnj

Hi, thanks for the response and sorry for this late reply.

Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages.

Yes, I understand this and I should've clarified what SplittablesBase.jl is.

Essentially, I am hoping that halve will become fundamental infrastructure for parallel processing in Julia, in the same sense that iterate is currently fundamental infrastructure for sequential processing. The goal is to make halve+iterate (or halve+foldl) the interface between data structures (tables, arrays, dicts, sets, strings, ...) and parallel processing functions (map, reduce, group-by, join, ...).
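As a rough illustration of the halve+foldl idea (a sketch only, not code from this PR or from any package): the helper name dac_foldl and its basesize keyword are hypothetical, and the example assumes that the generic halve in SplittablesBase.jl accepts a plain Vector and that op is associative.

```julia
using SplittablesBase: halve

# Hypothetical divide-and-conquer fold: split with `halve` until the
# pieces are small, fold each piece sequentially with `foldl`, then
# combine the partial results with `op`.
function dac_foldl(op, xs; basesize = 4)
    if length(xs) <= basesize
        return foldl(op, xs)          # sequential base case
    end
    left, right = halve(xs)
    # Process the right half in another task while this task handles the left.
    task = Threads.@spawn dac_foldl(op, right; basesize = basesize)
    return op(dac_foldl(op, left; basesize = basesize), fetch(task))
end

total = dac_foldl(+, collect(1:100))
```

Any collection that supports halve, length, and foldl could be passed in place of the Vector, which is the point of treating halve as the shared interface.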

I probably should open an RFC in JuliaLang/julia but I've been a bit hesitant to do so since I don't feel like this interface is tested outside my packages. I thought of this PR as a step toward accumulating such experience.

What are some examples of what you could do w/ the implementation here?

The example in the OP with FLoops.jl is one thing. We'd also be able to use ThreadsX with this. I think ThreadsX.jl + OnlineStats.jl integration would be appealing to Tables.jl users. Underneath, they all boil down to Transducers.foldxt, which uses halve. For example, you can compute the min of the row-wise max and the max of the row-wise min over columns a, b and c with foldxt(ProductRF(min, max), table |> Map(r -> (max(r.a, r.b, r.c), min(r.a, r.b, r.c)))) in one go (OK, I have no idea when you need this particular function but it's a fun example).
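To spell out what that one-liner computes, here is a sequential, Base-only sketch of the same reduction. It does not use Transducers.jl or halve; the sample rows and the helper names per_row and combine are made up for illustration.

```julia
# Made-up sample data: a Vector of NamedTuples standing in for table rows.
rows = [(a = 1, b = 5, c = 3), (a = 4, b = 2, c = 6), (a = 7, b = 8, c = 0)]

# Per row: (largest value, smallest value) across columns a, b and c.
per_row(r) = (max(r.a, r.b, r.c), min(r.a, r.b, r.c))

# Combine two partial results: min over the row maxima, max over the row minima.
combine(x, y) = (min(x[1], y[1]), max(x[2], y[2]))

result = mapreduce(per_row, combine, rows)
```

Because combine is associative, the same reduction can be split across tasks, which is what foldxt does with the help of halve.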

Why only implemented for NamedTupleIterator and not rows/columns more generally?

If it is already an array, the generic fallback in SplittablesBase.jl covers it already. So, I don't need to add a specific implementation for RowTable. I can't provide an implementation for AbstractColumns (or column tables in general) because I'd like to keep halve and iterate consistent, in the sense that each halve implementation satisfies what I call the "vcat law":

(1) If the original collection is ordered, concatenating the sub-collections returned by halve must create a collection that is equivalent to the original collection. More precisely,
isequal(
    vec(collect(collection)),
    vcat(vec(collect(left)), vec(collect(right))),
)
must hold.

--- https://juliafolds.github.io/SplittablesBase.jl/dev/#SplittablesBase.halve
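As a quick check of the law quoted above, a minimal sketch on a plain Vector, assuming the generic halve from SplittablesBase.jl handles it:

```julia
using SplittablesBase: halve

collection = collect(1:10)
left, right = halve(collection)

# The "vcat law": gluing the two halves back together must reproduce
# the original collection, element for element and in order.
law_holds = isequal(
    vec(collect(collection)),
    vcat(vec(collect(left)), vec(collect(right))),
)
```

A halve for a table type would need to pass the same check, with rows as the items produced by iterate.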

What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?

My intention is to keep it very minimal, although I have to put the implementations for Base there. It currently also contains the testing code. However, the public API for that testing code is the shim package SplittablesTesting, so I can move it out of SplittablesBase.jl at any point without introducing breaking changes.

I think it's almost 1.0-ready, but there is one optional API, amount (https://github.com/JuliaFolds/SplittablesBase.jl/issues/31), whose specification I want to clarify before 1.0.

If you want to postpone merging this at least until SplittablesBase.jl hits 1.0, I think that's a very reasonable decision. I can extract this PR into a separate package, SplittableTables.jl, to make this work (by touching the internals of Tables.jl a bit). But it'd be nice if we could tweak isequal as in this PR, since that is impossible to do outside Tables.jl without serious type piracy.

Aug 20 '20 06:08 tkf

This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?

Dec 04 '21 02:12 mattwigway

If it is already an array, the generic fallback in SplittablesBase.jl covers it already.

I'm not sure this is true, given halve doesn't work with DataFrameRows (which is an AbstractVector).

This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?

I think SplittablesBase.jl is fine, given it's a very small dependency.

cc @quinnj and @MasonProtter -- being able to use JuliaFolds with Tables and DataFrames would be awesome.

Jun 23 '23 01:06 ParadaCarleton

Add parallel reduction supports for RowIterator and NamedTupleIterator
