Add parallel reduction support for RowIterator and NamedTupleIterator
This PR implements the SplittablesBase.jl interface function halve on RowIterator and NamedTupleIterator. This lets us use parallel reductions built on top of SplittablesBase.jl, such as those in Transducers.jl, ThreadsX.jl, and FLoops.jl:
using Tables
table = Tables.rows((key = 1:1000, value = randn(1000)))
using FLoops
using UnPack: @unpack
@floop for row in table
    @unpack key, value = row
    @reduce() do (kmax; key), (vmax; value)
        if vmax < value
            vmax = value
            kmax = key
        end
    end
end
@show kmax vmax
A tricky part of this PR is that, since SplittablesTesting.test_ordered uses isequal to compare items (rows), I needed to relax isequal to ignore the storage type of columns. The difference is that
isequal(
    first(Tables.rows((a = view([0], 1:1),))),
    first(Tables.rows((a = [0],))),
)
is false before this PR and true after this PR. I think it makes sense for ColumnsRow objects to be compared as if they were lowered to NamedTuples. This is also consistent with the fact that equality on arrays ignores the array type:
julia> [0] == view([0], 1:1)
true
What do you think?
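To make the proposed semantics concrete, here is a minimal sketch using only Base. SketchRow and lower are hypothetical names invented for illustration; this is not the Tables.jl ColumnsRow implementation, only the idea of comparing rows as if they were lowered to NamedTuples.

```julia
# Hypothetical row type: a NamedTuple of columns plus a row index.
# This is a stand-in for illustration, not Tables.ColumnsRow.
struct SketchRow{C<:NamedTuple}
    columns::C
    index::Int
end

# Lower a row to a plain NamedTuple of its values.
lower(r::SketchRow) = map(col -> col[r.index], r.columns)

# Compare rows by value, ignoring the storage type of the columns.
Base.isequal(a::SketchRow, b::SketchRow) = isequal(lower(a), lower(b))
# Keep hash consistent with isequal.
Base.hash(r::SketchRow, h::UInt) = hash(lower(r), h)

row_view = SketchRow((a = view([0], 1:1),), 1)  # column stored as a view
row_vec = SketchRow((a = [0],), 1)              # column stored as a Vector
same_row = isequal(row_view, row_vec)
```

Here the two rows have different column types but the same lowered values, so same_row comes out true.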
Codecov Report
Merging #187 into master will increase coverage by 0.09%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #187 +/- ##
==========================================
+ Coverage 96.71% 96.80% +0.09%
==========================================
Files 6 6
Lines 456 469 +13
==========================================
+ Hits 441 454 +13
Misses 15 15
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/Tables.jl | 92.92% <ø> (ø) | |
| src/fallbacks.jl | 97.69% <100.00%> (+0.21%) | :arrow_up: |
| src/namedtuples.jl | 98.27% <100.00%> (+0.06%) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4a875bf...ecab1ac.
Can you share a bit more on the motivation here? Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages. Questions that pop up in my mind:
- What are some examples of what you could do w/ the implementation here?
- Why only implemented for NamedTupleIterator and not rows/columns more generally?
- What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?
Hi, thanks for the response and sorry for this late reply.
Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages.
Yes, I understand this and I should've clarified what SplittablesBase.jl is.
Essentially, I am hoping that halve becomes fundamental infrastructure for parallel processing in Julia, in the same sense that iterate is currently fundamental infrastructure for sequential processing. The goal is to make halve+iterate (or halve+foldl) the interface between data structures (tables, arrays, dicts, sets, strings, ...) and parallel processing functions (map, reduce, group-by, join, ...).
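As a rough illustration of the halve+foldl idea, here is a sketch using only Base. halve_sketch and reduce_by_halving are hypothetical names; this is not the SplittablesBase.jl or Transducers.jl code, and it assumes a non-empty vector and an associative op.

```julia
# Hypothetical stand-in for halve on a vector: split into two views.
function halve_sketch(xs::AbstractVector)
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    left = @view xs[firstindex(xs):mid]
    right = @view xs[mid+1:lastindex(xs)]
    return left, right
end

# Divide-and-conquer reduction: halve until the chunk is small,
# fold each chunk sequentially, and combine results in order.
function reduce_by_halving(op, xs::AbstractVector; basesize = 4)
    length(xs) <= basesize && return foldl(op, xs)
    left, right = halve_sketch(xs)
    # Process the right half in another task while this task handles the left.
    task = Threads.@spawn reduce_by_halving(op, right; basesize = basesize)
    return op(reduce_by_halving(op, left; basesize = basesize), fetch(task))
end

total = reduce_by_halving(+, collect(1:100))
joined = reduce_by_halving(*, string.(1:10))  # string concatenation keeps order
```

The data structure only needs to know how to split itself and how to be folded sequentially; the scheduling lives entirely in the reduction function.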
I should probably open an RFC in JuliaLang/julia, but I've been a bit hesitant to do so since I don't feel this interface has been tested outside my own packages. I thought of this PR as a step toward accumulating such experience.
- What are some examples of what you could do w/ the implementation here?
The example in the OP with FLoops.jl is one thing. We'd also be able to use ThreadsX.jl with this. I think ThreadsX.jl + OnlineStats.jl integration would be appealing to Tables.jl users. Underneath, they all boil down to Transducers.foldxt, which uses halve. For example, you can compute the min of the row-wise max and the max of the row-wise min over columns a, b and c with foldxt(ProductRF(min, max), table |> Map(r -> (max(r.a, r.b, r.c), min(r.a, r.b, r.c)))) in one go (OK, I have no idea when you'd need this particular function, but it's a fun example).
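For readers unfamiliar with the expression above, this is a sequential, Base-only sketch of what it computes. It does not use Transducers.jl; rows and combine are hypothetical names, and the row data is made up for illustration.

```julia
# Made-up rows with columns a, b and c.
rows = [(a = 1, b = 5, c = 3), (a = 4, b = 2, c = 6), (a = 0, b = 7, c = 9)]

# Per row: (largest value, smallest value).
pairs = map(r -> (max(r.a, r.b, r.c), min(r.a, r.b, r.c)), rows)

# Combine two partial results: min over the first slot, max over the second.
# This plays the role that ProductRF(min, max) plays in the expression above.
combine(x, y) = (min(x[1], y[1]), max(x[2], y[2]))

result = reduce(combine, pairs)
```

With these rows, the per-row pairs are (5, 1), (6, 2) and (9, 0), so result is (5, 2). Because combine is associative, the same answer can be obtained by reducing the two halves of the rows separately and then combining them.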
- Why only implemented for NamedTupleIterator and not rows/columns more generally?
If it is already an array, the generic fallback in SplittablesBase.jl covers it, so I don't need to add a specific implementation for RowTable. I can't provide an implementation for AbstractColumns (or column tables in general) because I'd like to keep halve and iterate consistent, in the sense that each halve implementation satisfies what I call the "vcat law":
(1) If the original collection is ordered, concatenating the sub-collections returned by halve must create a collection that is equivalent to the original collection. More precisely, isequal(vec(collect(collection)), vcat(vec(collect(left)), vec(collect(right)))) must hold.
--- https://juliafolds.github.io/SplittablesBase.jl/dev/#SplittablesBase.halve
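The law quoted above can be checked mechanically. The following is a sketch using only Base: halve_sketch is a hypothetical stand-in for halve on a vector (not the library's implementation), and satisfies_vcat_law simply spells out the quoted isequal condition.

```julia
# Hypothetical stand-in for halve on a vector: split into two views.
function halve_sketch(xs::AbstractVector)
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    left = @view xs[firstindex(xs):mid]
    right = @view xs[mid+1:lastindex(xs)]
    return left, right
end

# The "vcat law": concatenating the halves reproduces the original collection.
satisfies_vcat_law(collection, left, right) = isequal(
    vec(collect(collection)),
    vcat(vec(collect(left)), vec(collect(right))),
)

# A small row-table-like vector of NamedTuples.
rows = [(key = k, value = 10k) for k in 1:5]
left, right = halve_sketch(rows)
law_holds = satisfies_vcat_law(rows, left, right)
```

The point for column tables is that iterating an AbstractColumns object yields columns rather than rows, so a halve that splits by rows would not concatenate back to what iterate produces.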
- What kind of package is SplittablesBase.jl? Minimal? Lots of changes? What kind of commitment to stability there?
My intention is to keep it very minimal, although I have to put the implementations for Base there. It currently also contains the code for testing. However, the public API for that code is the shim package SplittablesTesting, so I can remove it at any point without introducing breaking changes.
I think it's almost 1.0-ready, but there is the specification of one optional API, amount (https://github.com/JuliaFolds/SplittablesBase.jl/issues/31), that I want to clarify before 1.0.
If you want to postpone merging this at least until SplittablesBase.jl hits 1.0, I think that's a very reasonable decision. I can extract this PR into a separate package, SplittableTables.jl, which would work by touching the internals of Tables.jl a bit. But it'd be nice if we could tweak isequal as in this PR, as this is impossible to do outside Tables.jl without serious type piracy.
This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?
If it is already an array, the generic fallback in SplittablesBase.jl covers it already.
I'm not sure this is true, given halve doesn't work with DataFrameRows (which is an AbstractVector).
This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package SplittableTables.jl that implements halve() for table iterators?
I think SplittablesBase.jl is fine, given it's a very small dependency.
cc @quinnj and @MasonProtter -- being able to use JuliaFolds with Tables and DataFrames would be awesome.