DataAPI.jl
Add flatten to DataAPI.jl
Both SplitApplyCombine.jl and DataFrames.jl export flatten. I would add it to DataAPI.jl. The question is what docstring it should have. Maybe something like:
Flatten a collection of collections into a single collection.
Is that enough?
@andyferris - after this is established maybe you could add DataAPI.jl to SplitApplyCombine.jl as a dependency and make innerjoin and flatten implement this interface? Then SplitApplyCombine.jl and DataFrames.jl could be used together more easily.
Cool - this is an interesting package. I can see how this could remove friction for users.
Just a thought - for functions that are widely useful, are present in other languages' standard libraries, and have an unambiguous definition for Vector (like mapmany and flatten), could we first attempt to put them into Base?
I think it might be possible, but adding things to Julia Base has been restricted a lot recently. Also, even if we added them, they would most likely not go into the next Julia LTS, which means that we would wait several years to be sure everyone has them in Julia Base. I think the benefit of DataAPI.jl is that it allows for a much quicker development cycle, as we would only add e.g.:
function flatten end
here so we do not have to promise any specific API (except for specification of a general meaning of the function).
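To illustrate why a bare declaration is enough, here is a minimal sketch of the pattern. All module and type names (MiniDataAPI, PkgA, PkgB, Nested) are hypothetical stand-ins, not the real packages: a shared module declares the function without methods, and each package adds its own methods to that same function, so loading both packages does not produce a name clash.

```julia
# Hypothetical stand-in for DataAPI.jl: declares the generic function only.
module MiniDataAPI
function flatten end   # no methods, just a shared name to extend
end

# Hypothetical package A: flattens a vector of vectors.
module PkgA
import ..MiniDataAPI: flatten
export flatten
flatten(x::AbstractVector{<:AbstractVector}) = reduce(vcat, x)
end

# Hypothetical package B: adds a method for its own type.
module PkgB
import ..MiniDataAPI: flatten
export flatten
struct Nested
    parts::Vector{Vector{Int}}
end
flatten(n::Nested) = reduce(vcat, n.parts)
end

# Both modules export the *same* function object, so there is no conflict.
using .PkgA, .PkgB

a = flatten([[1, 2], [3]])
b = flatten(PkgB.Nested([[4], [5, 6]]))
```

Because PkgA.flatten and PkgB.flatten are the same generic function, dispatch picks the right method from the argument types.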
Am I right that SplitApplyCombine.flatten(x) is equivalent to collect(Iterators.flatten(x))?
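For reference, this is what the Base expression in the question does on a small nested vector. The sketch uses only Base and makes no claim about SplitApplyCombine.jl itself; the variable names are illustrative.

```julia
# A collection of collections.
nested = [[1, 2], [3, 4], [5]]

# Iterators.flatten is lazy; collect materializes it into a Vector.
flat = collect(Iterators.flatten(nested))
```

Here flat is the one-level-flattened vector [1, 2, 3, 4, 5].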
DataFrames.flatten(df, cols) is a bit different I would say. In particular, if we consider data frames as collections of rows, SplitApplyCombine.flatten(df) should return a (flat) collection of all cells in df. The flatten(df, cols) method doesn't fit very well in that approach, though it's not incompatible either.
Indeed. The issue is to avoid name clashes when DataFrames.jl and SplitApplyCombine.jl are both loaded in a session (which is relatively common in advanced usage scenarios). What would you do in such a case?
That sounds right.
I assumed you used flatten(gdf) for nested (grouped) data frames?
The cols version feels to me a lot like some flavour of a SplitApplyCombine.mapmany call, which is what flatten is ultimately defined as. You are automatically keeping (broadcasting?) the columns which aren't mentioned, right?
> I assumed you used flatten(gdf) for nested (grouped) data frames?
It is just the DataFrame constructor.
> You are automatically keeping (broadcasting?) the columns which aren't mentioned, right?
Right
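
The column-wise behaviour discussed above (expand one column, repeat the others) can be sketched in plain Julia as a "map each row to many rows, then concatenate" operation. The helper flatten_column below is hypothetical and is not part of DataFrames.jl or SplitApplyCombine.jl. It models a table as a vector of NamedTuple rows purely for illustration.

```julia
# Hypothetical helper, assuming a table is a Vector of NamedTuple rows.
function flatten_column(rows::Vector{<:NamedTuple}, col::Symbol)
    # Map each row to many rows: one per element stored in `col`.
    # Fields other than `col` are repeated unchanged in every new row.
    expanded = map(rows) do row
        [merge(row, NamedTuple{(col,)}((v,))) for v in getproperty(row, col)]
    end
    # Concatenate the per-row results into one flat vector of rows.
    return reduce(vcat, expanded)
end

rows = [(a = 1, b = [10, 20]), (a = 2, b = [30])]
flat_rows = flatten_column(rows, :b)
```

With this input, the row with a = 1 appears twice (once per element of b) and the row with a = 2 appears once, which is the "keeping the columns which aren't mentioned" behaviour in miniature.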