
Add flatten to DataAPI.jl

Open bkamins opened this issue 3 years ago • 6 comments

Both SplitApplyCombine.jl and DataFrames.jl export flatten. I would add it to DataAPI.jl. The question is what docstring it should have. Maybe something like:

Flatten a collection of collections into a single collection.

Is that enough?

@andyferris - after this is established maybe you could add DataAPI.jl to SplitApplyCombine.jl as a dependency and make innerjoin and flatten implement this interface? Then SplitApplyCombine.jl and DataFrames.jl could be used together more easily.

bkamins, Jul 04 '21 19:07

Cool - this is an interesting package. I can see how this could remove friction for users.

Just a thought - for functions that are widely useful, are present in other languages' standard libraries, and have an unambiguous definition for Vector (like mapmany and flatten), could we first attempt to put them into Base?

andyferris, Jul 04 '21 23:07

I think it might be possible, but adding things to Julia Base has recently been restricted a lot. Also, even if we added them, they would most likely not make it into the next Julia LTS, which means we would have to wait several years before we could be sure everyone has them in Julia Base. I think the benefit of DataAPI.jl is that it allows for a much quicker development cycle, as we would only add e.g.:

function flatten end

here, so we do not have to promise any specific API (beyond specifying the general meaning of the function).
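A minimal sketch of how such a stub could be shared, using three toy modules in place of real packages. The names ToyAPI, ToyCollections, ToyTables and the Table type are hypothetical and only stand in for DataAPI.jl and two packages that extend it; the method bodies are illustrative, not the actual implementations.

```julia
# Hypothetical stand-in for the interface package: it owns the name only.
module ToyAPI
    "Flatten a collection of collections into a single collection."
    function flatten end   # zero-method function; no concrete API is promised
end

# Hypothetical package that adds a method for nested vectors.
module ToyCollections
    import ..ToyAPI: flatten
    flatten(x::AbstractVector{<:AbstractVector}) = reduce(vcat, x)
end

# Hypothetical package that adds a method with a different signature.
module ToyTables
    import ..ToyAPI: flatten
    struct Table
        columns::Dict{Symbol,Vector{Vector{Int}}}
    end
    # Flatten one nested column of the toy table.
    flatten(t::Table, col::Symbol) = reduce(vcat, t.columns[col])
end

using .ToyAPI: flatten

# Both methods live on the same function, so there is no name clash.
a = flatten([[1, 2], [3]])
b = flatten(ToyTables.Table(Dict(:x => [[1], [2, 3]])), :x)
```

Because both toy packages extend the same function object, a user can load them together and call flatten without qualification; dispatch picks the method from the argument types.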

bkamins, Jul 05 '21 06:07

Am I right that SplitApplyCombine.flatten(x) is equivalent to collect(Iterators.flatten(x))?
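For reference, the Base side of that comparison can be sketched as follows. Only Base functions are used; whether SplitApplyCombine.flatten returns exactly the same result is the open question here, so the snippet does not assume it.

```julia
nested = [[1, 2], [3], Int[]]

# Iterators.flatten is lazy: it returns an iterator over the inner elements.
lazy = Iterators.flatten(nested)

# collect materializes the iterator into a flat Vector.
flat = collect(lazy)   # [1, 2, 3]
```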

DataFrames.flatten(df, cols) is a bit different, I would say. In particular, if we consider data frames as collections of rows, SplitApplyCombine.flatten(df) should return a (flat) collection of all cells in df. The flatten(df, cols) method doesn't fit very well into that approach, though it's not incompatible either.
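A rough illustration of the "collection of rows" reading, using only Base. The table is modelled as a vector of NamedTuple rows, which is an assumption made for this sketch and not how DataFrames.jl stores data.

```julia
# Toy table: each row is a NamedTuple, and iterating a NamedTuple yields its values.
rows = [(a = 1, b = 2), (a = 3, b = 4)]

# Flattening the collection of rows yields every cell, row by row.
cells = collect(Iterators.flatten(rows))   # [1, 2, 3, 4]
```

Under this reading the column names are lost and the result is a plain vector of cells, which is a different operation from expanding selected columns while keeping the table shape.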

nalimilan, Jul 24 '21 14:07

Indeed. The issue is how to avoid name clashes when DataFrames.jl and SplitApplyCombine.jl are both loaded in a session (which is relatively common in advanced usage scenarios). What would you do in such a case?
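For context, the clash can be sketched with two toy modules that each export their own flatten. ClashA and ClashB are hypothetical names standing in for the two packages, and the method bodies are placeholders.

```julia
module ClashA
    export flatten
    flatten(x) = collect(Iterators.flatten(x))
end

module ClashB
    export flatten
    flatten(x, col::Symbol) = collect(Iterators.flatten(getproperty(x, col)))
end

using .ClashA, .ClashB

# Both modules export a different function called `flatten`, so an unqualified
# `flatten` is ambiguous in this scope. Qualified calls still work:
r1 = ClashA.flatten([[1, 2], [3]])
r2 = ClashB.flatten((tags = [["a"], ["b", "c"]],), :tags)
```

If both modules instead extended one shared function, as proposed in this issue, the qualification would not be needed.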

bkamins, Jul 24 '21 14:07

That sounds right.

I assumed you used flatten(gdf) for nested (grouped) data frames?

To me, the cols version “feels” a lot like some flavour of a SplitApplyCombine.mapmany call, which is what flatten is ultimately defined as. You are automatically keeping (broadcasting?) the columns which aren’t mentioned, right?
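The "map each row to several rows, then concatenate" idea can be sketched in plain Base. The helper name expand_tags and the row layout are hypothetical, and this is not the DataFrames.jl or SplitApplyCombine.jl implementation.

```julia
# Toy table as a vector of NamedTuple rows; `tags` holds a nested collection.
rows = [(id = 1, tags = ["a", "b"]), (id = 2, tags = ["c"])]

# Hypothetical helper: emit one output row per element of `tags`,
# repeating the column that is not being flattened (`id`).
expand_tags(rows) = [(id = r.id, tags = t) for r in rows for t in r.tags]

expanded = expand_tags(rows)
# [(id = 1, tags = "a"), (id = 1, tags = "b"), (id = 2, tags = "c")]
```

The nested comprehension plays the role of the "map to many" step: each input row produces as many output rows as it has tags, and the results are concatenated in order.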

andyferris, Jul 24 '21 21:07

> I assumed you used flatten(gdf) for nested (grouped) data frames?

It is just the DataFrame constructor.

> You are automatically keeping (broadcasting?) the columns which aren’t mentioned, right?

Right

bkamins, Jul 24 '21 21:07