
a new method of the flatten function in DataFrames

sprmnt21 opened this issue 2 years ago • 19 comments

What do you think about the utility of having a method of the flatten function in DataFrames that expands a data frame like df2 below into the form dfexp?

using DataFrames, CSV

df = DataFrame(a=1:3)
df1 = DataFrame(b=11:15)

CSV.write("tmp1.csv", df)
CSV.write("tmp2.csv", df)
CSV.write("tmp3.csv", df1)

julia> df2
3×3 DataFrame
 Row │ files     other_cols  subdf                 
     │ String    Int64       DataFrame             
─────┼─────────────────────────────────────        
   1 │ tmp1.csv           1  3×1 DataFrame         
   2 │ tmp2.csv           2  3×1 DataFrame         
   3 │ tmp3.csv           3  5×1 DataFrame         


julia> dfexp=flatten(df2,:subdf)
11×4 DataFrame
 Row │ files     other_cols  a        b            
     │ String    Int64       Int64?   Int64?       
─────┼────────────────────────────────────────     
   1 │ tmp1.csv           1        1  missing      
   2 │ tmp1.csv           1        2  missing      
   3 │ tmp1.csv           1        3  missing      
   4 │ tmp2.csv           2        1  missing      
   5 │ tmp2.csv           2        2  missing      
   6 │ tmp2.csv           2        3  missing      
   7 │ tmp3.csv           3  missing       11      
   8 │ tmp3.csv           3  missing       12      
   9 │ tmp3.csv           3  missing       13      
  10 │ tmp3.csv           3  missing       14      
  11 │ tmp3.csv           3  missing       15    

This is a function that does what is described, just to give an idea of what is required, not to suggest how it should be done:

mapreduce(r->crossjoin(DataFrame(r[Not(:subdf)]),r.subdf), (x, y) -> vcat(x, y; cols = :union), eachrow(df2))

sprmnt21 avatar Sep 24 '21 06:09 sprmnt21

As commented on Slack, the operation you request is vcat. The only difference from the standard vcat is that currently vcat allows you to create only one indicator column, and you want more than one. We could consider extending the source keyword API to allow what you ask for.

Let us wait for opinions.

bkamins avatar Sep 24 '21 07:09 bkamins
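
For reference, the source keyword mentioned above already exists in vcat: it adds a single indicator column, but cannot carry additional per-source values such as other_cols. A minimal sketch, with file names taken from the example above:

dfa = CSV.read("tmp1.csv", DataFrame)
dfb = CSV.read("tmp3.csv", DataFrame)
vcat(dfa, dfb; cols=:union, source=:files => ["tmp1.csv", "tmp3.csv"])
# 8×3 result with columns a, b and files; missing where a column is absent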

Thanks for your attention. Another way of looking at it could be the following: think of the starting table df2 as if it were the result of a groupby, and the table dfexp as the result of some kind of un-group.

sprmnt21 avatar Sep 24 '21 08:09 sprmnt21

But under your thinking all data frames in the :subdf column would have to have exactly the same column names and eltypes. This is clearly not the case in your example. That is why I point to vcat: the crucial difference between vcat and un-group is that vcat allows you to decide how you want to handle cases where column names or eltypes do not match.

bkamins avatar Sep 24 '21 08:09 bkamins

Additionally, I think that allowing ByRow functions in combine to re-size would solve part of this problem.

It would not work if the data frames have different keys, and you need to convert to Tables.columntable (i.e. a NamedTuple) because data frames don't support keys. But other than that,

combine(df, :c => ByRow(Tables.columntable) => AsTable)

would maybe work.

I'm not sure this is desirable though. This might not be consistent with how combine actually works / should be thought about.

pdeffebach avatar Sep 24 '21 13:09 pdeffebach

This is what I also thought but I have exactly the same reservations.

Now I have realized that, assuming the columns match in all the nested data frames and the non-data-frame columns form unique keys (which I guess is expected), the following actually works:

julia> df = DataFrame(x=[1,2,3], y=[DataFrame(p=1:2,q=11:12), DataFrame(p=3:4,q=13:14), DataFrame(p=5:6,q=15:16)])
3×2 DataFrame
 Row │ x      y
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame
   3 │     3  2×2 DataFrame

julia> combine(x -> only(x).y, groupby(df, Not(:y)))
6×3 DataFrame
 Row │ x      p      q
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     1      2     12
   3 │     2      3     13
   4 │     2      4     14
   5 │     3      5     15
   6 │     3      6     16

bkamins avatar Sep 24 '21 14:09 bkamins

and if it is not unique:

combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))

bkamins avatar Sep 24 '21 14:09 bkamins
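
Applied to the original df2, whose nested data frames have different columns, the same pattern presumably needs cols=:union when concatenating (a sketch, not taken from the thread):

combine(x -> reduce(vcat, x.subdf; cols=:union), groupby(df2, Not(:subdf)))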

Here it is. This solution demonstrates the potential of DataFrames to also deal with situations for which it was perhaps not specifically designed.

sprmnt21 avatar Sep 24 '21 20:09 sprmnt21

Would Tables.dictrowtable meet your requirements?

bkamins avatar Feb 20 '22 15:02 bkamins

Maybe. But you should help me figure out how to fit it to the case.

sprmnt21 avatar Feb 20 '22 16:02 sprmnt21

This is what I can do:

transform(df2, :sdf => Tables.dictrowtable(c -> mapreduce(x -> copy.(eachrow(x)), vcat, c)) => AsTable)

but I get: LoadError: ArgumentError: no default Tables.rows implementation for type: var"#83#85"

sprmnt21 avatar Feb 20 '22 17:02 sprmnt21

I meant this:

julia> df = DataFrame(id=1:2, a=[DataFrame(x=1:2), DataFrame(y=3:4)])
2×2 DataFrame
 Row │ id     a
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×1 DataFrame
   2 │     2  2×1 DataFrame

julia> select(df, :id, :a => Tables.dictrowtable => AsTable)
2×3 DataFrame
 Row │ id     x        y
     │ Int64  Array…?  Array…?
─────┼─────────────────────────
   1 │     1  [1, 2]   missing
   2 │     2  missing  [3, 4]

and then ideally you would do:

flatten(select(df, :id, :a => Tables.dictrowtable => AsTable), [:x, :y])

but now I see that I need to fix flatten to add support for missing to it. I will open a PR for this.

bkamins avatar Feb 20 '22 22:02 bkamins

Ah - this is not as easy as I thought.

@nalimilan - what do you think? Currently in flatten, when we see missing we error, as it is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements, flexibly adjusting to the remaining columns that define the length. This is a bit hard to implement (of course doable), but the question is whether this would be intuitive behavior.

bkamins avatar Feb 20 '22 22:02 bkamins

On second thought - maybe adding something like unnest from dplyr would be useful?

bkamins avatar Feb 20 '22 22:02 bkamins

Another way to get the result using only "table functions": this way flatten is used before Tables.dictrowtable, so flatten doesn't have to handle missing. Perhaps :-)

transform(flatten(transform(df2, :sdf => ByRow(Tables.rowtable), renamecols=false), :sdf), :sdf => Tables.dictrowtable => AsTable)
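
The same pipeline broken into steps (a restatement of the one-liner above; it assumes Tables is loaded, as in the earlier comments):

step1 = transform(df2, :sdf => ByRow(Tables.rowtable), renamecols=false)  # nested data frames -> vectors of NamedTuples
step2 = flatten(step1, :sdf)                                              # one NamedTuple per output row
step3 = transform(step2, :sdf => Tables.dictrowtable => AsTable)          # union of keys, missing where a key is absent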

sprmnt21 avatar Feb 20 '22 22:02 sprmnt21

The fastest form I have found, at least for this small example, is the following (although less elegant):

vcat([[repeat(df2[r:r,Not(:sdf)],nrow(df2.sdf[r])) df2[r,:sdf]] for r in 1:nrow(df2)]..., cols=:union)

sprmnt21 avatar Feb 20 '22 23:02 sprmnt21

@nalimilan - what do you think? Currently in flatten, when we see missing we error, as it is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements, flexibly adjusting to the remaining columns that define the length. This is a bit hard to implement (of course doable), but the question is whether this would be intuitive behavior.

Wouldn't it make sense to treat any values that do not support length as scalars, and broadcast/repeat them as needed? It would have been logical to use Base.Broadcast.broadcastable for that; unfortunately we haven't done that, and the result differs notably for strings (though we could special-case them as a temporary solution).

Regarding unnest, would https://github.com/JuliaData/DataFrames.jl/issues/3005 more or less implement the same features?

nalimilan avatar Mar 03 '22 08:03 nalimilan
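
A small illustration of the difference mentioned above: broadcasting treats a string as a scalar, while iterating it (which is what flatten currently does with collection columns) yields its characters:

Base.Broadcast.broadcastable("ab")  # Base.RefValue{String}("ab") -- wrapped as a scalar
collect("ab")                       # ['a', 'b'] -- the elements flatten would expand into rows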

We did not use broadcasting because 1-element collections are automatically resized with broadcasting, which can lead to hard to catch bugs in this context.

What we could do (following your proposal):

  • add kwarg strict::Bool=true, which by default does what we now have;
  • if strict=false then:
    • all values that do not support length are repeated as needed (and if in a row no values support length then the length is taken to be 1 - i.e. no-op)
    • strings are repeated instead of being flattened

What do you think?

bkamins avatar Mar 03 '22 22:03 bkamins
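
To make the proposal concrete, a small sketch of the intended behavior (note that strict is only proposed above; it is not an existing flatten keyword):

df = DataFrame(id=1:2, x=[[1, 2], missing])
# flatten(df, :x)                # current behavior: errors, because missing is not iterable
# flatten(df, :x; strict=false)  # proposed: missing would be repeated; since no other flattened
#                                # column defines a length in that row, it is kept once,
#                                # giving rows (1, 1), (1, 2), (2, missing)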

Ah right, broadcasting collections would be risky. But maybe we could repeat scalars (including strings) by default, without requiring strict=false? This sounds quite safe.

nalimilan avatar Mar 04 '22 13:03 nalimilan

We could. It would be mildly breaking but acceptable. We could then make strict::Bool=false the default, and if someone wants safety they can pass strict=true.

bkamins avatar Mar 04 '22 14:03 bkamins

I am closing this issue as it is handled in #3258, so let us keep the discussion there (somehow I am not able to link this issue to that PR).

bkamins avatar Dec 28 '22 19:12 bkamins

Now tracked in https://github.com/JuliaData/DataFrames.jl/issues/3116

bkamins avatar Feb 05 '23 07:02 bkamins