DataFrames.jl
a new method of the flatten function in DataFrames
What do you think of the utility of adding a method of the flatten function to DataFrames that expands a data frame like df2 below into the form dfexp?
using CSV, DataFrames

df = DataFrame(a=1:3)
df1 = DataFrame(b=11:15)
CSV.write("tmp1.csv", df)
CSV.write("tmp2.csv", df)
CSV.write("tmp3.csv", df1)
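For completeness, df2 itself is never constructed in the post. A plausible construction (my guess, not the original poster's code) reads each file back into a :subdf column:

```julia
using CSV, DataFrames

# Setup from the post, repeated so this snippet is self-contained:
CSV.write("tmp1.csv", DataFrame(a=1:3))
CSV.write("tmp2.csv", DataFrame(a=1:3))
CSV.write("tmp3.csv", DataFrame(b=11:15))

# Plausible construction of df2: one row per file, with each sub-table
# read back into the :subdf column as a DataFrame.
files = ["tmp1.csv", "tmp2.csv", "tmp3.csv"]
df2 = DataFrame(files = files,
                other_cols = 1:3,
                subdf = [CSV.read(f, DataFrame) for f in files])
```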
julia> df2
3×3 DataFrame
 Row │ files     other_cols  subdf
     │ String    Int64       DataFrame
─────┼─────────────────────────────────────
   1 │ tmp1.csv           1  3×1 DataFrame
   2 │ tmp2.csv           2  3×1 DataFrame
   3 │ tmp3.csv           3  5×1 DataFrame
julia> dfexp = flatten(df2, :subdf)
11×4 DataFrame
 Row │ files     other_cols  a        b
     │ String    Int64       Int64?   Int64?
─────┼────────────────────────────────────────
   1 │ tmp1.csv           1        1  missing
   2 │ tmp1.csv           1        2  missing
   3 │ tmp1.csv           1        3  missing
   4 │ tmp2.csv           2        1  missing
   5 │ tmp2.csv           2        2  missing
   6 │ tmp2.csv           2        3  missing
   7 │ tmp3.csv           3  missing       11
   8 │ tmp3.csv           3  missing       12
   9 │ tmp3.csv           3  missing       13
  10 │ tmp3.csv           3  missing       14
  11 │ tmp3.csv           3  missing       15
Here is a function that does what is described, just to give an idea of what is required, not to suggest how it should be implemented:
mapreduce(r->crossjoin(DataFrame(r[Not(:subdf)]),r.subdf), (x, y) -> vcat(x, y; cols = :union), eachrow(df2))
As commented on Slack, the operation you request is vcat. The only difference from the standard vcat is that vcat currently allows you to create only one indicator column, while you want more than one. We could consider extending the source keyword API to allow what you ask for.
Let us wait for opinions.
Thanks for your attention. Another way of looking at it could be the following: think of the starting table df2 as if it were the result of a groupby, and the table dfexp as the result of some kind of un-group.
But under that view all data frames in the :subdf column would have to have exactly the same column names and eltypes, which is clearly not the case in your example. That is why I point to vcat: the crucial difference between vcat and un-group is that vcat allows you to decide how to handle cases where column names or eltypes do not match.
Additionally, I think that allowing ByRow functions in combine to re-size would solve part of this problem.
It would not work if the data frames have different keys, and you need to convert to Tables.columntable (i.e. a NamedTuple) because data frames don't support keys. But other than that,
combine(df, :c => ByRow(Tables.columntable) => AsTable)
would maybe work.
I'm not sure this is desirable though. It might not be consistent with how combine actually works / should be thought about.
This is what I also thought but I have exactly the same reservations.
Now I have realized that, assuming the columns match in all data frames, the following works, provided the non-dataframe columns form unique keys (which I guess is expected):
julia> df = DataFrame(x=[1,2,3], y=[DataFrame(p=1:2,q=11:12), DataFrame(p=3:4,q=13:14), DataFrame(p=5:6,q=15:16)])
3×2 DataFrame
 Row │ x      y
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×2 DataFrame
   2 │     2  2×2 DataFrame
   3 │     3  2×2 DataFrame
julia> combine(x -> only(x).y, groupby(df, Not(:y)))
6×3 DataFrame
 Row │ x      p      q
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1     11
   2 │     1      2     12
   3 │     2      3     13
   4 │     2      4     14
   5 │     3      5     15
   6 │     3      6     16
and if it is not unique:
combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))
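To make the non-unique case concrete, here is a small worked variant (my own example data, not from the thread) where the key column :x repeats, so `only` would throw and `reduce(vcat, ...)` is needed:

```julia
using DataFrames

# A repeated key in :x: the first two rows belong to the same group.
df = DataFrame(x = [1, 1, 2],
               y = [DataFrame(p = 1:2), DataFrame(p = 3:4), DataFrame(p = 5:6)])

# Concatenate all sub-data-frames within each group of the key columns.
res = combine(x -> reduce(vcat, x.y), groupby(df, Not(:y)))
```

The group x = 1 contributes four rows (p = 1:4) and the group x = 2 contributes two (p = 5:6).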
Here it is. This solution demonstrates the potential of DataFrames to also handle situations for which it was perhaps not specifically designed.
Would Tables.dictrowtable meet your requirements?
Maybe. But you should help me figure out how to fit it to this case.
This is what I can do:
transform(df2, :sdf => Tables.dictrowtable(c -> mapreduce(x -> copy.(eachrow(x)), vcat, c)) => AsTable)
but I get
LoadError: ArgumentError: no default Tables.rows implementation for type: var"#83#85"
I meant this:
julia> df = DataFrame(id=1:2, a=[DataFrame(x=1:2), DataFrame(y=3:4)])
2×2 DataFrame
 Row │ id     a
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  2×1 DataFrame
   2 │     2  2×1 DataFrame
julia> select(df, :id, :a => Tables.dictrowtable => AsTable)
2×3 DataFrame
 Row │ id     x        y
     │ Int64  Array…?  Array…?
─────┼─────────────────────────
   1 │     1  [1, 2]   missing
   2 │     2  missing  [3, 4]
and then ideally you would do:
flatten(select(df, :id, :a => Tables.dictrowtable => AsTable), [:x, :y])
but now I see that I need to fix flatten to add support for missing. I will open a PR for this.
Ah - this is not as easy as I have thought.
@nalimilan - what do you think? Currently in flatten, when we see missing we error, as it is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements, flexibly adjusting to the remaining columns that define the length. This is a bit hard to implement (doable, of course), but the question is whether this would be an intuitive behavior.
As a second thought - maybe adding something like unnest from dplyr would be useful?
Another way to get the result using only "table functions": this way flatten is applied before dictrowtable, so flatten doesn't have to handle missing.
Perhaps :-)
transform(flatten(transform(df2, :sdf => ByRow(Tables.rowtable), renamecols=false), :sdf), :sdf => Tables.dictrowtable => AsTable)
The fastest form I have found, at least for this example with a few rows, is the following (although less elegant):
vcat([[repeat(df2[r:r,Not(:sdf)],nrow(df2.sdf[r])) df2[r,:sdf]] for r in 1:nrow(df2)]..., cols=:union)
> @nalimilan - what do you think. Currently in flatten when we see missing we error as it is not iterable. To make the above examples work we would need to start treating missing as having "any" number of elements - flexibly adjusting to the remaining columns that define length. This is a bit hard to implement (of course doable), but the question is if this would be an intuitive behavior.
Wouldn't it make sense to treat any values that do not support length as scalars, and broadcast/repeat them as needed? It would have been logical to use Base.Broadcast.broadcastable for that; unfortunately we haven't done so, and the result differs notably for strings (though we could special-case them as a temporary solution).
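To illustrate the string mismatch mentioned above with plain Base (no DataFrames needed): iteration sees a string as a collection of characters, while broadcasting wraps it as a scalar.

```julia
s = "ab"

# As an iterable, a string is a collection of characters:
@assert collect(s) == ['a', 'b']

# Under broadcasting it is treated as a scalar (wrapped in a Ref):
@assert Base.Broadcast.broadcastable(s) isa Base.RefValue{String}

# So it is repeated against the other argument, not flattened:
@assert s .* ["x", "y"] == ["abx", "aby"]
```

This is exactly why reusing broadcastable for flatten would repeat strings rather than splitting them into characters.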
Regarding unnest, would https://github.com/JuliaData/DataFrames.jl/issues/3005 more or less implement the same features?
We did not use broadcasting because 1-element collections are automatically resized with broadcasting, which can lead to hard to catch bugs in this context.
What we could do (following your proposal):
- add a kwarg strict::Bool=true, which by default does what we have now;
- if strict=false then:
  - all values that do not support length are repeated as needed (and if in a row no values support length then the length is taken to be 1, i.e. a no-op);
  - strings are repeated instead of being flattened.

What do you think?
Ah right, broadcasting collections would be risky. But maybe we can repeat scalars (including strings) even without requiring strict=true? This sounds quite safe.
We could. It would be mildly breaking but acceptable. We could then just make strict::Bool=false the default, and if someone wants safety one can pass strict=true.
I am closing this issue as it is handled in #3258, so let us keep the discussion there (somehow I am not able to link this issue to that PR).
Now tracked in https://github.com/JuliaData/DataFrames.jl/issues/3116
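For readers landing here later: if I read the linked PR correctly, recent DataFrames.jl releases expose this via a scalar keyword on flatten, so the dictrowtable pipeline above no longer errors on missing. A sketch (assuming that keyword is available in your DataFrames.jl version):

```julia
using DataFrames

# Values of the type given to `scalar` are treated as length-adjusting
# scalars and repeated to match the other flattened columns, instead of
# erroring because missing is not iterable.
df = DataFrame(id = 1:2, x = [[1, 2], missing], y = [missing, [3, 4]])
res = flatten(df, [:x, :y]; scalar = Missing)
```

Each row expands to the length of its iterable column, with the missing entry repeated alongside it.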