InMemoryDatasets.jl
Flattening of strings
I wonder if the result of the flatten function in these cases is the most expected one. Are there any contraindications to (or is this notoriously preferable rather than) treating strings (even empty ones) as scalars in the context of the flatten function?
julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "", "b", ["a","c"]])
4×2 Dataset
Row │ A B
│ identity identity
│ Int64? Any
─────┼──────────────────────
1 │ 1 ["a", "b"]
2 │ 2
3 │ 3 b
4 │ 4 ["a", "c"]
julia> flatten(df, :B)
5×2 Dataset
Row │ A B
│ identity identity
│ Int64? Any
─────┼────────────────────
1 │ 1 a
2 │ 1 b
3 │ 3 b
4 │ 4 a
5 │ 4 c
julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "pippo", "b", ["a","c"]])
4×2 Dataset
Row │ A B
│ identity identity
│ Int64? Any
─────┼──────────────────────
1 │ 1 ["a", "b"]
2 │ 2 pippo
3 │ 3 b
4 │ 4 ["a", "c"]
julia> flatten(df, :B)
10×2 Dataset
Row │ A B
│ identity identity
│ Int64? Any
─────┼────────────────────
1 │ 1 a
2 │ 1 b
3 │ 2 p
4 │ 2 i
5 │ 2 p
6 │ 2 p
7 │ 2 o
8 │ 3 b
9 │ 4 a
10 │ 4 c
Is it possible (or useful) to support mapformats in flatten!? It is very useful for the example from @sprmnt21
Are there any contraindications to (or is this notoriously preferable rather than) treating strings (even empty ones) as scalars in the context of the flatten function?
This is Julia's behaviour; however, I don't like it. Maybe adding a keyword argument would be a good idea?
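To make the behaviour concrete, here is a minimal sketch in plain Base Julia. It assumes nothing about how IMD implements flatten: Iterators.flatten only stands in for the row expansion, and wrap_scalar is a hypothetical helper, not part of any package.

```julia
# A String is an iterable collection of characters with a defined length.
col = Any[["a", "b"], "pippo", "", ["a", "c"]]

# Number of elements each cell would expand into.
lengths = [length(x) for x in col]

# Flattening by iteration splits "pippo" into Chars and drops "" entirely.
flat = collect(Iterators.flatten(col))

# Hypothetical helper: wrap strings so each one counts as a single element.
wrap_scalar(x) = x isa AbstractString ? [x] : x
flat_scalar = collect(Iterators.flatten(wrap_scalar.(col)))
```

With the wrapper, both "pippo" and the empty string survive as single values, which is what treating strings as scalars would mean here.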
In the long term, I am interested in having a fixed-width String type (probably a better implementation of Characters) for InMemoryDatasets that treats an empty String as missing, since I think zero-length and/or empty strings should be treated as missing values in a data analysis workflow.
Is it possible (or useful) to support mapformats in flatten!? It is very useful for the example from @sprmnt21
I moved this post to #57, so it is easier to track.
On second thoughts, I classify this as a bug, ~~and a fix is coming soon~~.
Suppose that, somehow, you have come to have a dataset like this:
7×3 Dataset
Row │ id outcome sds
│ identity identity identity
│ Int64? Bool? Dataset?
─────┼─────────────────────────────────
1 │ 1 false 3×3 Dataset
2 │ 1 true 1×3 Dataset
3 │ 1 false 1×3 Dataset
4 │ 2 false 1×3 Dataset
5 │ 2 true 1×3 Dataset
6 │ 2 false 1×3 Dataset
7 │ 3 true 3×3 Dataset
Are there any contraindications to the flatten function acting on the column sds and expanding the rows of the subtables (possibly renaming the columns to avoid conflicts)?
PS
I got the dataset with nested tables in the following way:
using InMemoryDatasets
using Dates  # provides Date, used for the date column below
ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
"2019-04-29", "2019-05-10", "2019-03-20",
"2019-04-22", "2019-05-04", "2019-11-01",
"2019-11-10", "2019-12-12"]),
outcome = [false, false, false, true, false, false,
true, false, true, true, true])
gb=gatherby(ds, [1, 3], isgathered = true)
cgb1 = combine(gb, ("id",2,:outcome) => ((x...)-> Dataset(; zip([:a,:b,:c], x) ...))=>:sds)
Are there any contraindications to the flatten function acting on the column sds and expanding the rows of the subtables (possibly renaming the columns to avoid conflicts)?
A few remarks:
- IMD has a new function, eachgroup, which can be used to iterate over grouped data sets, and in similar situations using eachgroup is the recommended approach.
- flatten/flatten! works based on length, and length is not defined for data sets. Thus, I think such a functionality should be placed in a new function. (?)
- To achieve what you are looking for, e.g. in this case, you may use append!
On second thoughts, I classify this as a bug, ~~and a fix is coming soon~~.
Originally, I thought we could use a separate path for empty collections; however, this creates other sorts of problems. E.g. if we have an Int[] value, keeping it as Int[] is not consistent because it is not flattened properly (?)
I think we should leave it as a quirk of the package (?)
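The reason empty collections vanish can be sketched in Base Julia, assuming (as stated earlier in the thread) that flattening repeats each row once per element of its cell. The names ids, cells, flat_ids and flat_cells are illustrative only and this is not IMD's code.

```julia
ids   = [1, 2, 3]
cells = [[10, 20], Int[], [30]]

# Each id is repeated once per element of its cell,
# so a zero-length cell produces no output row at all.
flat_ids   = [id for (id, c) in zip(ids, cells) for _ in c]
flat_cells = collect(Iterators.flatten(cells))
```

Here id 2 disappears from the result, just as row 2 does in the empty-string example at the top of the thread.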
Hi there,
does flattening Int[] as nothing solve this problem?
does flattening Int[] as nothing solve this problem?
Probably not, since dealing with nothing is not easy. IMD handles missing efficiently in many functions; however, nothing would be inconvenient.
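The difference is already visible in Base Julia, independently of IMD. A small sketch (variable names are illustrative):

```julia
# missing propagates through arithmetic and can be skipped explicitly.
x = [1, missing, 3]
total      = sum(skipmissing(x))
propagated = missing + 1

# nothing has no arithmetic defined, so the same reduction throws.
y = Any[1, nothing, 3]
threw_method_error = try
    sum(y)
    false
catch err
    err isa MethodError
end
```

The column type also stays narrow with missing: eltype(x) is Union{Missing, Int64}, whereas y had to be declared as a vector of Any here.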
I don't know what Int[] is exactly / formally, other than thinking of it as an empty vector. But if it is left as it is, could it give rise to problems?
A different hypothesis, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.
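As a rough illustration of this hypothesis, the substitution could be done as a preprocessing step in Base Julia. empty_to_missing is a hypothetical helper, not an IMD function, and it inherits the risk mentioned above: an empty object and a missing value become indistinguishable afterwards.

```julia
# Hypothetical helper: anything with a defined length equal to 0 becomes missing.
empty_to_missing(x) = (applicable(length, x) && length(x) == 0) ? missing : x

cells   = Any[["a", "b"], "", Int[], "b", 3]
cleaned = map(empty_to_missing, cells)
```

Note that numbers have length 1 in Julia, so the value 3 passes through unchanged, while both "" and Int[] are replaced.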
I don't know what Int[] is exactly / formally, other than thinking of it as an empty vector. But if it is left as it is, could it give rise to problems?
Leaving Int[] as it is has two problems: a) it is not consistent with the flattening operation, and b) it makes subsequent operations on the output data set inefficient (e.g. if flattening changes everything to Int and just one observation remains as Int[], the element type of the whole processed column is affected).
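Point b) can be checked with plain vectors in Base Julia; the sketch below says nothing about IMD's internal column storage and only shows how one leftover Int[] changes the element type.

```julia
# All cells flattened to Int: the column keeps a concrete element type.
all_ints = [1, 2, 3]

# One cell left as Int[]: the element type widens to Any.
one_leftover = [1, 2, Int[]]

# For comparison, missing only widens the type to a small Union.
with_missing = [1, 2, missing]

types = (eltype(all_ints), eltype(one_leftover), eltype(with_missing))
```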
A different hypothesis, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.
I am not sure this is the right way to handle it - an empty object is not equivalent to missing (?) BTW, #57 provides a convenient way to do this.