InMemoryDatasets.jl

Flattening of strings

Open sprmnt21 opened this issue 2 years ago • 11 comments

I wonder whether the result of the flatten function in these cases is the one most users would expect. Are there any contraindications to treating strings (even empty ones) as scalars in the context of the flatten function, or is the current behaviour known to be preferable?

julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "", "b", ["a","c"]])
4×2 Dataset
 Row │ A         B
     │ identity  identity
     │ Int64?    Any
─────┼──────────────────────
   1 │        1  ["a", "b"]
   2 │        2
   3 │        3  b
   4 │        4  ["a", "c"]

julia> flatten(df, :B)
5×2 Dataset
 Row │ A         B        
     │ identity  identity
     │ Int64?    Any
─────┼────────────────────
   1 │        1  a
   2 │        1  b
   3 │        3  b
   4 │        4  a
   5 │        4  c

julia> df = Dataset(A=[1,2,3,4], B=[["a","b"], "pippo", "b", ["a","c"]])
4×2 Dataset
 Row │ A         B
     │ identity  identity
     │ Int64?    Any
─────┼──────────────────────
   1 │        1  ["a", "b"]
   2 │        2  pippo
   3 │        3  b
   4 │        4  ["a", "c"]

julia> flatten(df, :B)
10×2 Dataset
 Row │ A         B        
     │ identity  identity
     │ Int64?    Any
─────┼────────────────────
   1 │        1  a
   2 │        1  b
   3 │        2  p
   4 │        2  i
   5 │        2  p
   6 │        2  p
   7 │        2  o
   8 │        3  b
   9 │        4  a
  10 │        4  c
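
Both outputs follow from how Julia itself treats strings: a String is an iterable collection of characters with a defined length, so a length-based flatten sees "pippo" as five elements and "" as zero. A minimal check in base Julia (no package needed; the variable names are only illustrative):

```julia
# Base Julia only: strings are iterable and have a length,
# which is what a length-based flatten ends up relying on.
s = "pippo"
chars = collect(s)            # Vector{Char}, one entry per character
n_chars = length(s)           # number of characters in "pippo"
n_empty = length("")          # an empty string contributes no elements
n_vec = length(["a", "b"])    # a vector contributes one element per entry

println((n_chars, n_empty, n_vec))
```

This is why row 2 disappears in the first example (zero elements) and turns into five rows in the second.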

sprmnt21 avatar Apr 30 '22 12:04 sprmnt21

Is it possible (or useful) to support mapformats in flatten!? It would be very useful for the example from @sprmnt21.

monopolynomial avatar May 01 '22 01:05 monopolynomial

Are there any contraindications to treating strings (even empty ones) as scalars in the context of the flatten function, or is the current behaviour known to be preferable?

This is Julia's behaviour (strings are iterable collections of characters); however, I don't like it. Maybe adding a keyword argument would be a good idea?
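
As a rough illustration of what such a keyword argument could do, here is a sketch in plain Julia. The function `flatten_pairs` and its keyword `strings_as_scalars` are hypothetical names invented for this example; they are not part of InMemoryDatasets, and the sketch works on two parallel vectors rather than on a Dataset.

```julia
# Hypothetical helper, not part of InMemoryDatasets: flatten two parallel
# vectors, optionally treating strings as scalars instead of collections.
function flatten_pairs(ids::AbstractVector, vals::AbstractVector;
                       strings_as_scalars::Bool = true)
    out_ids = eltype(ids)[]
    out_vals = Any[]
    for (id, v) in zip(ids, vals)
        # Wrap a string in a 1-tuple so it is emitted as a single element.
        items = (strings_as_scalars && v isa AbstractString) ? (v,) : v
        for item in items
            push!(out_ids, id)
            push!(out_vals, item)
        end
    end
    return out_ids, out_vals
end

A = [1, 2, 3, 4]
B = Any[["a", "b"], "", "b", ["a", "c"]]

# Strings kept whole: the empty string in row 2 survives as one row.
ids_scalar, vals_scalar = flatten_pairs(A, B)

# Strings iterated character by character: row 2 yields no rows at all.
ids_chars, vals_chars = flatten_pairs(A, B; strings_as_scalars = false)
```

With the keyword switched off, the sketch mirrors the behaviour reported above, where the row holding the empty string is dropped.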

In the long term, I am interested in having a fixed-width String type (probably a better implementation of Characters) for InMemoryDatasets that treats an empty String as missing, since I think zero-length and/or empty strings should be treated as missing values in a data analysis workflow.

sl-solution avatar May 02 '22 07:05 sl-solution

Is it possible (or useful) to support mapformats in flatten!? It would be very useful for the example from @sprmnt21.

I moved this post to #57, so it is easier to track.

sl-solution avatar May 02 '22 07:05 sl-solution

On second thoughts, I classify this as a bug, ~~and a fix is coming soon~~.

sl-solution avatar May 02 '22 10:05 sl-solution

Suppose that, somehow, you have come to have a dataset like this:

7×3 Dataset
 Row │ id        outcome   sds
     │ identity  identity  identity
     │ Int64?    Bool?     Dataset?
─────┼─────────────────────────────────
   1 │        1     false  3×3 Dataset
   2 │        1      true  1×3 Dataset
   3 │        1     false  1×3 Dataset
   4 │        2     false  1×3 Dataset
   5 │        2      true  1×3 Dataset
   6 │        2     false  1×3 Dataset
   7 │        3      true  3×3 Dataset

Are there any contraindications to having the flatten function act on the sds column, expanding the rows of the sub-tables (and possibly renaming columns to avoid conflicts)?

PS

I got the dataset with nested tables in the following way:

using InMemoryDatasets
using Dates  # provides Date

ds = Dataset(id = [1,1,1,1,1,2,2,2,3,3,3],
             date = Date.(["2019-03-05", "2019-03-12", "2019-04-10",
                           "2019-04-29", "2019-05-10", "2019-03-20",
                           "2019-04-22", "2019-05-04", "2019-11-01",
                           "2019-11-10", "2019-12-12"]),
             outcome = [false, false, false, true, false, false,
                        true, false, true, true, true])

# Group consecutive rows that share the same id and outcome.
gb = gatherby(ds, [1, 3], isgathered = true)

# Pack the three columns of each group into a nested Dataset stored in :sds.
cgb1 = combine(gb, ("id", 2, :outcome) => ((x...) -> Dataset(; zip([:a, :b, :c], x)...)) => :sds)

sprmnt21 avatar May 02 '22 16:05 sprmnt21

Are there any contraindications to having the flatten function act on the sds column, expanding the rows of the sub-tables (and possibly renaming columns to avoid conflicts)?

A few remarks:

  • IMD has a new function, eachgroup, which can be used to iterate over grouped data sets; in similar situations, using eachgroup is the recommended approach.
  • flatten/flatten! works based on length, and length is not defined for data sets. Thus, I think such functionality should be placed in a new function (?)
  • To achieve what you are looking for in this case, you may use append!.
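
The append! idea from the last remark can be sketched in plain Julia. This is only a stand-in: each "sub-table" below is a vector of NamedTuple rows and Base's `append!` on vectors is used, whereas the remark refers to IMD's own append! on data sets. All names and values here are made up for illustration.

```julia
# Base-Julia stand-in for a column of nested tables: each "sub-table" is a
# vector of NamedTuple rows. This is an illustration, not the IMD API.
ids = [1, 1, 2]
subtables = [
    [(a = 1, c = false), (a = 1, c = false)],
    [(a = 1, c = true)],
    [(a = 2, c = false)],
]

# Grow one flat table by appending every sub-table, repeating the outer id
# once per appended row so the two outputs stay aligned.
flat_ids = Int[]
flat_rows = eltype(eltype(subtables))[]
for (id, sub) in zip(ids, subtables)
    append!(flat_rows, sub)
    append!(flat_ids, fill(id, length(sub)))
end

println(length(flat_rows), " rows after expansion")
```

The outer columns are repeated by the row count of each sub-table, which plays the role that length plays for ordinary collections in flatten.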

sl-solution avatar May 02 '22 20:05 sl-solution

On second thoughts, I classify this as a bug, ~~and a fix is coming soon~~.

Originally, I thought we could use a separate path for empty collections; however, this creates other sorts of problems. E.g. if we have an Int[] value, keeping it as Int[] is not consistent, because it has not been flattened properly (?)

I think we should leave it as a quirk of the package (?)

sl-solution avatar May 02 '22 20:05 sl-solution

Hi there, would flattening Int[] to nothing solve this problem?

giantmoa avatar May 10 '22 23:05 giantmoa

would flattening Int[] to nothing solve this problem?

Probably not, since dealing with nothing is not easy. IMD handles missing efficiently in many functions; nothing, however, would be inconvenient.

sl-solution avatar May 11 '22 06:05 sl-solution

I don't know what Int[] is exactly/formally, other than thinking of it as an empty vector. But could leaving it as it is give rise to problems?

A different hypothesis, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.

sprmnt21 avatar May 11 '22 08:05 sprmnt21

I don't know what Int[] is exactly/formally, other than thinking of it as an empty vector. But could leaving it as it is give rise to problems?

Leaving Int[] as it is has two problems: a) it is not consistent with the flattening operation, and b) it makes subsequent operations on the output data set inefficient (e.g. if flattening changes everything to Int and just one observation remains as Int[], the type of the whole processed column is affected).
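
Point b) can be checked in base Julia: a single leftover Int[] widens the element type of an otherwise integer vector to Any, while a missing keeps it as a small union that Julia handles efficiently. The variable names are only illustrative.

```julia
# One stray empty vector forces the whole column to the abstract type Any.
mixed = [1, 2, Int[]]
mixed_type = eltype(mixed)

# Using missing instead keeps a concrete small-union element type.
clean = [1, 2, missing]
clean_type = eltype(clean)

println(mixed_type, " vs ", clean_type)
```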

A different hypothesis, perhaps a bit risky, would be to put missing for everything that has a defined length equal to 0.

I am not sure this is the right way to handle it: an empty object is not equivalent to missing (?) BTW, #57 provides a convenient way to do this.
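
For anyone who does want the "empty becomes missing" behaviour, a pre-processing step along these lines could be applied to the column before flattening. The helper `empty_to_missing` is a hypothetical name for this sketch and is independent of whatever #57 provides; it only covers strings and arrays, which is an assumption about what "has a defined length" should include.

```julia
# Hypothetical helper: map empty strings and empty arrays to missing,
# and leave every other value untouched.
empty_to_missing(x::Union{AbstractString, AbstractArray}) = isempty(x) ? missing : x
empty_to_missing(x) = x

B = Any[["a", "b"], "", Int[], "b"]
cleaned = map(empty_to_missing, B)

println(cleaned)
```

Whether this mapping is appropriate depends on the data, since it discards the distinction between "empty" and "unknown" raised in the reply above.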

sl-solution avatar May 12 '22 07:05 sl-solution