
Improve flatten (slightly breaking)

Open bkamins opened this issue 4 years ago • 8 comments

Users find it confusing that strings are iterable, see https://discourse.julialang.org/t/flatten-in-case-column-contains-string-and-array/61284/5. They are not iterable in broadcasting. On the other hand, we should not follow the full semantics of Base.broadcastable, as that would e.g. disallow Dicts (this was the original reason why I went for the iterator interface).
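The Base behavior behind the confusion can be reproduced without DataFrames: generic flattening iterates strings character by character:

```julia
# Strings are iterable in Base, so generic flattening splits them into
# characters instead of keeping them whole:
mixed = ["ab", ["c", "d"]]
collect(Iterators.flatten(mixed))  # → Any['a', 'b', "c", "d"]
```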

I think the way to go is to create a list of special cases. Here is a collection of standard "suspects" that I think we should not flatten by default:

  • AbstractString
  • Pair
  • Markdown.MD

The question is whether we want them to be left "as is" or to throw an error. If we leave them "as is", then we should probably also consider what other types we want to leave "as is" (they currently error):

  • Symbol
  • Missing
  • Nothing

(I have listed only the most common cases)

The general question in the first place is: do we find it useful to complicate the API of flatten? My first instinct is to throw an error in the cases we do not want to allow to be flattened, as this will be simpler.

bkamins avatar May 17 '21 16:05 bkamins

As proposed in https://github.com/JuliaData/DataFrames.jl/issues/2766 we could:

  • error by default in the cases I have listed
  • add a kwarg that would leave "as is" all values that either do not support length or are on my list

bkamins avatar May 17 '21 16:05 bkamins

I am much more concerned about supporting missing than AbstractString. Though I understand both are important, I think having a mixture of Vector{String} and String is unlikely.

pdeffebach avatar May 17 '21 17:05 pdeffebach

Yeah - that is why I have listed all options to consider 😄. I also think missing will be much more common.

@sprmnt21 - thank you for raising this issue as it really is relevant.

bkamins avatar May 17 '21 17:05 bkamins

Maybe missing should be handled orthogonally to this issue, with an allowmissing keyword?

I think errors on scalars (for example, Symbols) are unexpected: since numbers (the most common scalars) work, I would expect every value to be treated either as a scalar or as a collection, with errors occurring only when there is a "real" issue like missing.

We could have either a blacklist (AbstractString, Pair, ...) or a whitelist (AbstractArray, AbstractDict, Tuple, Generator...). In either case, it seems a bit arbitrary to have a fixed list, so maybe we should also have a keyword argument for the user to supply their own list? For example the user could call flatten(df, :a, scalars=[AbstractString]) to solve the issue in the Discourse thread.
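A minimal sketch of the user-supplied list idea (the `scalars` keyword name is from the proposal above; nothing like it exists in the API):

```julia
# Treat a value as a scalar if it matches any type in the user-supplied list
is_scalar(x, scalars) = any(T -> x isa T, scalars)

is_scalar("abc", [AbstractString, Pair])   # true: left as-is / repeated
is_scalar(1 => 2, [AbstractString, Pair])  # true
is_scalar([1, 2], [AbstractString, Pair])  # false: flattened
```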

knuesel avatar May 17 '21 18:05 knuesel

I think we should take the absence of a length method seriously and not do a whitelist. I'm not super sure about keyword arguments. This seems rare enough that people can do the cleaning on their own beforehand, i.e. put every String in [x].
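The pre-cleaning mentioned here could look like this (a sketch, not part of the DataFrames API):

```julia
# Wrap every non-vector value in a 1-element vector so the column is
# uniformly flattenable:
col = ["solo", ["a", "b"]]
cleaned = [x isa AbstractVector ? x : [x] for x in col]
# cleaned == [["solo"], ["a", "b"]]
```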

Do we know if this problem came about "naturally", or was it from trying to work through the docs? Because the post on Discourse certainly seems a bit contrived.

pdeffebach avatar May 17 '21 18:05 pdeffebach

An example of the origin of this situation (I don't know how "natural" it is) is the data frame produced by the following expressions,

using DataFrames

list1 = DataFrame(ID=["0001P:ABCD", "0001R:ABCD", "2324P:CDCL", "5666TB:GHLA",
                      "789789TSB:ASLT", "2324R:CDCL", "79067P:OPRA", "79067R:OPRA"])
list2 = DataFrame(ID=["0001P:ABCD", "0001R:ABCD", "2324R:CDCL", "5666TB:GHLA"],
                  Pair=["PairOne", "PairOne", "Two", "Two"])

function pairing(s1, s2)
    if length(s1) == length(s2)
        idx = findall(x -> x != 0, cmp.(collect(s1), collect(s2)))
        return Set([s1[idx], s2[idx]])
    else
        return Set([])
    end
end

list = leftjoin(list1, list2, on=:ID)
nr = nrow(list)
list.twin = [[i, something(findfirst(x -> x == 1,
                 [Set(["P", "R"]) == pairing(list.ID[i], list.ID[j]) for j in 1:nr]), i)]
             for i in 1:nr]
list.pair1 = [coalesce(list.Pair[t[1]], list.Pair[t[2]]) for t in list.twin]
list.pair_1 = [coalesce(list.pair1[r], unique(list.ID[list.twin[r]])) for r in 1:nr]
sort(list, :Pair)

which was an attempt to resolve this problem

But I also believe that it is quite artificial and unlikely.

sprmnt21 avatar May 17 '21 20:05 sprmnt21

Good point about taking the absence of length seriously.

I also haven't seen a real-life case... I guess flattening a heterogeneous column is uncommon. So maybe not worth adding a scalars or similar keyword. On the other hand, cleaning the data by adding wrappers would probably have much worse performance?

knuesel avatar May 18 '21 08:05 knuesel

Following the hint of Base.broadcastable, which uses axes, leads to the implementation below. Note that cell x would be flattened according to length(axes(x)[1]), i.e. only the outermost layer (brackets [ ] for a vector, parentheses ( ) for a tuple) would be removed (this may be called flattening to depth level 1). E.g. the column with two items

[["aa", ["bb"]], "cc"]

is flattened to three items ["aa", ["bb"], "cc"].
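The axes-based length at the heart of the implementation can be tried on its own; any value for which taking the first axis fails counts as a scalar of length 1 and is left unflattened:

```julia
# Scalar-aware length: anything whose first axis cannot be taken counts as 1
flen(x) = try length(axes(x)[1]) catch; 1 end

flen([1, 2, 3])  # 3
flen((4, 5))     # 2  (tuples have axes)
flen(42)         # 1  (axes(42) == (), so indexing it throws and is caught)
```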

function flatten_depth1(df::AbstractDataFrame,
                        cols::Union{ColumnIndex, MultiColumnIndex})
    _check_consistency(df)

    idxcols = index(df)[cols]
    isempty(idxcols) && return copy(df)
    col1 = first(idxcols)
    # axes-based length: values without a first axis count as scalars of length 1
    flen(x) = try length(axes(x)[1]) catch; 1 end
    lengths = flen.(df[!, col1])
    if length(idxcols) > 1
        for col in idxcols[2:end]
            v = df[!, col]
            if any(x -> flen(x[1]) != x[2], zip(v, lengths))
                r = findfirst(x -> x != 0, flen.(v) .- lengths)
                colnames = _names(df)
                throw(ArgumentError("Lengths of iterables stored in columns :$(colnames[col1]) " *
                                    "and :$(colnames[col]) are not the same in row $r"))
            end
        end
        sort!(idxcols)
    end

    new_df = similar(df[!, Not(cols)], sum(lengths))
    for name in _names(new_df)
        repeat_lengths!(new_df[!, name], df[!, name], lengths)
    end
    for col in idxcols
        # e.g. [["aa", ["bb"]], "cc", ("dd", "ee")] is flattened to
        # ["aa", ["bb"], "cc", "dd", "ee"]: only the outermost layer is removed
        flattened_col = mapreduce(vcat, df[!, col], lengths) do i, l
            l > 1 ? [j for j in i] : (i isa AbstractVector ? i : [i])
        end
        insertcols!(new_df, col, _names(df)[col] => flattened_col)
    end

    return new_df
end

ianqsong avatar May 19 '21 09:05 ianqsong

I have thought about it and would add a scalar kwarg (we could discuss the name). It would accept a type, and values of this type would be repeated as many times as needed (akin to broadcasting). By default no type would be treated as a scalar. But, for example, if you pass scalar=Missing then missings would be expanded, and if you pass scalar=Union{Missing, AbstractString} then missings and strings would be expanded.
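A sketch of the proposed semantics (the scalar keyword does not exist yet; `expand` is a hypothetical helper, not DataFrames API):

```julia
# Values of the scalar type are wrapped so they behave like 1-element
# iterables and get repeated/expanded instead of erroring:
expand(x, ::Type{T}) where {T} = x isa T ? (x,) : x

col = [missing, [1, 2], [3]]
reduce(vcat, collect.(expand.(col, Missing)))
# 4-element vector: missing, 1, 2, 3
```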

What do you think?

CC @nalimilan

bkamins avatar Oct 14 '22 21:10 bkamins

I kind of regret that we don't treat strings as scalars, but now that this is the behavior, I agree that allowing users to pass types to treat as scalars via an argument is a reasonable solution.

nalimilan avatar Oct 17 '22 20:10 nalimilan