DataTables.jl icon indicating copy to clipboard operation
DataTables.jl copied to clipboard

complete_cases for selected columns

Open mkborregaard opened this issue 8 years ago • 21 comments

Arising from discussion here: https://discourse.julialang.org/t/is-there-a-dropna-for-dataframe/1777

I think it could be useful to have a version of complete_cases that only drops rows that have Nulls in specified columns. The use case is if you have a big dataset with thousands of rows and hundreds of columns, some of which have Nulls, and you want to do an operation on a subset of the columns in a statistical function (lm EDIT: I see that lm already deals with Nulls) or plot it. Running complete_cases would remove rows that had NAs in columns unrelated to the function.

In my experience, this scenario is very common: All the data is collected in a single DataTable, even some variables that may only be useful for a few operations and that have lots of Null values.

Also, I think it would be nice and efficient if this version (or all versions) of complete_cases returned a view. This would be much more efficient for on-the-fly computations, and would help distinguish complete_cases from dropna functionality. EDIT: I guess the user can make a view if desired with the current functionality, but using views would align functionality more with complete_cases!.

In the discourse post (https://discourse.julialang.org/t/is-there-a-dropna-for-dataframe/1777/19) , @mwsohn suggested some code that could form a useful basis for a PR. I am posting a modification of this for easy reference, but I take no credit for the code. Except for BitArray and view (which might also be suggested modifications to the existing complete_cases if BitArrays are more efficient), the only change from the current code in DataFrames is the args argument.

function complete_cases(dt::DataTable, args::Vector{Symbol})
    ba = BitArray(trues(size(dt,1)))
    for sym in args
        ba &= ~BitArray(isnull.(dt[sym]))
    end
    view(dt, ba) # or just ba if a view is not desired
end

mkborregaard avatar Mar 22 '17 09:03 mkborregaard

Sorry for the many edits.

mkborregaard avatar Mar 22 '17 10:03 mkborregaard

A pared down version is here: https://github.com/JuliaData/DataTables.jl/pull/39 I guess the suggestion to use views is separate.

mkborregaard avatar Mar 22 '17 10:03 mkborregaard

Let's comment on the PR for the first part of the proposal.

WRT returning a view, that makes sense, but we should have a clear policy applying to all functions (e.g. to ~nonunique~ unique too). So we need to identify all functions which would need to be checked and possibly changed.

nalimilan avatar Mar 22 '17 11:03 nalimilan

I like the views idea, but I'm worried making it part of completecases could be confusing. Then dropnull, dropnull! and completecases would all return DataTables and it wouldn't be immediately clear why completecases is needed without checking the documentation. The goal of https://github.com/JuliaData/DataTables.jl/pull/6 was to introduce dropnull as a means of making completecases more distinct, and this would make them more similar again.

Same reservations for nonunique, which then would behave similarly to unique except it would return a view rather than a copy? Anyone who desires a boolean array would have to implement new functions themselves.

Could returning a view (rather than a copy or inplace operation) be a keyword argument for dropnull and unique dropnull(DataTable, view=false) dropnull(DataTable, cols, view=false) unique(DataTable, view=false) unique(DataTable, cols, view=false)?

👍 on the main idea of having completecases for selected columns.

cjprybol avatar Mar 22 '17 17:03 cjprybol

I like the idea, though having it a keyword would violate type stability I think?

mkborregaard avatar Mar 22 '17 17:03 mkborregaard

Otherwise I see and agree with your argument for making completecases (and nonunique) return a Boolean vector and dropnull return (possibly) a view. IMHO view = true could even be the default, as that should work in most cases and be much faster.

mkborregaard avatar Mar 22 '17 18:03 mkborregaard

Of course when I said returning a view was a good idea, I was referring to functions which currently return a data table. I just confused unique and nonunique: only the former would return a view. Anyway we should rename nonunique that's too confusing.

nalimilan avatar Mar 22 '17 18:03 nalimilan

This all sounds good to me!

Out of curiosity, why does using keyword arguments violate type stability? I've seen examples where keyword arguments are typed, like here in CSV.jl. Are those not type stable?

nonunique -> isunique? isuniquerow? with inverted true/false output relative to nonunique

cjprybol avatar Mar 22 '17 18:03 cjprybol

Keyword arguments can be typed, the problem is that the return type of a function cannot depend on the value of an argument if we want inference to figure it out (there are also complications due to the type of keyword arguments not being taken into account by inference to compute the return type, but that's another issue).

I'm not sure about the naming. isunique sounds too much like it would return a single Bool. (non)uniqueindices would be more explicit, but maybe a bit verbose.

nalimilan avatar Mar 22 '17 18:03 nalimilan

e.g.

julia> function test(; tst = true)
       tst && return 1
       1.
       end
test (generic function with 1 method)

julia> function test2()
       1
       end
test2 (generic function with 1 method)

julia> function test2(x)
       1.
       end
test2 (generic function with 2 methods)

julia> @code_warntype test(tst = true)
Variables:
  #unused#::#kw##test
  #temp#@_2::Array{Any,1}
  ::#test
  tst::Any
  #temp#@_5::Int64
  #temp#@_6::Int64
  #temp#@_7::Int64
  #temp#@_8::Any
  #temp#@_9::Union{Float64,Int64}

Body:
  begin 
      tst::Any = true
      SSAValue(2) = (Base.arraylen)(#temp#@_2::Array{Any,1})::Int64
      SSAValue(3) = (Base.select_value)((Base.sle_int)(0,1)::Bool,(Base.box)(Int64,(Base.ashr_int)(SSAValue(2),(Base.box)(UInt64,1))),(Base.box)(Int64,(Base.shl_int)(SSAValue(2),(Base.box)(UInt64,(Base.box)(Int64,(Base.neg_int)(1))))))::Int64
      SSAValue(4) = (Base.select_value)((Base.sle_int)(1,SSAValue(3))::Bool,SSAValue(3),(Base.box)(Int64,(Base.sub_int)(1,1)))::Int64
      #temp#@_5::Int64 = 1
      6: 
      unless (Base.box)(Base.Bool,(Base.not_int)((#temp#@_5::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(4),1)))::Bool)) goto 21
      SSAValue(5) = #temp#@_5::Int64
      SSAValue(6) = (Base.box)(Int64,(Base.add_int)(#temp#@_5::Int64,1))
      #temp#@_6::Int64 = SSAValue(5)
      #temp#@_5::Int64 = SSAValue(6)
      #temp#@_7::Int64 = (Base.box)(Int64,(Base.sub_int)((Base.box)(Int64,(Base.mul_int)(#temp#@_6::Int64,2)),1))
      #temp#@_8::Any = (Core.arrayref)(#temp#@_2::Array{Any,1},#temp#@_7::Int64)::Any
      unless (#temp#@_8::Any === :tst)::Bool goto 17
      tst::Any = (Core.arrayref)(#temp#@_2::Array{Any,1},(Base.box)(Int64,(Base.add_int)(#temp#@_7::Int64,1)))::Any
      goto 19
      17: 
      (Base.throw)($(Expr(:new, :(Base.MethodError), :((Core.getfield)((Core.getfield)((Core.getfield)(#test,:name)::TypeName,:mt),:kwsorter)), :((Core.tuple)(#temp#@_2,)::Tuple{Array{Any,1},#test}))))::Union{}
      19: 
      goto 6
      21: 
      # meta: location REPL[7] #test#1 2
      unless tst::Any goto 26
      #temp#@_9::Union{Float64,Int64} = 1
      goto 29
      26:  # line 3:
      #temp#@_9::Union{Float64,Int64} = 1.0
      29: 
      # meta: pop location
      return #temp#@_9::Union{Float64,Int64}
  end::Union{Float64,Int64}

julia> @code_warntype test2(true)
Variables:
  #self#::#test2
  x::Bool

Body:
  begin 
      return 1.0
  end::Float64

mkborregaard avatar Mar 22 '17 18:03 mkborregaard

Yes, to be 100% honest I also started from the belief that completecases would return the complete cases, e.g. a DataTable :blush: . Maybe arenonunique ? (no that's horrible). I think isnonunique could work. And iscomplete.

mkborregaard avatar Mar 22 '17 19:03 mkborregaard

Yes, to be 100% honest I also started from the belief that completecases would return the complete cases, e.g. a DataTable 😊 .

^ coming from DataFrames, that's exactly what you should have expected!

Let's start with what we need, trying to forget what functions currently exist and pretending we have a clean slate.

  1. boolean vector indicating which rows are complete/nullfree.
  2. boolean vector indicating which rows are the first unique instance of a row.

Right now, completecases satisfies the first, and the inverse of nonunique, .!nonunique, would satisfy the second. I don't think we should add any function with non in front because the double-negative of .!non is kind of silly right from the beginning

  • deprecate the name nonunique

  • replace it with isunique.

    • isunique(dt, 1) would return a boolean vector indicating if for each row that row is the first unique instance of a given row in a datatable. .!isunique(dt, 1) == nonunique(dt)
    • isunique(dt, 2) would return a boolean vector indicating if for each column that column is the first unique instance of a given column in a datatable.
      • unique columns will be based on values without checking names, as columns cannot share names.
    • isunique(dt) = all(isunique(dt, 1)) && all(isunique(dt, 2))
  • We update unique(dt) to unique-ify rows AND columns. I imagine people don't usually have duplicate columns, so even though users currently expect unique(dt) to only apply to rows, I think this should be ok?

    • unique(dt, 1) unique-ify the datatable based on rows
    • unique(dt, 2) unique-ify the datatable based on columns
  • deprecate the name completecases

  • rename it to iscomplete

    • iscomplete(dt, 1) would return a boolean vector indicating if for each row that row is complete (nullfree). iscomplete(dt, 1) == completecases(dt)
    • iscomplete(dt, 2) would return a boolean vector indicating if for each column that column is complete (nullfree)
    • iscomplete(dt) = all(iscomplete(dt, 2))
      • and if we aren't doing this already, iscomplete(dt, 1)/completecases should maybe run iscomplete(dt, 2) first, as in most cases I think nrows >> ncols and if you only have to check that eltypes(cols) .!<: Nullable on 3 columns to avoid checking 100K rows, that might speed things up.

With these changes, iscomplete(dt) and isunique(dt) will return single boolean values, as @nalimilan suggested he would expect and I agree is probably the most logical based on the naming. I imagine users would begin running @assert iscomplete(dt) && isunique(dt) as a nice pre-check when working with new datasets. If you specify that you want to check row-wise or col-wise, then you'll break from that "single value" convention and get the boolean indices for all rows and columns as everyone (I think?) desires. Writing all(iscomplete(dt, 1)) isn't all that bad either if people really want single values for a single dimension. I think this will be helpful in that it unifies the behavior patterns of iscomplete and isunique and lets us completely start from scratch rather than carry over one of the two different behaviors of complete_cases and complete_cases!.

I think that covers all cases and seems to be a decent naming strategy that holds across functions and utilizes the widely used function(input, axis) convention to be specific about what return type you want. Thoughts?

cjprybol avatar Mar 22 '17 22:03 cjprybol

I gave those changes a shot and opened a PR for them. We may not decide to go that direction, but it was helpful to confirm the approach would be viable.

Others I have talked to said they would expect the function completecases to return a DataTable, so I think scrapping the name completecases entirely would be beneficial.

@mkborregaard could you apply the changes (returning a view and only checking the appropriate columns for complete cases) to dropnull in https://github.com/JuliaData/DataTables.jl/pull/39? If you wouldn't mind, could you also try a few benchmarks of the before/after speeds when writing tests? Having a few comparisons would be helpful both for evaluating the PR and to have on hand when we figure out how to automate performance testing.

Thanks to both of you for the explanations on the type insecurity.

cjprybol avatar Mar 23 '17 05:03 cjprybol

A few thoughts:

  1. R calls nonunique duplicated - so you can subset by subset(df, !duplicated(name)), which reads quite intuitively.
  2. Would having an argument (in your example 1 or 2, though I'd prefer :rows or :columns/:cols) to specify the direction conflict with the functionality I added/suggested to specify the columns to check for uniques/nulls (because of occupying the argument position)? Or would the column specification only apply to dropnull and unique? That would make unique and isunique quite different I think. Which may not be a bad thing, but I'm not sure.
  3. What is the use case for isunique(dt, 1 (or 2))? I think originally it was to use for sorting the datatable - e.g. df[!nonunique(df[1:3]), :], but that use case could be replaced by unique(dt, 1:3).

mkborregaard avatar Mar 23 '17 07:03 mkborregaard

And yes, I am happy to update dropnull as suggested and do the tests/benchmarks - I will do it at the same time as writing the promised tests for the predict PR :-) So it may be a few days.

mkborregaard avatar Mar 23 '17 07:03 mkborregaard

I don't like isunique(dt), as the question of whether there are duplicated columns does not make much sense to me. Also, to check whether there are duplicate rows, we should rather override allunique. Data tables are not exactly like arrays, their dimensions are not really symmetric so we don't always need to support operations for both rows and columns.

iscomplete(dt) makes more sense, but I'm not sure I like the generalization to rows and columns. I don't think there's any precedent for this in Julia, and in DataTables generally functions do not support choosing the dimension to consider: as I said, they are not symmetric. There's also the function anynull which does the same thing, but which could be deprecated in favor of any(isnull, x); at the very least we need to consider these functions together.

R calls nonunique duplicated - so you can subset by subset(df, !duplicated(name)), which reads quite intuitively.

@mkborregaard Yet that pattern is terribly wasteful since it creates a temporary vector. If we expect people to use the ! often, better define it other other way around. Also the term "duplicated" isn't used in the Julia API, "unique" (i.e. the reverse) is always used.

Would having an argument (in your example 1 or 2, though I'd prefer :rows or :columns/:cols) to specify the direction conflict with the functionality I added/suggested to specify the columns to check for uniques/nulls (because of occupying the argument position)? Or would the column specification only apply to dropnull and unique? That would make unique and isunique quite different I think. Which may not be a bad thing, but I'm not sure.

Yes, dropnull and unique currently operates row-wise, and this is very natural. This is another reason for not having iscomplete and isunique take the dimension as a second argument: better have consistent pairs of functions.

What is the use case for isunique(dt, 1 (or 2))? I think originally it was to use for sorting the datatable - e.g. df[!nonunique(df[1:3]), :], but that use case could be replaced by unique(dt, 1:3).

That's indeed a good question, since completecases clearly has its origins in the R inspiration of DataFrames. In R it's idiomatic to do df[complete_cases(df),]. In Julia you wouldn't never write this, and providing this function could even be misleading for new users.

Yet there may be valid use cases, like storing the vector of complete cases somewhere for later use (e.g. when building a model matrix, to remember where incomplete cases are for predict, cf. https://github.com/JuliaStats/StatsModels.jl/issues/17). But this is an advanced use case and the docs for completecases should point to dropnull/dropnull! as the recommended solution.

nalimilan avatar Mar 23 '17 12:03 nalimilan

All makes very good sense. So what do you suggest? Keeping the current methods, implementing a col-specific version of dropnull!, returning views rather than copies by default from methods that return a DataTable (the user can always copy manually), and changing the names of nonunique and completecases?

mkborregaard avatar Mar 23 '17 13:03 mkborregaard

I tried not to make too many assumptions about user patterns when thinking about isunique and iscomplete behavior. I think we all have our own assumptions about what kind of patterns are common and justifiable, but one justification I could think of for isunique on columns is what if someone wanted to do several outer joins and confirm that they hadn't introduced duplicate columns? Googling "remove duplicate column after join" yields > 250K results so I don't think it's an invalid case, even if rare.

cjprybol avatar Mar 23 '17 22:03 cjprybol

If we restrict the column subset to require an array I think all of these behaviors could be supported at once.

e.g.

unique(dt, 1) # unique based on rows
unique(dt, [1]) # unique, restricted to only checking for uniqueness in first column
etc...

@mkborregaard would that be an acceptable compromise to make when specifying specific columns to check?

cjprybol avatar Mar 23 '17 22:03 cjprybol

I'm not super fond of it, sorry. I agree with @nalimilan that it is important to keep the distinction between columns and rows in DataTables. And I have a personal dislike for functions where you cannot do indexin(3, 1:5) but have to do indexin([3], 1:5).

mkborregaard avatar Mar 24 '17 06:03 mkborregaard

I agree that using brackets/passing arrays where they aren't necessary is pretty ugly. What about recycling the on= keyword from join? That wouldn't change the return type as it did for my earlier suggestion, and I think everyone would immediately intuit what completecases(dt, on={Int, Symbol, Vector{Int}, Vector{Symbol}} means that the function is only checking either the individual (bracket free) column or the series of columns in the array. I'm also open to using :rows/:cols to indicate dimensions in addition to or instead of 1/2.

I find the argument that we shouldn't support columnar checks because datatables packages from other languages don't support them (lack of precedent) to be a weak one. There's evidence that people have run into the issue of checking columns for duplicates in the past and found it difficult enough to post to SO and other venues about. I'm in full agreement that the functionality you are asking for should be possible! I would just like that it didn't come at the cost of being unable to support other functionality.

edit: We could also make dim= be the required keyword and @mkborregaard could go forward with his original proposal in the original form. That too would allow all of these functionalities to coexist.

edit 2: if we use the dim= keyword we could leave the current system (only check rows, not rows and columns) in place by setting 1 or :rows as the default. That wouldn't require changing any code, behavior, or syntax but would enable extending these functions to columns.

cjprybol avatar Mar 24 '17 17:03 cjprybol