DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Improve the performance of describe() in the case of missing values.

Open sl-solution opened this issue 3 years ago • 25 comments

Congratulations on release 1.0.0

Regarding the missing values in the describe() function would it be possible to drop the skipmissing() function? The reason for this is purely performance wise. To demonstrate let me run some benchmarks (I assume the extreme case when all variables have missing values, other situation would be very large data set with few variables with missing values)

df = DataFrame(rand(10^5,100),:auto)
allowmissing!(df)
@btime describe(df,:mean);
  9.571 ms (1070 allocations: 87.75 KiB)

however if we used customised mean(), something like

Statistics.mean(f, ::AbstractArray{Missing,1}) = missing

function Statistics.mean(f, x::AbstractArray{Union{T,Missing},1}) where {T <: Number}
    _op(y1,y2) = (Base.add_sum(y1[1],y2[1]),Base.add_sum(y1[2],y2[2]))::Tuple{T,Int}
    _dmiss(y) = (ismissing(f(y)) ? zero(T) : f(y), !ismissing(f(y)))::Tuple{T,Bool}
    sval, n = mapreduce(_dmiss, _op, x)::Tuple{T, Int}
    n == 0 ? missing : sval/n
end

Statistics.mean(x::AbstractArray{Union{T,Missing},1}) where {T <: Number} = mean(identity, x)

then we half the running time:

@btime mapcols(mean, df);
  4.325 ms (232 allocations: 23.97 KiB)

Customising the std() function gives even more benefit, (in the following code I used a tweaked version of std())

@btime describe(df, :std);
  17.819 ms (1469 allocations: 94.02 KiB)
@btime mapcols(std, df);
  4.876 ms (232 allocations: 23.97 KiB)

sl-solution avatar Apr 21 '21 22:04 sl-solution

See https://github.com/JuliaData/DataFrames.jl/pull/2694 and the issued referenced there for a discussion.

The key question is if you need describe to be super fast? (I do not say it should not be, but it would be good to understand the use case before moving forward with the discussion)

bkamins avatar Apr 21 '21 22:04 bkamins

Just one thing to add is that the situation for median should be similar to std example, i.e. 4x faster, and in this case we can easily save seconds for medium to large data sets.

sl-solution avatar Apr 21 '21 23:04 sl-solution

Now that I have slept over your issue I understand that the core of what you want to highlight is:

julia> x = rand(10^6);

julia> @btime mean($x);
  282.190 μs (0 allocations: 0 bytes)

julia> @btime mean(skipmissing($x));
  1.034 ms (0 allocations: 0 bytes)

And I agree that this is significant. However, this should be fixed in Statistics.jl and StatsBase.jl as this is a general performance hiccup - @nalimilan what do you think?

bkamins avatar Apr 22 '21 06:04 bkamins

I guess I can add a little to this: I think the skipmissing(), mean() ... are supposed to be general functions to work with many different data structures in many different situations, however, in DataFrames we are dealing with rectangle data (and in practice, mostly with numbers and/or string as eltype) and might be easier(?!) to optimise these, specially for mean, std, median,... which are very well defined in the context of data analysis.

sl-solution avatar Apr 22 '21 11:04 sl-solution

just to add an example to demonstrate my point let me go through a simple example. In the following code there is nothing wrong about the count() function, however, in the context of counting the number of missing values, we don't need to use it.

x = rand(10^6)
x = allowmissing(x)
@btime count(ismissing,x);
  501.475 μs (0 allocations: 0 bytes)
@btime mapreduce(ismissing,+,x)
 124.164 μs (0 allocations: 0 bytes)

sl-solution avatar Apr 22 '21 11:04 sl-solution

This is a very good example, but I think it supports my claim if you investigate it further (I am re-running all to have timings on the same machine):

julia> @btime count(ismissing,$x);
  501.200 μs (0 allocations: 0 bytes)

julia> @btime mapreduce(ismissing,+,$x);
  87.300 μs (0 allocations: 0 bytes)

julia> @btime count(ismissing,$x,dims=1);
  84.600 μs (1 allocation: 96 bytes)

and as you can see the problem is with count implementation in Julia Base.

In general the design principle we want to stick to is to solve performance issues at their root. In this case ismissing, Number, and count are defined in Julia Base, so the performance problem should be solved in Julia Base.

Similarly mean is defined in Statistics.jl, so its performance should be solved there. If we started to do otherwise we would immediately hit the problems with bad consequences of type piracy.

Additionally - as shown above - as you can see - the problem with count is probably rather easy to fix in Julia Base.

bkamins avatar Apr 22 '21 11:04 bkamins

Also:

julia> @btime count(Base.Generator(ismissing, x))
  71.000 μs (1 allocation: 16 bytes)

bkamins avatar Apr 22 '21 11:04 bkamins

See https://github.com/JuliaLang/julia/pull/40564

bkamins avatar Apr 22 '21 11:04 bkamins

I agree things should and probably can be improved in Julia rather than in DataFrames. The only change that could be appropriate in DataFrames is to call collect(skipmissing(x)) or view(x, .!ismissing.(x)) internally if paying this cost once is faster given that we call multiple functions (this is something that Base functions cannot do since they don't know this). But AFAICT that's not really the case (#2694).

nalimilan avatar Apr 22 '21 12:04 nalimilan

Also it's worth noting that mean is relatively complex as it ensures that accumulation is performed in the output type (https://github.com/JuliaLang/Statistics.jl/pull/25). Maybe that hurts performance with skipmissing and it could be simplified for simple cases like Float64.

nalimilan avatar Apr 22 '21 12:04 nalimilan

Yes, but it should happen in Statistics.jl.

bkamins avatar Apr 22 '21 12:04 bkamins

Maybe some of these should be customised for DataFrames, e.g. if we are dealing with one dimensional numeric columns then there are less general but faster way to calculate std or q25 or ... in presence of missing values.

sl-solution avatar Apr 22 '21 18:04 sl-solution

Maybe some of these should be customised for DataFrames, e.g. if we are dealing with one dimensional numeric columns then there are less general but faster way to calculate std or q25 or ... in presence of missing values.

Yes, but this should happen in general, not just for DataFrames.jl.

bkamins avatar Apr 22 '21 18:04 bkamins

Just a side question, is there any situation that some one working with DataFrame wouldn't like to deal with missings? (i.e. a scenario that f(x::Vector) return missing when any but not all of x is missing is desirable)

sl-solution avatar Apr 23 '21 22:04 sl-solution

I am not sure if I understand the question. Do you ask if there are realistic scenarios in which one would want missing to propagate like e.g. in:

julia> sum([1, missing])
missing

?

bkamins avatar Apr 23 '21 22:04 bkamins

I am not sure if I understand the question. Do you ask if there are realistic scenarios in which one would want missing to propagate like e.g. in:

julia> sum([1, missing])
missing

?

yes, something like this.

sl-solution avatar Apr 24 '21 00:04 sl-solution

I guess this is a commonly agreed standard how functions in statistics realm should work by default. Here is an example from R session:

> sum(c(1, NA))
[1] NA
> mean(c(1, NA))
[1] NA

I guess the reason is to make sure that missing do not get silently ignored by the analyst, but rather consciously handled.

bkamins avatar Apr 24 '21 08:04 bkamins

I guess this is a commonly agreed standard how functions in statistics realm should work by default. Here is an example from R session:

> sum(c(1, NA))
[1] NA
> mean(c(1, NA))
[1] NA

I guess the reason is to make sure that missing do not get silently ignored by the analyst, but rather consciously handled.

It may not be commonly agreed standard, since I can not recall any other statistical software with similar behaviour. The problem with this approach to handle the missings is that we bother the analyst every time to make sure missings are not ignored. But a simple way to remind analyst this is to display the variables with missings differently which already DataFrames does by showing different column type. My point with this is that skipmissing might not be only approach for DataFrames. Since there is no such a concept as "skipping missings". skipmissing is basically deleting observation with missing, this work for some functions like sum(although not efficiently currently) however, if someone likes to subtract each observation from the mean value, skip works only partially. DataFrames approach to missing values can be handling every thing behind the scene and just keep user informed(e.g. by displaying some notes about the producing missing during some tasks ).

sl-solution avatar Apr 24 '21 09:04 sl-solution

Please keep in mind that DataFrames.jl does not - and will not - define any statistical functions. It is a package for managing tabular data. I see your point, but simply the functions you ask for are not and will not be defined in DataFrames.jl. They should be defined in separate packages and then they can just be used in DataFrames.jl as any other functions (and if you are willing to implement such a package it would be very interesting to see the comparison). E.g. the current behavior of sum is defined in Julia Base and it will not change till 2.0 release of Julia (and I assume it is unlikely that such change would be accepted by core Julia developers), so you need a separate package with a separate function definition that would do what you want (the good thing about Julia is that it should be relatively easy to do and it will be fast).

e.g. by displaying some notes about the producing missing during some tasks

This is something we will not do. DataFrames.jl does not display any messages when it does its operations. This package is designed for production use where the assumption is that such messages would be never seen. Messages are passed by function return values or errors thrown (and returning missing is exactly one of such messages).

bkamins avatar Apr 24 '21 09:04 bkamins

I understand your points and I am ok with them, but I was thinking about something mild like fast path of aggregation in grouped data frame rather than dealing (or modifying) the larger echo system of Julia.

sl-solution avatar Apr 24 '21 10:04 sl-solution

You seem to assume that operations on data frame columns can be faster than on general AbstractArray or AbstractVector objects, but that's definitely not the case. And sum(skipmissing(x)) is already very fast, maybe even as fast as it could possibly be even when writing an optimized implementation manually (mean is slower but I already developed this above).

I understand your points and I ok with them, but I was thinking about something like fast path of aggregation in grouped data frame rather than dealing (or modifying) the larger echo system of Julia.

I don't understand what this means.

nalimilan avatar Apr 24 '21 10:04 nalimilan

I don't understand what this means.

Sorry for confusion, let me elaborate this. In DataFrames when an aggregation(like sum, mean, ...) is done for a GroupedDataFrame a customised approach is used to speed things up.

... And sum(skipmissing(x)) is already very fast, maybe even as fast as it could possibly be even when writing an optimized implementation manually (mean is slower but I already developed this above).

function sim_sum(x::Vector{Union{T, Missing}}) where T
           all(ismissing, x) && return missing
           _dmiss(y) = ismissing(y) ?  zero(T) :  y
           mapreduce(_dmiss, Base.add_sum, x)
  end
x=rand(10^6)
x=allowmissing(x)
x[rand(1:length(x),1000)].=missing
@btime sum(skipmissing($x))
  880.731 μs (5 allocations: 80 bytes)
499297.9721823642
@btime sim_sum(($x))
  328.393 μs (0 allocations: 0 bytes)
499297.9721823642

sl-solution avatar Apr 24 '21 11:04 sl-solution

Yes, but this means that we should improve sum(skipmissing(x)) in Julia base. Thank you for pointing this out.

bkamins avatar Apr 24 '21 11:04 bkamins

Interesting. This doesn't happen with integers, only with floats. But this is more tricky than it seems due to the requirement to handle -0.0. See https://github.com/JuliaLang/julia/issues/40599.

nalimilan avatar Apr 24 '21 21:04 nalimilan

Sorry for confusion, let me elaborate this. In DataFrames when an aggregation(like sum, mean, ...) is done for a GroupedDataFrame a customised approach is used to speed things up.

A special implementation is used because no equivalent operation exists in Base. That's not the case for things that describe does.

nalimilan avatar Apr 24 '21 21:04 nalimilan