DataFrames.jl
Improve the performance of describe() in the case of missing values.
Congratulations on release 1.0.0
Regarding the missing values in the describe() function: would it be possible to drop the skipmissing() call? The reason is purely performance. To demonstrate, let me run some benchmarks (I assume the extreme case where all variables have missing values; another situation would be a very large data set with a few variables containing missing values):
```julia
df = DataFrame(rand(10^5, 100), :auto)
allowmissing!(df)

@btime describe(df, :mean);
#  9.571 ms (1070 allocations: 87.75 KiB)
```
However, if we use a customised mean(), something like
```julia
Statistics.mean(f, ::AbstractArray{Missing,1}) = missing

function Statistics.mean(f, x::AbstractArray{Union{T,Missing},1}) where {T <: Number}
    _op(y1, y2) = (Base.add_sum(y1[1], y2[1]), Base.add_sum(y1[2], y2[2]))::Tuple{T,Int}
    _dmiss(y) = (ismissing(f(y)) ? zero(T) : f(y), !ismissing(f(y)))::Tuple{T,Bool}
    sval, n = mapreduce(_dmiss, _op, x)::Tuple{T,Int}
    return n == 0 ? missing : sval / n
end

Statistics.mean(x::AbstractArray{Union{T,Missing},1}) where {T <: Number} = mean(identity, x)
```
then we halve the running time:
```julia
@btime mapcols(mean, df);
#  4.325 ms (232 allocations: 23.97 KiB)
```
Customising the std() function gives even more benefit (in the following code I used a tweaked version of std()):
```julia
@btime describe(df, :std);
#  17.819 ms (1469 allocations: 94.02 KiB)

@btime mapcols(std, df);
#  4.876 ms (232 allocations: 23.97 KiB)
```
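The tweaked std() is not shown here; as a rough sketch of what such a missing-aware implementation could look like (the name `sim_std` and the two-pass structure are my assumptions, not the exact version used in the benchmark):

```julia
using Statistics

# Hypothetical sketch: a missing-aware sample standard deviation for a
# numeric column, accumulating over non-missing values in two passes
# instead of going through a skipmissing iterator.
function sim_std(x::AbstractVector{Union{T, Missing}}) where {T <: Number}
    # first pass: mean over the non-missing values
    s = zero(T)
    n = 0
    for y in x
        if !ismissing(y)
            s += y
            n += 1
        end
    end
    n == 0 && return missing
    m = s / n
    # second pass: corrected sum of squared deviations
    ss = zero(m)
    for y in x
        if !ismissing(y)
            d = y - m
            ss += d * d
        end
    end
    return sqrt(ss / (n - 1))  # NaN for n == 1, matching std()
end
```

For example, `sim_std(Union{Float64, Missing}[1.0, 2.0, missing, 4.0])` agrees with `std(skipmissing(...))` on the same data.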
See https://github.com/JuliaData/DataFrames.jl/pull/2694 and the issues referenced there for a discussion.
The key question is whether you need describe to be super fast. (I am not saying it should not be, but it would be good to understand the use case before moving forward with the discussion.)
Just one thing to add: the situation for median should be similar to the std example, i.e. 4x faster, and in that case we can easily save seconds for medium to large data sets.
Now that I have slept on your issue, I understand that the core of what you want to highlight is:
```julia
julia> x = rand(10^6);

julia> @btime mean($x);
  282.190 μs (0 allocations: 0 bytes)

julia> @btime mean(skipmissing($x));
  1.034 ms (0 allocations: 0 bytes)
```
And I agree that this is significant. However, this should be fixed in Statistics.jl and StatsBase.jl as this is a general performance hiccup - @nalimilan what do you think?
I guess I can add a little to this: I think skipmissing(), mean(), etc. are supposed to be general functions that work with many different data structures in many different situations. However, in DataFrames we are dealing with rectangular data (and in practice mostly with numbers and/or strings as eltype), so it might be easier(?!) to optimise these, especially mean, std, median, ..., which are very well defined in the context of data analysis.
Just to add an example to demonstrate my point, let me go through a simple case. In the following code there is nothing wrong with the count() function; however, in the context of counting the number of missing values, we don't need to use it.
```julia
x = rand(10^6)
x = allowmissing(x)

@btime count(ismissing, x);
#  501.475 μs (0 allocations: 0 bytes)

@btime mapreduce(ismissing, +, x);
#  124.164 μs (0 allocations: 0 bytes)
```
This is a very good example, but I think it supports my claim if you investigate it further (I am re-running everything to have timings on the same machine):
```julia
julia> @btime count(ismissing, $x);
  501.200 μs (0 allocations: 0 bytes)

julia> @btime mapreduce(ismissing, +, $x);
  87.300 μs (0 allocations: 0 bytes)

julia> @btime count(ismissing, $x, dims=1);
  84.600 μs (1 allocation: 96 bytes)
```
As you can see, the problem is with the `count` implementation in Julia Base.

In general, the design principle we want to stick to is to solve performance issues at their root. In this case `ismissing`, `Number`, and `count` are defined in Julia Base, so the performance problem should be solved in Julia Base.
Similarly, `mean` is defined in Statistics.jl, so its performance should be fixed there. If we started to do otherwise, we would immediately hit the bad consequences of type piracy.
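As a one-line illustration of the type-piracy problem (my example, not from the thread): the definition below owns neither the function `mean` nor the Base types in its signature, so merely loading it silently changes behavior for every other package in the session.

```julia
using Statistics

# Type piracy: redefining a Statistics.jl function on types we don't own.
# Any code anywhere that relied on `mean` propagating `missing` for such
# vectors now silently gets a number instead.
Statistics.mean(x::Vector{Union{Float64, Missing}}) = mean(skipmissing(x))

mean([1.0, missing, 3.0])  # 2.0 instead of missing, session-wide
```

This is why such redefinitions belong in the package that owns the function (Statistics.jl), not in DataFrames.jl.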
Additionally, as shown above, the problem with `count` is probably rather easy to fix in Julia Base.
Also:

```julia
julia> @btime count(Base.Generator(ismissing, x))
  71.000 μs (1 allocation: 16 bytes)
```
See https://github.com/JuliaLang/julia/pull/40564
I agree things should, and probably can, be improved in Julia rather than in DataFrames. The only change that could be appropriate in DataFrames is to call `collect(skipmissing(x))` or `view(x, .!ismissing.(x))` internally, if paying this cost once is faster given that we call multiple functions (this is something that Base functions cannot do, since they don't know this). But AFAICT that's not really the case (#2694).
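The collect-once idea can be sketched as follows (hypothetical helper, only to illustrate the trade-off of paying the allocation once and then running every statistic on a dense vector):

```julia
using Statistics

# Materialize the non-missing values once, then compute several
# statistics over the resulting dense vector.
function summary_stats(x::AbstractVector)
    v = collect(skipmissing(x))  # one allocation, shared by all statistics
    isempty(v) && return (mean = missing, std = missing, median = missing)
    return (mean = mean(v), std = std(v), median = median(v))
end

summary_stats(Union{Float64, Missing}[1.0, missing, 3.0])
# (mean = 2.0, std = 1.4142135623730951, median = 2.0)
```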
Also, it's worth noting that `mean` is relatively complex, as it ensures that accumulation is performed in the output type (https://github.com/JuliaLang/Statistics.jl/pull/25). Maybe that hurts performance with `skipmissing`, and it could be simplified for simple cases like `Float64`.
Yes, but it should happen in Statistics.jl.
Maybe some of these should be customised for DataFrames; e.g., if we are dealing with one-dimensional numeric columns, there are less general but faster ways to calculate std or q25 or ... in the presence of missing values.
> Maybe some of these should be customised for DataFrames; e.g., if we are dealing with one-dimensional numeric columns, there are less general but faster ways to calculate std or q25 or ... in the presence of missing values.
Yes, but this should happen in general, not just for DataFrames.jl.
Just a side question: is there any situation where someone working with a DataFrame wouldn't want to deal with missings? (i.e. a scenario where f(x::Vector) returning missing when some, but not all, of x is missing is desirable)
I am not sure if I understand the question. Do you ask if there are realistic scenarios in which one would want `missing` to propagate, like e.g. in the following?

```julia
julia> sum([1, missing])
missing
```
Yes, something like this.
I guess this is a commonly agreed standard for how functions in the statistics realm should work by default. Here is an example from an R session:

```r
> sum(c(1, NA))
[1] NA
> mean(c(1, NA))
[1] NA
```

I guess the reason is to make sure that `missing` does not get silently ignored by the analyst, but is consciously handled instead.
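The Julia equivalent keeps the same default and gives the analyst explicit opt-outs (all standard Base/Statistics API):

```julia
using Statistics

x = [1.0, missing, 3.0]

sum(x)                   # missing: propagation forces a conscious decision
sum(skipmissing(x))      # 4.0: explicitly drop the missing values
mean(coalesce.(x, 0.0))  # explicitly impute zeros, then aggregate
```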
It may not be a commonly agreed standard, since I cannot recall any other statistical software with similar behaviour. The problem with this approach to handling `missing`s is that we bother the analyst every time to make sure `missing`s are not ignored. But a simple way to remind the analyst of this is to display the variables with `missing`s differently, which DataFrames already does by showing a different column type.
My point with this is that `skipmissing` might not be the only approach for DataFrames, since there is no such concept as "skipping missings". `skipmissing` basically deletes the observations with missing values. This works for some functions like `sum` (although not efficiently at the moment); however, if someone wants to subtract each observation from the mean value, skipping only works partially. DataFrames' approach to `missing` values could be to handle everything behind the scenes and just keep the user informed (e.g. by displaying some notes about producing `missing` during some tasks).
Please keep in mind that DataFrames.jl does not - and will not - define any statistical functions. It is a package for managing tabular data. I see your point, but the functions you ask for simply are not, and will not be, defined in DataFrames.jl. They should be defined in separate packages, and then they can be used in DataFrames.jl like any other functions (and if you are willing to implement such a package, it would be very interesting to see the comparison). E.g. the current behavior of `sum` is defined in Julia Base and it will not change until the 2.0 release of Julia (and I assume it is unlikely that such a change would be accepted by the core Julia developers), so you need a separate package with a separate function definition that does what you want (the good thing about Julia is that it should be relatively easy to do, and it will be fast).
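As a sketch of what such a separate package could export (the name `msum` and the "missing only when everything is missing" semantics are my assumptions about what was asked for):

```julia
# No type piracy here: the package owns the name `msum`, so it can pick
# whatever missing-handling semantics it wants without affecting `sum`.
msum(x::AbstractVector) = all(ismissing, x) ? missing : sum(skipmissing(x))

msum([1, missing, 3])     # 4
msum([missing, missing])  # missing
```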
> e.g. by displaying some notes about producing `missing` during some tasks
This is something we will not do. DataFrames.jl does not display any messages when it performs its operations. This package is designed for production use, where the assumption is that such messages would never be seen. Messages are passed via function return values or errors thrown (and returning `missing` is exactly one such message).
I understand your points and I am OK with them, but I was thinking about something mild, like a fast path for aggregation in a grouped data frame, rather than dealing with (or modifying) the larger ecosystem of Julia.
You seem to assume that operations on data frame columns can be faster than on general `AbstractArray` or `AbstractVector` objects, but that's definitely not the case. And `sum(skipmissing(x))` is already very fast, maybe even as fast as it could possibly be even when writing an optimized implementation manually (`mean` is slower, but I already developed this above).
> I understand your points and I am OK with them, but I was thinking about something like a fast path for aggregation in a grouped data frame, rather than dealing with (or modifying) the larger ecosystem of Julia.
I don't understand what this means.
> I don't understand what this means.
Sorry for the confusion, let me elaborate. In DataFrames, when an aggregation (like sum, mean, ...) is done for a GroupedDataFrame, a customised approach is used to speed things up.
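For reference, this is the code path being referred to (standard DataFrames.jl API; the optimized path applies when the reduction is passed by name, while an anonymous `x -> sum(x)` falls back to the generic per-group call):

```julia
using DataFrames, Statistics

df = DataFrame(g = [1, 1, 2, 2], x = [1.0, missing, 3.0, 4.0])
gdf = groupby(df, :g)

# `:x => sum` can use the optimized aggregation path for known reductions.
# Group 1 contains a missing value, so its total propagates to missing;
# group 2 sums to 7.0.
combine(gdf, :x => sum => :total)
```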
> ... And `sum(skipmissing(x))` is already very fast, maybe even as fast as it could possibly be even when writing an optimized implementation manually (`mean` is slower, but I already developed this above).
```julia
function sim_sum(x::Vector{Union{T, Missing}}) where T
    all(ismissing, x) && return missing
    _dmiss(y) = ismissing(y) ? zero(T) : y
    return mapreduce(_dmiss, Base.add_sum, x)
end

x = rand(10^6)
x = allowmissing(x)
x[rand(1:length(x), 1000)] .= missing

@btime sum(skipmissing($x))
#  880.731 μs (5 allocations: 80 bytes)
#  499297.9721823642

@btime sim_sum($x)
#  328.393 μs (0 allocations: 0 bytes)
#  499297.9721823642
```
Yes, but this means that we should improve `sum(skipmissing(x))` in Julia Base. Thank you for pointing this out.
Interesting. This doesn't happen with integers, only with floats. But this is more tricky than it seems due to the requirement to handle `-0.0`. See https://github.com/JuliaLang/julia/issues/40599.
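A minimal illustration of the `-0.0` subtlety (my example; it shows why a zero-substitution like `sim_sum` above is not a drop-in replacement for `sum(skipmissing(x))`):

```julia
x = Union{Float64, Missing}[-0.0, missing]

sum(skipmissing(x))                           # -0.0: sign of zero preserved
mapreduce(y -> ismissing(y) ? 0.0 : y, +, x)  #  0.0: sign lost by adding +0.0
```

An optimized implementation has to preserve IEEE semantics like the sign of zero, which rules out the naive "replace missing with zero" accumulation.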
> Sorry for the confusion, let me elaborate. In DataFrames, when an aggregation (like sum, mean, ...) is done for a GroupedDataFrame, a customised approach is used to speed things up.
A special implementation is used because no equivalent operation exists in Base. That's not the case for the things that `describe` does.