PooledArrays.jl icon indicating copy to clipboard operation
PooledArrays.jl copied to clipboard

Always keep missing in pool

Open bkamins opened this issue 4 years ago • 14 comments

This is WIP to show the changes so that pools can be shared between PooledArrays allowing and disallowing missing values. It seems we can have them without much complications.

Please comment if we would be OK with this design. Then I will finalize if it is OK. In general this will have some small performance penalty (the usual cost of allowing Missing in eltype).

bkamins avatar Apr 30 '21 07:04 bkamins

Hmmm, this approach may complicate construction a bit; for example:

function _invert(d::Dict{K,V}) where {K,V}
    d1 = Vector{K}(undef, length(d))
    for (k, v) in d
        d1[v] = k
    end
    return d1
end

it doesn't seem like we can rely on length(d) here reliably. i.e. they may pass in d that doesn't have missing in it. Or their ref codes might start at 1 even though that's reserved.

quinnj avatar May 13 '21 18:05 quinnj

You mean when one would use PooledArray{T,R,N,RA}(rs::RefArray{RA}, invpool::Dict{T, R}) constructor?

I always thought it was kind of reserved for internal use only

bkamins avatar May 13 '21 18:05 bkamins

Well, internal use and CSV.read 😛 . I think there may be other cases, like data ingestion, where this should be considered: i.e. the invpool is manually accumulated with ref codes, then at the end PooledArray is officially constructed. But that may still be unique to CSV.jl where performance is so critical; i.e. it would be too costly to rely on PooledArray setindex! in hot parsing loop.

I can ensure CSV.read does the right thing, so maybe not that big of a concern.

quinnj avatar May 13 '21 18:05 quinnj

Ok, I'm finishing up the CSV refactor and I think the approach in this PR would work well for how things look on the CSV.jl side. That is, it's nice to have missing reserved as ref 1 as it simplifies a lot of column promotion stuff. It's also nice to just allocate the refpool as Dict{Union{String, Missing}, UInt32} and not have to try and convert it to non-missing at the end if no missings were parsed.

quinnj avatar May 20 '21 20:05 quinnj

But is this PR and release of PooledArrays.jl then required for your changes, or it is OK to do it later? (I am asking, because @nalimilan was not convinced which approach would be best)

bkamins avatar May 20 '21 21:05 bkamins

@quinnj - given you work with this branch I will finalize this PR so that it is not DRAFT. However, I still do not assume we would merge it, as we need to make sure this is a sound design (and I understand the reservations that @nalimilan has - in particular I will have to check if we do not have significant performance regressions anywhere).

bkamins avatar May 22 '21 21:05 bkamins

OK. There are two issues with the approach proposed by me in this PR:

  1. map on PooledArray is problematic (this is fixable probably)
  2. in DataFrames.jl if PooledArray does not allow Missing (but we "technically" allow it) then groupby will be ~3 times slower, as we have to recompute the groupings (we need to drop the unused level that is occupied by missing that cannot be used)

bkamins avatar May 22 '21 22:05 bkamins

Ok, and the alternative approach would be to have ref of 0 be treated as missing, like CategoricalArrays. Or perhaps some other creative approach like having a "global" pool that all PooledArrays could share or something like that.

quinnj avatar May 23 '21 03:05 quinnj

The alternatives are (and I comment on the consequences):

  1. Use a fast copy to allow Missing - this is what @nalimilan preferred: it will then introduce some cost of copying
  2. Use 0 to signal missing like in CategoricalArray: this has two consequences: a) more complex code (but we know how to do it - it is already done in CategoricalArrays.jl), b) the #undef inconsistency between bits and non bits type
  3. Use global pool: actually this approach has the same problems that we have here, i.e.: a. it will be hard to have map implemented correctly b. groupby will be slow (here we have a tension between join and groupby speed - @nalimilan maybe we can think of a better way to drop unused levels in groupby?)

bkamins avatar May 23 '21 07:05 bkamins

@nalimilan maybe we can think of a better way to drop unused levels in groupby?)

If missing is in the first position and it's the only unused level, then I guess just subtracting 1 to group indices would be enough. But that requires a special path in DataFrames, and overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.

nalimilan avatar May 23 '21 20:05 nalimilan

then I guess just subtracting 1 to group indices would be enough.

Yes - I have thought about it but it would require that in DataFrames.jl we:

  1. check the version of PooledArrays.jl used
  2. have a two special paths that are PooledArrays.jl specific

That is why I would be hesitant to do it

overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.

Unfortunately this is my general conclusion also (and that we should just use a faster way to create a new refpool/invrefpool). @quinnj - would you be OK in CSV.jl with not having this change?

bkamins avatar May 23 '21 21:05 bkamins

Not related to this particular user case, but I'd love it to see missing values encoded by a reference of zero. It makes it easier to obtain efficient code since pool elements can then share the same concrete type (I guess that's also why SentinelArrays was developed no?). And it is consistent with CategoricalArrays and the ref array used for grouping operations in DataFrames.

matthieugomez avatar Sep 01 '21 14:09 matthieugomez

This is quite a long discussion, apologies for not having read all but Bogumil pointed me here from a Slack thread where a user reported a performance hit in outerjoin from pooling which is probably related:

https://discourse.julialang.org/t/performance-report-effect-of-reading-csv-file-on-mergeing-two-dataframes/106304/9?u=nilshg

Here on a relatively small data set (~60k rows) with 3 pooled columns (String1, String7, String15) we find a 3x regression when outerjoining.

nilshg avatar Nov 17 '23 08:11 nilshg

Yes, there are two issues as I explain in https://bkamins.github.io/julialang/2023/11/17/pooling.html:

  • pooling a value that already has a small memory footprint (like String1) is inefficient;
  • doing outerjoin changes pool and invpool as it requires allowing missing in them, which requires their re-creation.

bkamins avatar Nov 17 '23 09:11 bkamins