PooledArrays.jl
PooledArrays.jl copied to clipboard
Always keep missing in pool
This is WIP to show the changes so that pools can be shared between PooledArrays allowing and disallowing missing values. It seems we can have them without much complications.
Please comment if we would be OK with this design. Then I will finalize if it is OK. In general this will have some small performance penalty (the usual cost of allowing Missing in eltype).
Hmmm, this approach may complicate construction a bit; for example:
function _invert(d::Dict{K,V}) where {K,V}
d1 = Vector{K}(undef, length(d))
for (k, v) in d
d1[v] = k
end
return d1
end
it doesn't seem like we can rely on length(d) here reliably. i.e. they may pass in d that doesn't have missing in it. Or their ref codes might start at 1 even though that's reserved.
You mean when one would use PooledArray{T,R,N,RA}(rs::RefArray{RA}, invpool::Dict{T, R}) constructor?
I always thought it was kind of reserved for internal use only
Well, internal use and CSV.read 😛 . I think there may be other cases, like data ingestion, where this should be considered: i.e. the invpool is manually accumulated with ref codes, then at the end PooledArray is officially constructed. But that may still be unique to CSV.jl where performance is so critical; i.e. it would be too costly to rely on PooledArray setindex! in hot parsing loop.
I can ensure CSV.read does the right thing, so maybe not that big of a concern.
Ok, I'm finishing up the CSV refactor and I think the approach in this PR would work well for how things look on the CSV.jl side. That is, it's nice to have missing reserved as ref 1 as it simplifies a lot of column promotion stuff. It's also nice to just allocate the refpool as Dict{Union{String, Missing}, UInt32} and not have to try and convert it to non-missing at the end if no missings were parsed.
But is this PR and release of PooledArrays.jl then required for your changes, or it is OK to do it later? (I am asking, because @nalimilan was not convinced which approach would be best)
@quinnj - given you work with this branch I will finalize this PR so that it is not DRAFT. However, I still do not assume we would merge it, as we need to make sure this is a sound design (and I understand the reservations that @nalimilan has - in particular I will have to check if we do not have significant performance regressions anywhere).
OK. There are two issues with the approach proposed by me in this PR:
maponPooledArrayis problematic (this is fixable probably)- in DataFrames.jl if
PooledArraydoes not allowMissing(but we "technically" allow it) thengroupbywill be ~3 times slower, as we have to recompute the groupings (we need to drop the unused level that is occupied bymissingthat cannot be used)
Ok, and the alternative approach would be to have ref of 0 be treated as missing, like CategoricalArrays. Or perhaps some other creative approach like having a "global" pool that all PooledArrays could share or something like that.
The alternatives are (and I comment on the consequences):
- Use a fast copy to allow
Missing- this is what @nalimilan preferred: it will then introduce some cost of copying - Use
0to signal missing like inCategoricalArray: this has two consequences: a) more complex code (but we know how to do it - it is already done in CategoricalArrays.jl), b) the#undefinconsistency between bits and non bits type - Use global pool: actually this approach has the same problems that we have here, i.e.:
a. it will be hard to have
mapimplemented correctly b.groupbywill be slow (here we have a tension betweenjoinandgroupbyspeed - @nalimilan maybe we can think of a better way to drop unused levels ingroupby?)
@nalimilan maybe we can think of a better way to drop unused levels in
groupby?)
If missing is in the first position and it's the only unused level, then I guess just subtracting 1 to group indices would be enough. But that requires a special path in DataFrames, and overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.
then I guess just subtracting 1 to group indices would be enough.
Yes - I have thought about it but it would require that in DataFrames.jl we:
- check the version of PooledArrays.jl used
- have a two special paths that are PooledArrays.jl specific
That is why I would be hesitant to do it
overall I'm still worried that this PR may introduce slowdowns and/or increase complexity in many places.
Unfortunately this is my general conclusion also (and that we should just use a faster way to create a new refpool/invrefpool). @quinnj - would you be OK in CSV.jl with not having this change?
Not related to this particular user case, but I'd love it to see missing values encoded by a reference of zero. It makes it easier to obtain efficient code since pool elements can then share the same concrete type (I guess that's also why SentinelArrays was developed no?). And it is consistent with CategoricalArrays and the ref array used for grouping operations in DataFrames.
This is quite a long discussion, apologies for not having read all but Bogumil pointed me here from a Slack thread where a user reported a performance hit in outerjoin from pooling which is probably related:
https://discourse.julialang.org/t/performance-report-effect-of-reading-csv-file-on-mergeing-two-dataframes/106304/9?u=nilshg
Here on a relatively small data set (~60k rows) with 3 pooled columns (String1, String7, String15) we find a 3x regression when outerjoining.
Yes, there are two issues as I explain in https://bkamins.github.io/julialang/2023/11/17/pooling.html:
- pooling a value that already has a small memory footprint (like
String1) is inefficient; - doing
outerjoinchangespoolandinvpoolas it requires allowingmissingin them, which requires their re-creation.