PooledArrays.jl
PooledArrays.jl copied to clipboard
implement reference offset
A number of binary formats contain dictionary encoded data with the references as 0-indexed integers. PooledArrays currently can't be used to wrap these because, if I understand correctly, the references are always 1-based. This means that to deserialize a format using PooledArrays you necessarily have to copy all the references.
I suggest we add an offset::Int field so this can be handled more generally. This would be added getindex. The most obvious difficulty with implementing this is that currently zeros give undefined values. Perhaps this can be circumvented at compile time with a new type parameter.
Anyway, before I try to implement this, has anyone given in consideration? How is arrow (which I seem to remember is 0-based) deal with this?
See https://github.com/JuliaData/PooledArrays.jl/pull/65 for a recent discussion.
We should resolve the encoding of missing in coordination with your request. The problem is that we also need to handle the cases where unsigned integers are used for storing references (so then 0 is an absolute minimum we can use).
Having said that, at least DataAPI.jl and in particular DataFrames.jl are designed in a way that does not assume that references start at 1 (but PooledArrays.jl currently assumes this AFAICT so making this change should be done very carefully).
Cc: @quinnj about Arrow.
As I had worked a bit more on this, I realized the problem was a bit deeper. Really, to guarantee no copying we need a generic type that can use arbitrary collections for pool and references. Here is a minimal example I implemented for my parquet project:
struct PooledVector{𝒯,ℛ<:AbstractVector{<:Integer},𝒱} <: AbstractVector{𝒯}
pool::𝒱
refs::ℛ
end
PooledVector{𝒯}(vs, rs) where {𝒯} = PooledVector{𝒯,typeof(rs),typeof(vs)}(vs, rs)
PooledVector(vs, rs) = PooledVector{eltype(vs)}(vs, rs)
Base.size(v::PooledVector) = size(v.refs)
Base.IndexStyle(::Type{<:PooledVector}) = IndexLinear()
DataAPI.refarray(v::PooledVector) = v.refs
DataAPI.refpool(v::PooledVector) = v.pool
DataAPI.levels(v::PooledVector) = v.pool
Base.@propagate_inbounds function Base.getindex(v::PooledVector, i::Int)
@boundscheck checkbounds(v, i)
@inbounds v.pool[v.refs[i]+1]
end
(perhaps ironically, I did not bother to include an arbitrary offset). In the case of lazily loaded parquet files, it is not necessarily guaranteed that the pool and refs are of a particular type. An even more general version could included an arbitrary transformation from reference to index, though perhaps that's a bit excessive.
For reference, Arrow.jl has the Arrow.DictEncoded custom "pooled array" type, implemented here. It allows zero-copy view into the arrow data. When you copy one, we convert to PooledArray.