PooledArrays.jl icon indicating copy to clipboard operation
PooledArrays.jl copied to clipboard

implement reference offset

Open ExpandingMan opened this issue 4 years ago • 4 comments

A number of binary formats contain dictionary encoded data with the references as 0-indexed integers. PooledArrays currently can't be used to wrap these because, if I understand correctly, the references are always 1-based. This means that to deserialize a format using PooledArrays you necessarily have to copy all the references.

I suggest we add an offset::Int field so this can be handled more generally. This would be added getindex. The most obvious difficulty with implementing this is that currently zeros give undefined values. Perhaps this can be circumvented at compile time with a new type parameter.

Anyway, before I try to implement this, has anyone given in consideration? How is arrow (which I seem to remember is 0-based) deal with this?

ExpandingMan avatar Sep 05 '21 22:09 ExpandingMan

See https://github.com/JuliaData/PooledArrays.jl/pull/65 for a recent discussion. We should resolve the encoding of missing in coordination with your request. The problem is that we also need to handle the cases where unsigned integers are used for storing references (so then 0 is an absolute minimum we can use).

Having said that, at least DataAPI.jl and in particular DataFrames.jl are designed in a way that does not assume that references start at 1 (but PooledArrays.jl currently assumes this AFAICT so making this change should be done very carefully).

bkamins avatar Sep 06 '21 08:09 bkamins

Cc: @quinnj about Arrow.

nalimilan avatar Sep 11 '21 14:09 nalimilan

As I had worked a bit more on this, I realized the problem was a bit deeper. Really, to guarantee no copying we need a generic type that can use arbitrary collections for pool and references. Here is a minimal example I implemented for my parquet project:

struct PooledVector{𝒯,ℛ<:AbstractVector{<:Integer},𝒱} <: AbstractVector{𝒯}
    pool::𝒱
    refs::ℛ
end

PooledVector{𝒯}(vs, rs) where {𝒯} = PooledVector{𝒯,typeof(rs),typeof(vs)}(vs, rs)     
PooledVector(vs, rs) = PooledVector{eltype(vs)}(vs, rs)

Base.size(v::PooledVector) = size(v.refs)

Base.IndexStyle(::Type{<:PooledVector}) = IndexLinear()

DataAPI.refarray(v::PooledVector) = v.refs
DataAPI.refpool(v::PooledVector) = v.pool
DataAPI.levels(v::PooledVector) = v.pool

Base.@propagate_inbounds function Base.getindex(v::PooledVector, i::Int)
    @boundscheck checkbounds(v, i)
    @inbounds v.pool[v.refs[i]+1]
end

(perhaps ironically, I did not bother to include an arbitrary offset). In the case of lazily loaded parquet files, it is not necessarily guaranteed that the pool and refs are of a particular type. An even more general version could included an arbitrary transformation from reference to index, though perhaps that's a bit excessive.

ExpandingMan avatar Sep 11 '21 15:09 ExpandingMan

For reference, Arrow.jl has the Arrow.DictEncoded custom "pooled array" type, implemented here. It allows zero-copy view into the arrow data. When you copy one, we convert to PooledArray.

quinnj avatar Sep 16 '21 05:09 quinnj