arrow-julia icon indicating copy to clipboard operation
arrow-julia copied to clipboard

Performance regression caused by Arrow type piracy

Open NHDaly opened this issue 2 years ago • 5 comments

Hi! :) 👋

We observed a mysterious performance regression in some benchmarks when we started using Arrow in our codebase, and we have narrowed it down to type piracy on getproperty of Types.

Here's an MRE:

julia> using BenchmarkTools

julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=fieldcount);
  49.939 μs (103 allocations: 12.33 KiB)

julia> using Arrow

julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=fieldcount);
  687.034 μs (103 allocations: 12.33 KiB)

The regression seems to be that calling getproperty on types is now a dynamic dispatch, whereas it was not previously. I believe this is caused by the following method, which prevents julia from devirtualizing the getproperty calls elsewhere: https://github.com/JuliaData/Arrow.jl/blob/8d486c05cf441e78b4764f8a67b9ca903e88dafb/src/FlatBuffers/FlatBuffers.jl#L138

You can see that effect here:

julia> methods(getproperty, (Type, Symbol))
# 1 method for generic function "getproperty":
[1] getproperty(x::Type, f::Symbol) in Base at Base.jl:28

julia> using Arrow

julia> methods(getproperty, (Type, Symbol))
# 11 methods for generic function "getproperty":
[1] getproperty(::Type{Arrow.Flatbuf.DateUnitModule.DateUnit}, sym::Symbol) in Arrow.Flatbuf.DateUnitModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl
:138
[2] getproperty(::Type{Arrow.Flatbuf.BodyCompressionMethodModule.BodyCompressionMethod}, sym::Symbol) in Arrow.Flatbuf.BodyCompressionMethodModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[3] getproperty(::Type{Arrow.Flatbuf.CompressionTypeModule.CompressionType}, sym::Symbol) in Arrow.Flatbuf.CompressionTypeModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[4] getproperty(::Type{Arrow.Flatbuf.IntervalUnitModule.IntervalUnit}, sym::Symbol) in Arrow.Flatbuf.IntervalUnitModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[5] getproperty(::Type{Arrow.Flatbuf.DictionaryKindModule.DictionaryKind}, sym::Symbol) in Arrow.Flatbuf.DictionaryKindModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[6] getproperty(::Type{Arrow.Flatbuf.UnionModeModule.UnionMode}, sym::Symbol) in Arrow.Flatbuf.UnionModeModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[7] getproperty(::Type{Arrow.Flatbuf.MetadataVersionModule.MetadataVersion}, sym::Symbol) in Arrow.Flatbuf.MetadataVersionModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[8] getproperty(::Type{Arrow.Flatbuf.EndiannessModule.Endianness}, sym::Symbol) in Arrow.Flatbuf.EndiannessModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[9] getproperty(::Type{Arrow.Flatbuf.PrecisionModule.Precision}, sym::Symbol) in Arrow.Flatbuf.PrecisionModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[10] getproperty(::Type{Arrow.Flatbuf.TimeUnitModule.TimeUnit}, sym::Symbol) in Arrow.Flatbuf.TimeUnitModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[11] getproperty(x::Type, f::Symbol) in Base at Base.jl:28

I guess this is actually Type Piracy, even though it might not at first seem like it, because the Type type isn't owned by Arrow, even if those parameterizations of Type are unique to this package. 😢 I ran into this same thing in a previous github issue, where Jameson pointed out that you're not supposed to pirate methods on Type, which surprised me too 🤷 but i think it makes sense now: https://github.com/JuliaLang/julia/issues/39377#issuecomment-980493454


For reference, here's one of our functions where we noticed the regression:

# datatype-datatype comparisons
function type_cmp(@nospecialize(ta::DataType), @nospecialize(tb::DataType))
    name_cmp = Base.cmp(ta.name.name, tb.name.name)
    iszero(name_cmp) || return name_cmp
    
    ta_parameters = ta.parameters::Core.SimpleVector
    tb_parameters = tb.parameters::Core.SimpleVector
    n_ta_parameters = length(ta_parameters)
    n_tb_parameters = length(tb_parameters)

    cmp_n_parameters = Base.cmp(n_ta_parameters, n_tb_parameters)
    iszero(cmp_n_parameters) || return(cmp_n_parameters)
    
    for k in 1:n_ta_parameters
        cmp_parameter_pair = type_cmp(ta_parameters[k], tb_parameters[k])
        iszero(cmp_parameter_pair) || return cmp_parameter_pair
    end
    return 0
end

And you can see that one with this small MRE:

julia> using BenchmarkTools

julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=nameof);
  8.677 μs (103 allocations: 12.33 KiB)

julia> using Arrow

julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=nameof);
  359.624 μs (103 allocations: 12.33 KiB)

NHDaly avatar Dec 16 '21 21:12 NHDaly

(This came up when trying to add Arrow to our RelationalAI system, for serializing our results to Arrow in our client/server protocol. We're excited about it! 😁 Thanks for the package! Hopefully we can help with it over time as we find things like this ❤️)

NHDaly avatar Dec 16 '21 22:12 NHDaly

I'm reading up on the thread where this @scopedenum macro comes from: https://discourse.julialang.org/t/encapsulating-enum-access-via-dot-syntax/11785/10

Interesting solution to that problem! :)

It looks like @vtjnash identified exactly the same comment that we discovered here too: https://discourse.julialang.org/t/encapsulating-enum-access-via-dot-syntax/11785/12

Please consider this a request to redo this feature without pirating getproperty(::Type) 😊

NHDaly avatar Dec 17 '21 16:12 NHDaly

Might make sense to extract scopedenum to a package if you're using it in a few different packages now?

NHDaly avatar Dec 17 '21 16:12 NHDaly

It seems to me that since this macro is only used internally inside the package, and isn't meant to be exposed to users, would it be fine to rewrite this with something simpler, like the baremodule approach?:

baremodule Fruits
using Base: @enum
@enum Fruit Apple Pear Banana
end

julia> Fruits.#<tab><tab>
Apple  Banana  Fruit   Pear

https://discourse.julialang.org/t/encapsulating-enum-access-via-dot-syntax/11785/2

sorry to be a bit of a wet blanket here 😊

NHDaly avatar Dec 17 '21 16:12 NHDaly

Maybe this code can be further simplified to reuse more of the Julia Base Enums module as follow?

macro scopedenum(T::Union{Symbol,Expr}, syms...)
    # extract typename and basetype from T
    typename, basetype = if T isa Expr && T.head === :(::) && length(T.args) == 2 && isa(T.args[1], Symbol)
        T.args
    else
        T, nothing
    end
    enumname = Symbol(typename, "Enum")
    T = isnothing(basetype) ? enumname : enumname::basetype
    blk = quote
        @eval module $typename
        export $enumname
        @enum $T $(syms...)
        end # module
    end
    push!(blk.args, :nothing)
    blk.head = :toplevel
    push!(blk.args, :(using .$typename))
    blk
end

baumgold avatar Dec 20 '21 13:12 baumgold

I think this was fixed by https://github.com/apache/arrow-julia/pull/267

ericphanson avatar Apr 09 '23 12:04 ericphanson