arrow-julia
arrow-julia copied to clipboard
Performance regression caused by Arrow type piracy
Hi! :) 👋
We observed a mysterious performance regression in some benchmarks when we started using Arrow
in our codebase, and we have narrowed it down to type piracy on getproperty
of Types.
Here's an MRE:
julia> using BenchmarkTools
julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=fieldcount);
49.939 μs (103 allocations: 12.33 KiB)
julia> using Arrow
julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=fieldcount);
687.034 μs (103 allocations: 12.33 KiB)
The regression seems to be that calling getproperty
on types is now a dynamic dispatch, whereas it was not previously. I believe this is caused by the following method, which prevents julia from devirtualizing the getproperty calls elsewhere:
https://github.com/JuliaData/Arrow.jl/blob/8d486c05cf441e78b4764f8a67b9ca903e88dafb/src/FlatBuffers/FlatBuffers.jl#L138
You can see that effect here:
julia> methods(getproperty, (Type, Symbol))
# 1 method for generic function "getproperty":
[1] getproperty(x::Type, f::Symbol) in Base at Base.jl:28
julia> using Arrow
julia> methods(getproperty, (Type, Symbol))
# 11 methods for generic function "getproperty":
[1] getproperty(::Type{Arrow.Flatbuf.DateUnitModule.DateUnit}, sym::Symbol) in Arrow.Flatbuf.DateUnitModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl
:138
[2] getproperty(::Type{Arrow.Flatbuf.BodyCompressionMethodModule.BodyCompressionMethod}, sym::Symbol) in Arrow.Flatbuf.BodyCompressionMethodModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[3] getproperty(::Type{Arrow.Flatbuf.CompressionTypeModule.CompressionType}, sym::Symbol) in Arrow.Flatbuf.CompressionTypeModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[4] getproperty(::Type{Arrow.Flatbuf.IntervalUnitModule.IntervalUnit}, sym::Symbol) in Arrow.Flatbuf.IntervalUnitModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[5] getproperty(::Type{Arrow.Flatbuf.DictionaryKindModule.DictionaryKind}, sym::Symbol) in Arrow.Flatbuf.DictionaryKindModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[6] getproperty(::Type{Arrow.Flatbuf.UnionModeModule.UnionMode}, sym::Symbol) in Arrow.Flatbuf.UnionModeModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[7] getproperty(::Type{Arrow.Flatbuf.MetadataVersionModule.MetadataVersion}, sym::Symbol) in Arrow.Flatbuf.MetadataVersionModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[8] getproperty(::Type{Arrow.Flatbuf.EndiannessModule.Endianness}, sym::Symbol) in Arrow.Flatbuf.EndiannessModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[9] getproperty(::Type{Arrow.Flatbuf.PrecisionModule.Precision}, sym::Symbol) in Arrow.Flatbuf.PrecisionModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[10] getproperty(::Type{Arrow.Flatbuf.TimeUnitModule.TimeUnit}, sym::Symbol) in Arrow.Flatbuf.TimeUnitModule at /Users/nathandaly/.julia/dev/Arrow/src/FlatBuffers/FlatBuffers.jl:138
[11] getproperty(x::Type, f::Symbol) in Base at Base.jl:28
I guess this is actually Type Piracy, even though it might not at first seem like it, because the Type
type isn't owned by Arrow, even if those parameterizations of Type are unique to this package. 😢 I ran into this same thing in a previous github issue, where Jameson pointed out that you're not supposed to pirate methods on Type
, which surprised me too 🤷 but i think it makes sense now:
https://github.com/JuliaLang/julia/issues/39377#issuecomment-980493454
For reference, here's one of our functions where we noticed the regression:
# datatype-datatype comparisons
function type_cmp(@nospecialize(ta::DataType), @nospecialize(tb::DataType))
name_cmp = Base.cmp(ta.name.name, tb.name.name)
iszero(name_cmp) || return name_cmp
ta_parameters = ta.parameters::Core.SimpleVector
tb_parameters = tb.parameters::Core.SimpleVector
n_ta_parameters = length(ta_parameters)
n_tb_parameters = length(tb_parameters)
cmp_n_parameters = Base.cmp(n_ta_parameters, n_tb_parameters)
iszero(cmp_n_parameters) || return(cmp_n_parameters)
for k in 1:n_ta_parameters
cmp_parameter_pair = type_cmp(ta_parameters[k], tb_parameters[k])
iszero(cmp_parameter_pair) || return cmp_parameter_pair
end
return 0
end
And you can see that one with this small MRE:
julia> using BenchmarkTools
julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=nameof);
8.677 μs (103 allocations: 12.33 KiB)
julia> using Arrow
julia> @btime sort!([rand([Int,Float64,Int8,Int32]) for _ in 1:100], by=nameof);
359.624 μs (103 allocations: 12.33 KiB)
(This came up when trying to add Arrow to our RelationalAI system, for serializing our results to Arrow in our client/server protocol. We're excited about it! 😁 Thanks for the package! Hopefully we can help with it over time as we find things like this ❤️)
I'm reading up on the thread where this @scopedenum
macro comes from:
https://discourse.julialang.org/t/encapsulating-enum-access-via-dot-syntax/11785/10
Interesting solution to that problem! :)
It looks like @vtjnash identified exactly the same comment that we discovered here too: https://discourse.julialang.org/t/encapsulating-enum-access-via-dot-syntax/11785/12
Please consider this a request to redo this feature without pirating getproperty(::Type)
😊
Might make sense to extract scopedenum to a package if you're using it in a few different packages now?
It seems to me that since this macro is only used internally inside the package, and isn't meant to be exposed to users, would it be fine to rewrite this with something simpler, like the baremodule
approach?:
baremodule Fruits
using Base: @enum
@enum Fruit Apple Pear Banana
end
julia> Fruits.#<tab><tab>
Apple Banana Fruit Pear
https://discourse.julialang.org/t/encapsulating-enum-access-via-dot-syntax/11785/2
sorry to be a bit of a wet blanket here 😊
Maybe this code can be further simplified to reuse more of the Julia Base Enums module as follow?
macro scopedenum(T::Union{Symbol,Expr}, syms...)
# extract typename and basetype from T
typename, basetype = if T isa Expr && T.head === :(::) && length(T.args) == 2 && isa(T.args[1], Symbol)
T.args
else
T, nothing
end
enumname = Symbol(typename, "Enum")
T = isnothing(basetype) ? enumname : enumname::basetype
blk = quote
@eval module $typename
export $enumname
@enum $T $(syms...)
end # module
end
push!(blk.args, :nothing)
blk.head = :toplevel
push!(blk.args, :(using .$typename))
blk
end
I think this was fixed by https://github.com/apache/arrow-julia/pull/267