arrow-julia
arrow-julia copied to clipboard
Severe negative scaling when running Julia threaded.
Running the following code which generates some data and then reads it via Arrow.Table shows a very bad slow down when using threads:
using DataFrames, Dates, Arrow, StatsBase, Random, InlineStrings
function generate_data(f)
number_of_companies = 10000
dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
companyid = sample(100000:1000000, number_of_companies, replace = false)
number_of_items = length(companyid)*length(dates)
df = DataFrame(
dates = repeat(dates, outer = number_of_companies),
companyid = repeat(companyid, inner = length(dates)),
item1 = rand(number_of_items),
item2 = randn(number_of_items),
item3 = rand(1:1000,number_of_items),
item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],length(dates))
)
@info "Saving to $f"
open(f, "w") do f
Arrow.write(f, Tables.partitioner(groupby(df,:dates)))
end
end
f = "mytestdata.arrow"
if !isfile(f)
generate_data(f)
end
Arrow.Table(f)
@time Arrow.Table(f)
Results:
❯julia arrowthreads.jl
0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time)
❯ julia --project --threads=3 arrowthreads.jl
6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time)
We can see that Arrow.Table spawns a task here https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528 and from profiling we are spending almost all time waiting on the lock in https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108.