arrow-julia icon indicating copy to clipboard operation
arrow-julia copied to clipboard

Severe negative scaling when running Julia threaded.

Open KristofferC opened this issue 1 year ago • 0 comments

Running the following code which generates some data and then reads it via Arrow.Table shows a very bad slow down when using threads:

using DataFrames, Dates, Arrow, StatsBase, Random, InlineStrings

function generate_data(f)
    number_of_companies = 10000
    dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
    companyid = sample(100000:1000000, number_of_companies, replace = false)

    number_of_items = length(companyid)*length(dates)

    df = DataFrame(
            dates = repeat(dates, outer = number_of_companies),
            companyid = repeat(companyid, inner = length(dates)),
            item1 = rand(number_of_items),
            item2 = randn(number_of_items),
            item3 = rand(1:1000,number_of_items),
            item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],length(dates))
        )

    @info "Saving to $f"
    open(f, "w") do f
        Arrow.write(f, Tables.partitioner(groupby(df,:dates)))
    end
end

f = "mytestdata.arrow"

if !isfile(f)
    generate_data(f)
end

Arrow.Table(f)
@time Arrow.Table(f)

Results:

❯julia arrowthreads.jl
  0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time)

❯ julia  --project --threads=3 arrowthreads.jl
  6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time)

We can see that Arrow.Table spawns a task here https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528 and from profiling we are spending almost all time waiting on the lock in https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108.

KristofferC avatar Oct 14 '24 13:10 KristofferC