JuliaDB.jl

table creation extreme slowdown

pdimens opened this issue 4 years ago • 12 comments

I've run into a strange (and tremendous) performance hit with JuliaDB/IndexedTables this past week. Previously, with a custom reading function using JuliaDB, CSV, and CategoricalArrays, it took ~1.6 sec to read in a Genepop file. A few days ago I started working on the package again after updating the deps, and that same file reader now clocks in at 120 seconds. I've been trying to figure out where this ~75x performance drop came from, and while my diagnostic skills are limited, I've come to a few conclusions. The original benchmark:

julia> @btime x = genepop(infile, silent = true) 
 1.445 s (9342559 allocations: 386.67 MiB)

Here is the table in question, along with a benchmark of the function that produces it as a DataFrame:

julia> @btime a = genepop(infile, silent = true)
  2.102 s (14998451 allocations: 692.77 MiB)
469156×4 DataFrame
│ Row    │ name    │ population │ variable     │ value   │
│        │ Cat…    │ Cat…       │ Categorical… │ Tuple…? │
├────────┼─────────┼────────────┼──────────────┼─────────┤
│ 1      │ cc_001  │ 1          │ contig_35208 │ (1, 2)  │
│ 2      │ cc_002  │ 1          │ contig_35208 │ (1, 2)  │
│ 3      │ cc_003  │ 1          │ contig_35208 │ (1, 1)  │
⋮
│ 469153 │ seg_028 │ 7          │ contig_2784  │ missing │
│ 469154 │ seg_029 │ 7          │ contig_2784  │ (1, 1)  │
│ 469155 │ seg_030 │ 7          │ contig_2784  │ (1, 1)  │
│ 469156 │ seg_031 │ 7          │ contig_2784  │ (1, 1)  │

1. When calling table from JuliaDB/IndexedTables to convert a long-format DataFrame into an IndexedTable, the conversion is a nearly instantaneous process, even when the DataFrame is 4 columns x ~500,000 rows:

# DataFrame => table
julia> @btime a = genepop(infile, silent = true)
  2.287 s (14998597 allocations: 696.36 MiB)
Table with 469156 rows, 4 columns:
name       population  variable        value
─────────────────────────────────────────────
"cc_001"   "1"         "contig_35208"  (1, 2)
"cc_001"   "1"         "contig_23109"  (1, 1)
"cc_001"   "1"         "contig_4493"   (1, 2)
"cc_001"   "1"         "contig_10742"  (1, 1)
"cc_001"   "1"         "contig_14898"  (1, 2)
⋮
"seg_031"  "7"         "contig_19384"  (2, 2)
"seg_031"  "7"         "contig_22368"  (1, 1)
"seg_031"  "7"         "contig_2784"   (1, 1)
2. When calling table to create the IndexedTable by referencing specific columns of the DataFrame (i.e. by passing vectors, as in table(name = df.name, population = df.population, ...)), that's when you see the shift from ~1.6 sec to ~105 sec: DataFrame columns => table
julia> @btime a = genepop(infile, silent = true)
  105.204 s (16408556 allocations: 744.88 MiB)
Table with 469156 rows, 4 columns:
name       population  locus           genotype
───────────────────────────────────────────────
"cc_001"   "1"         "contig_35208"  (1, 2)
"cc_001"   "1"         "contig_23109"  (1, 1)
"cc_001"   "1"         "contig_4493"   (1, 2)
"cc_001"   "1"         "contig_10742"  (1, 1)
"cc_001"   "1"         "contig_14898"  (1, 2)
⋮
"seg_031"  "7"         "contig_19384"  (2, 2)
"seg_031"  "7"         "contig_22368"  (1, 1)
"seg_031"  "7"         "contig_2784"   (1, 1)
3. Interestingly, if you don't make the columns Categorical, performance improves to ~68 sec:
julia> @btime a = genepop(infile, silent = true)
  68.852 s (14530942 allocations: 693.71 MiB)
Table with 469156 rows, 4 columns:
name       population  locus           genotype
───────────────────────────────────────────────
"cc_001"   "1"         "contig_35208"  (1, 2)
"cc_001"   "1"         "contig_23109"  (1, 1)
"cc_001"   "1"         "contig_4493"   (1, 2)
"cc_001"   "1"         "contig_10742"  (1, 1)
"cc_001"   "1"         "contig_14898"  (1, 2)
⋮
"seg_031"  "7"         "contig_19384"  (2, 2)
"seg_031"  "7"         "contig_22368"  (1, 1)
"seg_031"  "7"         "contig_2784"   (1, 1)

However, a few weeks ago all of this took ~1.6 sec and didn't require DataFrames as an intermediary. Using DataFrames to get things functional before creating an IndexedTable isn't a realistic long-term solution. Hopefully this information helps someone with a better understanding of JuliaDB identify what's going on. I can provide whatever code/profiles are necessary.

pdimens avatar May 12 '20 01:05 pdimens

Here is a profile/flamegraph produced by StatProfilerHTML.jl of just the IndexedTable creation using 4 vectors.

JuliaDB_slowdown_profile.zip

And a static image of it: [flamegraph screenshot]

pdimens avatar May 13 '20 16:05 pdimens

It may be that what changed is that StructArrays now, by default, collects categorical values into categorical arrays, and categorical arrays seem to incur some performance penalty upon collection. Does the time spent in issubset come from CategoricalArrays?

I definitely don't understand CategoricalArrays well enough, but performance seems fine, at least on the latest version. Can you produce a sample dataset (generated at random with some snippet) that iterates rows, where DataFrame(iter) (or Tables.columntable(iter)) is much faster than JuliaDB.table(iter) or StructArray(iter)?

As a stop-gap, you can always do JuliaDB.table(Tables.columntable(iter)) so that the collection is done by Tables (maybe that should become the default at some point).
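For instance, a minimal sketch of that stop-gap (hypothetical usage; df stands in for the long-format DataFrame from the benchmarks above):

using JuliaDB, Tables

# Let Tables.jl materialize the columns into a NamedTuple of vectors first,
# then build the IndexedTable from those columns directly.
t = JuliaDB.table(Tables.columntable(df); pkey = :name)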

piever avatar May 13 '20 17:05 piever

Thanks. Even without the Categorical stuff, the time benchmarks at ~60 sec, compared to the ~1 sec before. I'll have to work a bit to provide the other information, but here's some of the profile info for issubset:

23496 (96 %) samples spent in issubset
23496 (100 %) (incl.) when called from ⊈ line 341
39 (0 %) samples spent calling issubset
10 (0 %) samples spent calling issubset
23447 (100 %) samples spent calling issubset
Base.issubset(a::CategoricalPool, b::CategoricalPool) = issubset(a.levels, keys(b.invindex))
23447 (96 %) samples spent in issubset
33 (0 %) (ex.), 23447 (100 %) (incl.) when called from issubset line 183
487 (2 %) samples spent calling iterate
22927 (98 %) samples spent calling in        

    elt in r || return false

pdimens avatar May 13 '20 18:05 pdimens

Here is an MWE with some benchmarking. I should mention that JuliaDB is at version 0.13.0.

using JuliaDB, BenchmarkTools, CategoricalArrays

function tabletest()
    # 450,000 names: 150,000 each of "red", "blue", "green", in contiguous blocks
    names = fill.(["red","blue","green"], 150000) |> Base.Iterators.flatten |> collect
    # 150,000 unique locus names, each repeated 3 times
    loci = "loc".*string.(collect(1:150000))
    loci = fill.(loci, 3) |> Base.Iterators.flatten |> collect
    genotypes = fill((1,2), 450000)

    # plain Vector{String} columns
    return table((names = names, loci = loci, genotypes = genotypes), pkey = :names)
end

function tabletest_cat()
    # identical columns, but names and loci are wrapped as CategoricalArrays
    names = fill.(["red","blue","green"], 150000) |> Base.Iterators.flatten |> collect
    loci = "loc".*string.(collect(1:150000))
    loci = fill.(loci, 3) |> Base.Iterators.flatten |> collect
    genotypes = fill((1,2), 450000)

    return table(
            (names = categorical(names, compress = true),
            loci = categorical(loci, compress = true),
            genotypes = genotypes),
            pkey = :names
        )
end

julia> @benchmark tabletest()
BenchmarkTools.Trial: 
  memory estimate:  79.81 MiB
  allocs estimate:  900109
  --------------
  minimum time:     246.040 ms (7.45% GC)
  median time:      270.271 ms (12.39% GC)
  mean time:        280.085 ms (15.80% GC)
  maximum time:     436.054 ms (44.68% GC)
  --------------
  samples:          18
  evals/sample:     1

However, @benchmark tabletest_cat() has been running for over an hour and hasn't finished yet. Probably not a good sign.

pdimens avatar May 13 '20 20:05 pdimens

So, this seems to be the simplest way to reproduce:

julia> using CategoricalArrays, StructArrays

julia> N = 500_000;

julia> values = categorical(string.(1:N));

julia> sa = StructArray((values=values,));

julia> sa[1:N] # takes forever

I think the issue is that CategoricalArrays normally has a fast getindex method, which does not get called when the array is wrapped in a StructArray. Here it is particularly bad because it is a categorical array with 500_000 unique values, which I think is what is causing the slowdown in your code (https://github.com/JuliaData/WeakRefStrings.jl is probably a more sensible approach for that kind of string data).
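For contrast, a quick check (reusing the variables from the snippet above) shows that only the wrapped path degrades, since indexing the categorical vector directly hits CategoricalArrays' optimized method:

julia> @time values[1:N];   # direct CategoricalArray getindex: slices the refs, fast

julia> @time sa[1:N];       # StructArray fallback: element-by-element, pool checks dominate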

Maybe StructArray could overload getindex a bit earlier, to intercept the vectorized getindex call and forward it to the columns, but that also has drawbacks (the "vectorized to scalar" getindex conversion would happen for each column).

piever avatar May 13 '20 21:05 piever

Interesting. What changes could have been made in the last 2-3 weeks that would have caused this? This issue is only recent on my end.

pdimens avatar May 13 '20 22:05 pdimens

What changes could have been made in the last 2-3 weeks that would have caused this? This issue is only recent on my end.

No idea! Probably some change in either StructArrays or CategoricalArrays (I think CategoricalArrays released a new version recently).

piever avatar May 14 '20 11:05 piever

Would there be merit to me posting this there as well?

pdimens avatar May 14 '20 12:05 pdimens

It probably makes sense to file an issue on StructArrays to discuss non-scalar getindex.

piever avatar May 14 '20 12:05 piever

This is due to the fact that CategoricalArrays now merges pools from the source and destination CategoricalArrays/CategoricalValues in setindex!. This operation is slow, as it requires checking whether the levels of the source are a subset of the destination's. The solution to this is to keep a global table of subset relations between pools to avoid repeating the computation, but that requires some work (and it's not clear how to make it thread-safe).
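A minimal sketch of that mechanism (illustration only, with made-up sizes; behavior as of the CategoricalArrays version discussed here):

using CategoricalArrays

src = categorical(string.(1:100_000))   # a pool with 100,000 levels
dest = similar(src, 1_000)              # destination starts with an empty pool

# Each scalar setindex! with a value from another pool triggers the subset
# check over the levels, so cost scales with levels x assignments.
@time for i in 1:1_000
    dest[i] = src[i]
end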

Another solution which could work in the meantime and would be a good idea anyway is to have StructArrays call a higher-level function on the underlying arrays, like getindex with a vector of indices or copyto!. That way the optimized CategoricalArray methods will be used, and they are even faster than what we could achieve with a global table. That should also speed up other custom array types. Do you think that's possible @piever?
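For concreteness, a hypothetical helper along those lines (not actual StructArrays API; StructArrays.components is assumed here as the accessor for the field arrays — it is called fieldarrays in releases from this period):

using StructArrays

# Slice every field array with a single non-scalar getindex call, so the
# optimized per-array methods (e.g. CategoricalArray's pool-sharing slice)
# run once per column instead of once per element.
function slice_columns(s::StructArray, I::AbstractVector)
    cols = StructArrays.components(s)            # NamedTuple of field arrays
    return StructArray(map(col -> col[I], cols))
end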

EDIT: Also note that CategoricalArrays has been designed on the principle that the number of levels is very small compared to the number of elements. If that's not the case, you're better off using another structure.
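To make that concrete (hypothetical numbers matching the reproduction above):

using CategoricalArrays

loci = "loc" .* string.(1:500_000)   # 500,000 unique strings
cat_loci = categorical(loci)         # the pool ends up with 500,000 levels
# levels(cat_loci) is as large as the data itself, so pool bookkeeping
# (subset checks, merges) dominates; a plain Vector{String} is the better fit.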

nalimilan avatar May 19 '20 10:05 nalimilan

Another solution which could work in the meantime and would be a good idea anyway is to have StructArrays call a higher-level function on the underlying arrays, like getindex with a vector of indices or copyto!. That way the optimized CategoricalArray methods will be used, and they are even faster than what we could achieve with a global table. That should also speed up other custom array types. Do you think that's possible @piever?

I've been thinking about that as well. What happens now is that the overloads of getindex for StructArrays are minimal: only getindex(v, i::Int) or getindex(v, I::Integer...), depending on IndexStyle.

This is a similar approach to e.g. OffsetArrays. The drawback, which I probably underestimated at the time, is, as you noticed, that this interacts poorly with arrays that, for performance reasons, need to define a custom "multi-indices" getindex. CategoricalArrays (due to these pooling considerations) and CuArrays (where getting elements one by one from the GPU is very costly) are very relevant examples here.

The reason for my design decision is that if some transformation is needed (for example from linear to Cartesian indices or vice versa), it can be performed already at getindex(v::StructArray, args...) by Base machinery, which means it happens once.

If I redirected getindex(v::StructArray, args...) to a custom implementation that passes more complex indices to the various arrays, do you know if one can trust the compiler to optimize away the common linear-to-Cartesian (or vice versa) computation? Otherwise, one would need to do this optimization within StructArrays, which could be annoying.

A second concern is that it wouldn't be completely obvious (to me at least) from the array signature how to tell whether it is a "scalar getindex" or a "multi-indices getindex", because in the second case I would want to wrap the result into a StructArray. Is there a simple way to tell from the signature of the call whether it is a "scalar getindex call"?

piever avatar May 19 '20 16:05 piever

What I did in CategoricalArrays is call getindex(A.refs, args...) and check whether the result is an array. Not sure whether that could work for StructArrays too.
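A toy illustration of that trick (simplified; not the actual CategoricalArrays source, and Tagged is a hypothetical wrapper): index the wrapped array with the raw indices, then branch on whether the result comes back as an array (non-scalar call) or a single element (scalar call).

# Hypothetical wrapper type, for illustration only.
struct Tagged{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end
Base.size(t::Tagged) = size(t.data)

function Base.getindex(t::Tagged, I...)
    r = t.data[I...]    # one call into the wrapped array's own methods
    return r isa AbstractArray ? Tagged(r) : r   # re-wrap non-scalar results
end

With this, Tagged(values)[1:1000] would funnel straight into CategoricalArray's optimized non-scalar getindex and re-wrap the result.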

Another solution would be to override Base._unsafe_getindex or Base._unsafe_getindex!, which are called for non-scalar indexing. This sounds relatively safe to do, but unfortunately these aren't part of the public API. Maybe @mbauman could help us? :-p

Regarding the computation of linear/Cartesian indices, I think it would be worth checking whether the compiler is able to generate efficient code for the scalar case: the generated code is quite simple (a few additions), so it may well be able to eliminate redundant conversions. Anyway, for the non-scalar case, I'm not sure it's a big deal to make the conversion multiple times, since the cost should be negligible compared to allocating the new slice. What matters is that the right kind of index is used for each underlying array, and for this, computing indices for each array can be a big win. If compiler optimizations for scalars aren't good enough, what you could do is call invoke on the fallback getindex method if you detect scalar indexing, and otherwise use custom code which calls getindex on each underlying array.

nalimilan avatar May 20 '20 08:05 nalimilan