
`countmap` performance issues?

Open xiaodaigh opened this issue 5 years ago • 11 comments

using StatsBase, SortingAlgorithms

a = rand(Int32, 200_000_000)

@time sort(a, alg = RadixSort) # 4s

@time countmap(a); # 35s

It seems like countmap is an order of magnitude slower than straight sorting. This doesn't make sense. Has countmap performance regressed?

Replicated on Julia 1.2 and 1.3-rc1.

xiaodaigh avatar Aug 30 '19 14:08 xiaodaigh

Sorting is one thing, but even once the data is sorted you need to count unique values. In your example, most values appear only a few times, so most of the work has to be done after sorting.

nalimilan avatar Sep 04 '19 09:09 nalimilan

That said, it should be possible to increase performance significantly by adding a special case for countmap. Currently it uses the same code as addcounts!, which means a dict lookup has to be done for each unique value, even though the key cannot already be present when we start from an empty dict. This line would benefit from making the get call conditional on cm being non-empty at function entry: https://github.com/JuliaStats/StatsBase.jl/blob/fde620993a3037cc236f0c1904ece54d2e416b5c/src/counts.jl#L339

We could also use ht_keyindex2! like in addcounts_dict! to make things faster in the general case where the value may be present, but that's a separate issue.

nalimilan avatar Sep 04 '19 10:09 nalimilan
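
For illustration, here is a minimal sketch of that special case. This is a hypothetical helper, not the actual StatsBase code: when the input is already sorted and cm starts empty, each unique value can be inserted with a plain assignment, skipping the get lookup entirely.

# Hypothetical sketch, not StatsBase's implementation: count runs of equal
# values in a sorted vector and insert each unique value exactly once.
# Assumes cm is empty on entry, so plain assignment replaces get(cm, v, 0) + n.
function addcounts_sorted_sketch!(cm::Dict{T,Int}, sx::AbstractVector{T}) where T
    isempty(sx) && return cm
    v = first(sx)   # value of the current run
    n = 0           # length of the current run
    for x in sx
        if isequal(x, v)
            n += 1
        else
            cm[v] = n   # one insertion per unique value, no lookup needed
            v = x
            n = 1
        end
    end
    cm[v] = n           # flush the final run
    return cm
end

On sorted input this produces the same counts as countmap, e.g. addcounts_sorted_sketch!(Dict{Int32,Int}(), sort(a)) for the example above.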

Maybe just accumulate the counts in an array first, since it's already sorted, then add the array entries one by one into the dict. No need to do a lookup for each of the 200 million elements, and we know there wouldn't be any clashes.

xiaodaigh avatar Sep 04 '19 12:09 xiaodaigh

Since values are unique, wouldn't inserting values from the array into the dict be strictly equivalent to adding them directly without creating a temporary array? (We only do a lookup once for each unique value: that's the whole point of sorting the data first.)

nalimilan avatar Sep 04 '19 13:09 nalimilan

Yeah. Just count them and insert. Should be quick.

xiaodaigh avatar Sep 04 '19 13:09 xiaodaigh

Perhaps this belongs in a new issue, but here is another countmap performance issue:

x = [rand(1:10, 10) for i in 1:30000];
@elapsed countmap.(x) # 1.4s
@elapsed countmap.(x; alg=:dict) # 0.025s

Probably due to

using SortingAlgorithms
@elapsed sort.(x) # 0.015s
@elapsed sort.(x; alg=RadixSort) # 1.4s

Perhaps we should wait for SortingAlgorithms.jl to get faster, but why do we use RadixSort instead of Base.sort? Is there ever a measurable improvement?

LilithHafner avatar Sep 30 '21 13:09 LilithHafner

The SortingAlgorithms issue about this: https://github.com/JuliaCollections/SortingAlgorithms.jl/issues/50

LilithHafner avatar Sep 30 '21 13:09 LilithHafner

> Is there ever a measurable improvement?

Try more than 10 elements. More like 1 million.

xiaodaigh avatar Sep 30 '21 14:09 xiaodaigh

We could use the Dict-based fallback algorithm when the number of entries is small. Should be easy to do if you're willing to make a PR. Apparently the Dict-based method becomes faster below roughly 100 entries:

julia> using BenchmarkTools

julia> x = rand(1:10, 100);

julia> @btime countmap(x, alg=:dict);
  541.827 ns (4 allocations: 544 bytes)

julia> @btime countmap(x, alg=:radixsort);
  473.321 ns (6 allocations: 1.55 KiB)

nalimilan avatar Sep 30 '21 14:09 nalimilan
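
A minimal sketch of what that size-based dispatch could look like; the wrapper name and the 100-entry threshold are illustrative (taken from the benchmark above), not part of StatsBase:

using StatsBase

# Hypothetical wrapper, not StatsBase's actual heuristic: use the Dict-based
# algorithm for short inputs, where radix sort's setup cost dominates.
function countmap_small_input(x::AbstractVector)
    # 100 is the assumed crossover point from the benchmark above
    length(x) < 100 ? countmap(x; alg = :dict) : countmap(x; alg = :radixsort)
end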

Unfortunately, the runtime analysis is not that clear-cut: both methods' runtimes depend on both the number of elements and the number of unique elements.

julia> x = rand(1:100, 100);

julia> @btime countmap(x, alg=:dict);
  1.683 μs (10 allocations: 6.58 KiB)

julia> @btime countmap(x, alg=:radixsort);
  10.597 μs (16 allocations: 80.66 KiB)

When the average number of copies of each element stays low, radix sort never seems to get much better than dict, even at 100 million elements:

using StatsBase, BenchmarkTools

for i in 1:8
    n = Int(floor(10^i))
    a = @belapsed countmap(x; alg=:dict) setup=(x=rand(1:$n, $n)) seconds=1
    b = @belapsed countmap(x; alg=:radixsort) setup=(x=rand(1:$n, $n)) seconds=1
    println(i, ", ", a, ", ", b, ", ", a/b)
end
(columns: i, dict time in seconds, radixsort time in seconds, dict/radixsort ratio)
1, 1.9358407079646017e-7, 9.1e-7, 0.2127297481279782
2, 1.9805999999999997e-6, 1.1715e-5, 0.16906530089628677
3, 2.2033e-5, 4.0679e-5, 0.5416308168834042
4, 0.000249731, 0.000385755, 0.6473824059312258
5, 0.004073793, 0.004781494, 0.8519916578374876
6, 0.067651524, 0.067441364, 1.0031161884566866
7, 1.576012225, 1.651849665, 0.9540893813723659
8, 25.634085222, 26.283673304, 0.975285490940068

Perhaps on a computer with much more RAM than mine, radix sort would pull ahead at 10-100 million elements, but I doubt radix is often the right choice when the average number of copies is low.

LilithHafner avatar Sep 30 '21 16:09 LilithHafner

I think the idea is that the most common case is to have few unique values compared to the number of entries. If that's not the case, alg=:dict is there to choose another algorithm. The docs could mention this.

nalimilan avatar Sep 30 '21 17:09 nalimilan
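
For reference, a short usage sketch of that guidance; the crossover behavior comes from the benchmarks above, and exact timings will vary by machine:

using StatsBase

# Few unique values relative to the input length: the default (radix-sort
# based for integer data) tends to win here.
y = rand(1:10, 10^6)
countmap(y)

# Many unique values: explicitly select the Dict-based algorithm.
x = rand(1:10^6, 10^6)
countmap(x; alg = :dict)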