StatsBase.jl
`countmap` performance issues?
```julia
using StatsBase, SortingAlgorithms

a = rand(Int32, 200_000_000)
@time sort(a, alg = RadixSort)  # 4s
@time countmap(a);              # 35s
```
It seems like `countmap` is an order of magnitude slower than straight sorting. This doesn't make sense. Has `countmap` performance regressed?
Replicated on Julia 1.2 and 1.3-rc1.
Sorting is one thing, but even once the data is sorted you need to count unique values. In your example, most values appear only a few times, so most of the work has to be done after sorting.
That said, it should be possible to increase performance significantly by adding a special case for `countmap`. Currently it uses the same code as `addcounts!`, which means that a dict lookup has to be done for each unique value, even though we know it cannot be there when we started with an empty dict. This line would benefit from the `get` call being made conditional on `cm` not being empty at function entry:
https://github.com/JuliaStats/StatsBase.jl/blob/fde620993a3037cc236f0c1904ece54d2e416b5c/src/counts.jl#L339
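To make the idea concrete, here is a minimal sketch of counting runs in a sorted array without any dict lookups; `counts_from_sorted` is a hypothetical name, not the actual StatsBase internal:

```julia
# Hypothetical sketch: count runs of equal values in a sorted array.
# Because the input is sorted and the dict starts empty, each run's value
# is a new key, so a plain assignment replaces `get(cm, v, 0) + n`.
function counts_from_sorted(sx::AbstractVector{T}) where T
    cm = Dict{T,Int}()
    isempty(sx) && return cm
    last_v = sx[1]
    run = 1
    for i in 2:length(sx)
        v = sx[i]
        if isequal(v, last_v)
            run += 1
        else
            cm[last_v] = run  # no lookup needed: the key cannot be present yet
            last_v = v
            run = 1
        end
    end
    cm[last_v] = run
    return cm
end
```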
We could also use `ht_keyindex2!` like in `addcounts_dict!` to make things faster in the general case where the value may be present, but that's a separate issue.
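For reference, a rough sketch of that pattern, assuming the undocumented `Dict` internals used by `addcounts_dict!` (`Base.ht_keyindex2!` returns a positive slot index when the key exists and the negated insertion slot when it doesn't):

```julia
# Sketch: one hash computation per element via Dict internals.
# This leans on undocumented Base internals and may break across Julia versions.
function addcounts_via_internals!(cm::Dict{T,Int}, x::AbstractArray{T}) where T
    for v in x
        index = Base.ht_keyindex2!(cm, v)
        if index > 0
            @inbounds cm.vals[index] += 1                # key present: bump in place
        else
            @inbounds Base._setindex!(cm, 1, v, -index)  # insert at the known slot
        end
    end
    return cm
end
```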
Maybe just sum it up in an array first, since it's already sorted, then add the array entries one by one into the dict. So no need to look up 1 billion times, and we know there wouldn't be any clashes. (Rough sketch below.)
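A minimal sketch of that two-pass variant, under the same sorted-input assumption (all names hypothetical):

```julia
# Hypothetical two-pass variant: collect run values and lengths into plain
# vectors first, then bulk-insert into a pre-sized Dict. Every key is unique
# by construction, so insertion is a plain assignment with no clash handling.
function counts_via_arrays(sx::AbstractVector{T}) where T
    isempty(sx) && return Dict{T,Int}()
    vals, cnts = T[], Int[]
    last_v, run = sx[1], 1
    for i in 2:length(sx)
        if isequal(sx[i], last_v)
            run += 1
        else
            push!(vals, last_v); push!(cnts, run)
            last_v, run = sx[i], 1
        end
    end
    push!(vals, last_v); push!(cnts, run)
    cm = sizehint!(Dict{T,Int}(), length(vals))
    for (v, c) in zip(vals, cnts)
        cm[v] = c
    end
    return cm
end
```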
Since values are unique, wouldn't inserting values from the array into the dict be strictly equivalent to adding them directly without creating a temporary array? (We only do a lookup once for each unique value: that's the whole point of sorting the data first.)
Yeah. Just count them and insert. Should be quick.
Perhaps this belongs in a new issue, but here is another `countmap` performance issue:
```julia
x = [rand(1:10, 10) for i in 1:30000];
@elapsed countmap.(x)             # 1.4s
@elapsed countmap.(x; alg=:dict)  # 0.025s
```
Probably due to:

```julia
using SortingAlgorithms
@elapsed sort.(x)                 # 0.015s
@elapsed sort.(x; alg=RadixSort)  # 1.4s
```
Perhaps we should wait for SortingAlgorithms.jl to get faster, but why do we use `RadixSort` instead of `Base.sort`? Is there ever a measurable improvement?
The SortingAlgorithms issue about this: https://github.com/JuliaCollections/SortingAlgorithms.jl/issues/50
> Is there ever a measurable improvement?
Try more than 10 elements. More like 1 million.
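For example, something along these lines should show the difference (timings omitted since they vary by machine):

```julia
using BenchmarkTools, SortingAlgorithms

y = rand(Int32, 1_000_000)
@btime sort($y);                   # Base's default algorithm
@btime sort($y, alg = RadixSort);  # radix sort from SortingAlgorithms
```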
We could use the fallback algorithm, which is based on `Dict`, when the number of entries is small. Should be easy to do if you're willing to make a PR. Apparently it becomes faster below around 100 entries (see the sketch after the benchmark):
```julia
julia> using BenchmarkTools

julia> x = rand(1:10, 100);

julia> @btime countmap(x, alg=:dict);
  541.827 ns (4 allocations: 544 bytes)

julia> @btime countmap(x, alg=:radixsort);
  473.321 ns (6 allocations: 1.55 KiB)
```
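Concretely, the proposed dispatch could look something like this; `countmap_small_fallback` and the cutoff constant are made-up names, and the value 100 comes from the benchmark above (it is machine-dependent):

```julia
using StatsBase

# Hypothetical size-based dispatch: below the cutoff, the Dict fallback wins;
# above it, the radix-sort path amortizes its allocation overhead.
const SMALL_CUTOFF = 100

countmap_small_fallback(x::AbstractVector) =
    length(x) < SMALL_CUTOFF ? countmap(x; alg = :dict) :
                               countmap(x; alg = :radixsort)
```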
Unfortunately, the runtime analysis is not that clear-cut. Both methods' runtimes depend on both the number of elements and the number of unique elements.
```julia
julia> x = rand(1:100, 100);

julia> @btime countmap(x, alg=:dict);
  1.683 μs (10 allocations: 6.58 KiB)

julia> @btime countmap(x, alg=:radixsort);
  10.597 μs (16 allocations: 80.66 KiB)
```
When the average number of copies of each element stays low, radix sort never seems to get much better than dict, even at 100 million elements:
```julia
using BenchmarkTools

for i in 1:8
    n = 10^i
    a = @belapsed countmap(x; alg=:dict) setup=(x = rand(1:$n, $n)) seconds=1
    b = @belapsed countmap(x; alg=:radixsort) setup=(x = rand(1:$n, $n)) seconds=1
    println(i, ", ", a, ", ", b, ", ", a/b)
end
```

Output columns: `i`, `:dict` time (s), `:radixsort` time (s), and the ratio `dict/radixsort`:

```
1, 1.9358407079646017e-7, 9.1e-7, 0.2127297481279782
2, 1.9805999999999997e-6, 1.1715e-5, 0.16906530089628677
3, 2.2033e-5, 4.0679e-5, 0.5416308168834042
4, 0.000249731, 0.000385755, 0.6473824059312258
5, 0.004073793, 0.004781494, 0.8519916578374876
6, 0.067651524, 0.067441364, 1.0031161884566866
7, 1.576012225, 1.651849665, 0.9540893813723659
8, 25.634085222, 26.283673304, 0.975285490940068
```
Perhaps on a computer with much more RAM than mine, radix sort would pull ahead at 10-100 million elements, but I doubt radix is frequently the right choice when the average number of copies is low.
I think the idea is that the most common case is having few unique values compared to the number of entries. If that's not the case, `alg=:dict` is there to choose another algorithm. The docs could mention this.
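For instance, the `countmap` docstring could gain a note along these lines (suggested wording, not current documentation):

```
When applicable, the radix-sort path is chosen automatically; it is fastest
when there are few unique values relative to the number of entries. With many
unique values or very small inputs, `alg = :dict` is usually faster.
```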