Speed up nmatrix basic stats and arithmetic operations
NMatrix statistics and math methods are painfully slow.
For example,
require 'nmatrix'
require 'benchmark'

n = NMatrix.new([1000], [1]*1000)
a = [1]*1000

Benchmark.bm do |x|
  x.report("nm")   { n.sum }
  x.report("arry") { a.inject(:+) }
end
#            user     system      total        real
# nm     0.040000   0.000000   0.040000 (  0.041888)
# arry   0.000000   0.000000   0.000000 (  0.000218)
Benchmark.bm do |x|
  x.report("nm")   { n.mean }
  x.report("arry") { a.inject(:+) / a.size }
end
#            user     system      total        real
# nm     0.040000   0.000000   0.040000 (  0.040449)
# arry   0.000000   0.000000   0.000000 (  0.000200)
Maybe port them to C?
Eww, that is painfully slow. However, I'd like — if possible — a little more information on why it's slow.
I notice that mean calls inject_rank: https://github.com/SciRuby/nmatrix/blob/721942cc865f01c3d1092868c8aeb6e4bed40466/lib/nmatrix/enumerate.rb#L217
I notice that inject_rank creates a new NMatrix object, as does each_rank, probably in the form of a reference.
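To make that concrete, here's a rough paraphrase of the reduction pattern in the linked enumerate.rb (simplified and partly assumed, not the actual source):

# Sketch only: inject_rank walks each_rank and folds whole sub-matrix
# slices with the block, so every step goes through NMatrix slicing
# and dispatch instead of plain Float arithmetic.
def inject_rank_sketch(m, dimen = 0)
  acc = nil
  m.each_rank(dimen) do |sub|   # each `sub` is itself an NMatrix slice
    acc = acc.nil? ? sub : yield(acc, sub)
  end
  acc
end

# sum over a 1-D matrix is then roughly
#   inject_rank_sketch(n) { |a, b| a + b }
# i.e. NMatrix + NMatrix at every step, never Float + Float.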
I'm not sure that's the source of the problem, but it'd be nice to know where the slow-down is rather than just saying, "Oh, let's rewrite in C." Any statistical functions are going to need to make use of the enumerability of matrices, and that code is written in C/C++ — and is probably already almost as fast as it'll ever be.
Yeah, we do need more specific benchmarks (not only in this case, but in general, so we can assert that no performance degradation happens in future updates).
I profiled Sameer's example (except with 100000 elements): https://rawgit.com/wlevine/f4f6954e14b0718f10df/raw/nmatrix_sum_profile.html
It doesn't look like there is just one culprit.
One thing I noticed is that it was spending a lot of time in NMatrix#cast, but for this example it shouldn't have to do any casts. So I tried a fix to avoid this: https://github.com/SciRuby/nmatrix/compare/master...wlevine:issue362
But this only sped up the test by ~10%. I think the basic problem is that summing a 1-D array is just adding up a bunch of numbers, but what we are actually doing is adding a bunch of 1x1 NMatrix objects, which creates a lot of overhead. So this could possibly relate to #351.
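A quick probe of that theory (hypothetical, not from the thread): time the element-wise additions directly and compare with plain integer addition.

one = NMatrix.new([1], [1])
Benchmark.bm do |x|
  x.report("1000 nm adds")  { 1000.times { one + one } }
  x.report("1000 int adds") { 1000.times { 1 + 1 } }
end
# If the nm row dwarfs the int row, the per-element NMatrix
# allocation/dispatch cost is the bottleneck, not the loop itself.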
Well, I think there's an easier way to do what you did in that fix. Keep in mind that each_with_indices is going to call dense_map also, so you're adding an extra layer there. Simply put, sometimes it's casting a matrix of, say, type :float64 to :float64, which is unnecessary.
So you just need an if statement in that function which checks whether to cast or not. That should give you more of a speed-up.
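Something along these lines, inside whatever method currently does the unconditional cast (placement and exact receiver assumed):

# Hypothetical guard: only cast when the result differs from self.
unless result.stype == self.stype && result.dtype == self.dtype
  result = result.cast(self.stype, self.dtype)
end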
I also think you're correct that it does relate to #351. Excellent observation.
The issue is that dense_map always returns a :object NMatrix. So in the original code, the cast is necessary.
Oh, true. Well, in that case, I'm having trouble figuring out why your modification should be faster than what was there before. It seems like you're doing an extra copy. Hmm.
Doesn't the modification use nm_dense_each_with_indices internally and not nm_dense_map?
That does something slightly different, in that it would cast each element individually from a ruby object to the appropriate dtype (during slice assignment), but doesn't actually construct an intermediate object-dtype nmatrix. That could possibly account for the 10%?
Agree with John that we should figure out why this is so slow. There are really two separate issues here: why this is slower than doing it on an array, and why it's not faster. We should be able to get it approximately as fast (within an order of magnitude) as using an array without reimplementing in C. If we then want to make the performance even better, we can reimplement in C.
One additional benchmark that's an interesting comparison:
Benchmark.bm do |x|
  x.report("nm")      { n.sum }
  x.report("ary")     { a.inject(:+) }
  x.report("nm_to_a") { n.to_a.inject(:+) }
end
#               user     system      total        real
# nm        0.020000   0.000000   0.020000 (  0.019507)
# ary       0.000000   0.000000   0.000000 (  0.000141)
# nm_to_a   0.000000   0.000000   0.000000 (  0.000866)
So we're only a factor of ~6 slower if we just convert to a ruby array first and then sum, vs. >100x slower if we do it entirely in nmatrix. We should at least be able to hit the factor of 6 without reimplementing in C! to_a uses slicing internally and then appends each element to an array in ruby, so it's not exactly the model of efficiency either...
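For reference, a hypothetical interim version of sum along those lines (just the shape of the workaround, not a proposed patch):

class NMatrix
  def sum_via_to_a               # hypothetical name
    to_a.flatten.inject(:+)      # flatten handles N-dimensional matrices
  end
end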
I don't think "Let's reimplement in C" is a good strategy every time something is too slow. I mean, a lot of this stuff was written in C and it made compilation incredibly slow, and also maintenance was impossible.
I do think it's okay to implement some things in C, but if this is our strategy, everything ends up in C.
One fear I have is that our strategy for avoiding garbage collection is creating a lot of overhead. Is that a possibility?
It's a possibility. I did some general benchmarking back when I added it, and it was sort of in the ~5% slowdown range on average, but I don't think I benchmarked this thing in particular, which certainly makes more than average usage of it. Definitely worth looking into.
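One crude first probe (hypothetical; note that GC.disable only removes collection passes, not the registration bookkeeping itself, so it only isolates part of the cost):

Benchmark.bm do |x|
  x.report("nm")         { n.sum }
  x.report("nm, GC off") { GC.disable; n.sum; GC.enable }
end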
How about we follow a coding strategy where creation of ruby objects in C code is done only when _absolutely_ necessary? All the other times they'll be created in ruby and passed down to C. This might make functions larger, but I don't think that should be a big problem. I say this because, as far as I know, objects created in ruby don't need the special GC protection mechanism employed in nmatrix.
I think that in the mapping functions, all the ruby objects are created in ruby. In particular, the special GC protection has to address the case where the following happens (see the sketch after the list):
- NMatrix (say a float dtype) yields float to ruby
- ruby does some operation that returns an object
- NMatrix stores the object that was created in ruby
- NMatrix yields the next float; now there's no longer a reference to the first object anywhere in ruby
- GC runs, possibly collecting the object from the first iteration
- NMatrix tries to do something with the stored objects that ruby returned
- segfault
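In code, the hazardous pattern looks roughly like this (illustrative only):

# Each yielded Float produces a fresh ruby object that, between
# iterations, is referenced only from NMatrix's internal storage:
result = n.map { |x| x.to_s }   # :object result holds ruby Strings
# Without the protection, a GC pass mid-loop could collect a String
# stored on an earlier iteration, and a later read of result[0]
# would touch freed memory -> segfault.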