Speed up nmatrix basic stats and arithmetic operations
NMatrix statistics and math methods are painfully slow.
For example,
require 'nmatrix'
require 'benchmark'

n = NMatrix.new([1000], [1]*1000)
a = [1]*1000

Benchmark.bm do |x|
  x.report("nm")   { n.sum }
  x.report("arry") { a.inject(:+) }
end
#            user     system      total        real
# nm     0.040000   0.000000   0.040000 (  0.041888)
# arry   0.000000   0.000000   0.000000 (  0.000218)
Benchmark.bm do |x|
  x.report("nm")   { n.mean }
  x.report("arry") { a.inject(:+) / a.size }
end
#            user     system      total        real
# nm     0.040000   0.000000   0.040000 (  0.040449)
# arry   0.000000   0.000000   0.000000 (  0.000200)
Maybe port them to C?
Eww, that is painfully slow. However, I'd like — if possible — a little more information on why it's slow.
I notice that mean calls inject_rank: https://github.com/SciRuby/nmatrix/blob/721942cc865f01c3d1092868c8aeb6e4bed40466/lib/nmatrix/enumerate.rb#L217
I notice that inject_rank creates a new NMatrix object, as does each_rank, probably in the form of a reference.
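To make that concrete, here's a rough paraphrase of the reduction pattern in the linked enumerate.rb (simplified and partly assumed, not the actual source):

# Sketch only: inject_rank walks each_rank and folds whole sub-matrix
# slices with the block, so every step goes through NMatrix slicing
# and dispatch instead of plain Float arithmetic.
def inject_rank_sketch(m, dimen = 0)
  acc = nil
  m.each_rank(dimen) do |sub|   # each `sub` is itself an NMatrix slice
    acc = acc.nil? ? sub : yield(acc, sub)
  end
  acc
end

# sum over a 1-D matrix is then roughly
#   inject_rank_sketch(n) { |a, b| a + b }
# i.e. NMatrix + NMatrix at every step, never Float + Float.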
I'm not sure that's the source of the problem, but it'd be nice to know where the slow-down is rather than just saying, "Oh, let's rewrite in C." Any statistical functions are going to need to make use of the enumerability of matrices, and that code is written in C/C++ — and is probably already almost as fast as it'll ever be.
Yeah, we do need more specific benchmarks (not only in this case, but in general, so we can assert that no performance degradation happens in future updates).
I profiled Sameer's example (except with 100000 elements): https://rawgit.com/wlevine/f4f6954e14b0718f10df/raw/nmatrix_sum_profile.html
It doesn't look like there is just one culprit.
One thing I noticed is that it was spending a lot of time in NMatrix#cast, but for this example it shouldn't have to do any casts. So I tried a fix to avoid this: https://github.com/SciRuby/nmatrix/compare/master...wlevine:issue362
But this only sped up the test by ~10%. I think the basic problem is that summing a 1-D array is just adding up a bunch of numbers, but what we are actually doing is adding a bunch of 1x1 NMatrix objects, which creates a lot of overhead. So this could possibly relate to #351.
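A quick probe of that theory (hypothetical, not from the thread): time the element-wise additions directly and compare with plain integer addition.

one = NMatrix.new([1], [1])
Benchmark.bm do |x|
  x.report("1000 nm adds")  { 1000.times { one + one } }
  x.report("1000 int adds") { 1000.times { 1 + 1 } }
end
# If the nm row dwarfs the int row, the per-element NMatrix
# allocation/dispatch cost is the bottleneck, not the loop itself.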
Well, I think there's an easier way to do what you did in that fix. Keep in mind that each_with_indices is going to call dense_map also, so you're adding an extra layer there. Simply put, sometimes it's casting a matrix of, say, type :float64 to :float64, which is unnecessary.
So you just need an if statement in that function which checks whether to cast or not. That should give you more of a speed-up.
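Something along these lines, inside whatever method currently does the unconditional cast (placement and exact receiver assumed):

# Hypothetical guard: only cast when the result differs from self.
unless result.stype == self.stype && result.dtype == self.dtype
  result = result.cast(self.stype, self.dtype)
end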
I also think you're correct that it does relate to #351. Excellent observation.
The issue is that dense_map always returns a :object NMatrix. So in the original code, the cast is necessary.
Oh, true. Well, in that case, I'm having trouble figuring out why your modification should be faster than what was there before. It seems like you're doing an extra copy. Hmm.
Doesn't the modification use nm_dense_each_with_indices internally and not nm_dense_map?
That does something slightly different, in that it would cast each element individually from a ruby object to the appropriate dtype (during slice assignment), but doesn't actually construct an intermediate object-dtype nmatrix. That could possibly account for the 10%?
Agree with John that we should figure out why this is so slow. There are really two separate issues here: why this is slower than doing it on an array, and why it's not faster. We should be able to get it approximately as fast (within an order of magnitude) as using an array without reimplementing in C. If we then want to make the performance even better, we can reimplement in C.
One additional benchmark that's an interesting comparison:
Benchmark.bm do |x|
  x.report("nm")      { n.sum }
  x.report("ary")     { a.inject(:+) }
  x.report("nm_to_a") { n.to_a.inject(:+) }
end
#               user     system      total        real
# nm        0.020000   0.000000   0.020000 (  0.019507)
# ary       0.000000   0.000000   0.000000 (  0.000141)
# nm_to_a   0.000000   0.000000   0.000000 (  0.000866)
So we're only a factor of ~6 slower if we just convert to a ruby array first and then sum, vs. >100x slower if we do it entirely in nmatrix. We should at least be able to hit the factor of 6 without reimplementing in C! to_a uses slicing internally and then appends each element to an array in ruby, so it's not exactly the model of efficiency either...
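For reference, a hypothetical interim version of sum along those lines (just the shape of the workaround, not a proposed patch):

class NMatrix
  def sum_via_to_a               # hypothetical name
    to_a.flatten.inject(:+)      # flatten handles N-dimensional matrices
  end
end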
I don't think "Let's reimplement in C" is a good strategy every time something is too slow. I mean, a lot of this stuff was written in C and it made compilation incredibly slow, and also maintenance was impossible.
I do think it's okay to implement some things in C, but if this is our strategy, everything ends up in C.
One fear I have is that our strategy for avoiding garbage collection is creating a lot of overhead. Is that a possibility?
It's a possibility. I did some general benchmarking back when I added it, and it was sort of in the ~5% slowdown range on average, but I don't think I benchmarked this thing in particular, which certainly makes more than average usage of it. Definitely worth looking into.
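One crude first probe (hypothetical; note that GC.disable only removes collection passes, not the registration bookkeeping itself, so it only isolates part of the cost):

Benchmark.bm do |x|
  x.report("nm")         { n.sum }
  x.report("nm, GC off") { GC.disable; n.sum; GC.enable }
end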
How about we follow a coding strategy where creation of ruby objects in C code is done only when _absolutely_ necessary? All the other times they'll be created in ruby and passed down to C. This might make functions larger, but I don't think that should be a big problem. I say this because, as far as I know, objects created in ruby don't need the special GC protection mechanism employed in nmatrix.
I think that in the mapping functions, all the ruby objects are created in ruby. In particular, the special GC protection has to address the case where the following happens (see the sketch after the list):
- NMatrix (say a float dtype) yields float to ruby
- ruby does some operation that returns an object
- NMatrix stores the object that was created in ruby
- NMatrix yields the next float; now there's no longer a reference to the first object anywhere in ruby
- GC runs, possibly collecting the object from the first iteration
- NMatrix tries to do something with the stored objects that ruby returned
- segfault
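In code, the hazardous pattern looks roughly like this (illustrative only):

# Each yielded Float produces a fresh ruby object that, between
# iterations, is referenced only from NMatrix's internal storage:
result = n.map { |x| x.to_s }   # :object result holds ruby Strings
# Without the protection, a GC pass mid-loop could collect a String
# stored on an earlier iteration, and a later read of result[0]
# would touch freed memory -> segfault.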