truffleruby
truffleruby copied to clipboard
Raytracer benchmark is 34x slower than with MRI
I was playing with https://github.com/edin/raytracer benchmarks to see how different versions of Ruby perform as well as Crystal https://www.ruby-forum.com/t/ruby-raytracer-in-crystal/261230/3
I wonder if that has something to do with my environment or is there something specific to TruffleRuby/Raytracer benchmark which makes it so slow?
The results are for 2012 MacBook running MacOS Catalina and I'm using using asfd to install different Ruby versions.
% sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
MRI
% ruby --version
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin19]
% ruby RayTracer.rb
Completed in 28840.049 ms
TruffleRuby
% ruby --version
truffleruby 21.1.0, like ruby 2.7.2, GraalVM CE Native [x86_64-darwin]
% ruby RayTracer.rb
Completed in 983755.151 ms
Thanks very much for opening the issue. I can reproduce.
This is a performance bug in String#[]=, called from ImageRuby::RubyBitmapModl#set_pixel.
% bundle exec ruby --cpusampler --cpusampler.Output=calltree --cpusampler.Delay=10000 --cpusampler.Mode=roots RayTracer.rb > truffleruby.txt
Completed in 217357.88700000002 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------
Sampling CallTree. Recorded 191260 samples with period 1ms.
Self Time: Time spent on the top of the stack.
Total Time: Time spent somewhere on the stack.
Opt %: Percent of time spent in compiled and therefore non-interpreted code.
-----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
-----------------------------------------------------------------------------------------------------------------------------------------------------
<main> | 191260ms 100.0% | 0.0% || 4ms 0.0% | 0.0% | RayTracer.rb~1-296:0-7635
RayTracer#render | 191213ms 100.0% | 0.0% || 0ms 0.0% | 0.0% | RayTracer.rb~259-265:6230-6603
Integer#times | 191213ms 100.0% | 0.0% || 5ms 0.0% | 0.0% | resource:/truffleruby/core/integer.rb~143-152:4359-4505
block in RayTracer#render | 191208ms 100.0% | 11.9% || 0ms 0.0% | 0.0% | RayTracer.rb~260-261:6284-6343
block (2 levels) in RayTracer#render | 191128ms 99.9% | 99.2% || 134ms 0.1% | 79.1% | RayTracer.rb~261-264:6314-6589
ImageRuby::RubyBitmapModl#set_pixel | 189089ms 98.9% | 99.4% || 76ms 0.0% | 38.2% | .vendor/truffleruby/2.7.2.1/gems/imageruby-0.2.5/lib/imageruby/bitmap/rbbitmap.rb~65-79:1772-2131
String#[]= | 188969ms 98.8% | 99.6% || 38ms 0.0% | 52.6% | resource:/truffleruby/core/string.rb~1033-1063:27359-28062
String#assign_index | 188931ms 98.8% | 99.5% || 188919ms 98.8% | 99.5% | resource:/truffleruby/core/string.rb~1065-1097:28065-28977
I hacked around this.
diff --git a/ruby/RayTracer.rb b/ruby/RayTracer.rb
index 687e668..d49e972 100644
--- a/ruby/RayTracer.rb
+++ b/ruby/RayTracer.rb
@@ -257,11 +257,13 @@ class RayTracer
end
def render(scene, image, screenWidth, screenHeight)
+ pixels = []
screenHeight.times do |y|
screenWidth.times do |x|
color = self.traceRay(Ray.new(scene.camera().pos, self.getPoint(x, y, screenWidth, screenHeight, scene.camera())), scene, 0)
r,g,b = Color.toDrawingColor(color)
- image.set_pixel(x,y, ImageRuby::Color.from_rgba(r,g,b, 255))
+ # image.set_pixel(x,y, ImageRuby::Color.from_rgba(r,g,b, 255))
+ pixels.push ImageRuby::Color.from_rgba(r,g,b, 255)
end end end
end
This fixed it and it's now a lot faster.
% bundle exec ruby --cpusampler --cpusampler.Output=calltree --cpusampler.Delay=10000 --cpusampler.Mode=roots RayTracer.rb > truffleruby2.txt
Completed in 7088.765 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------
Sampling CallTree. Recorded 6023 samples with period 1ms.
Self Time: Time spent on the top of the stack.
Total Time: Time spent somewhere on the stack.
Opt %: Percent of time spent in compiled and therefore non-interpreted code.
-----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
-----------------------------------------------------------------------------------------------------------------------------------------------------
<main> | 6023ms 100.0% | 0.0% || 7ms 0.1% | 0.0% | RayTracer.rb~1-298:0-7712
RayTracer#render | 5892ms 97.8% | 0.0% || 0ms 0.0% | 0.0% | RayTracer.rb~259-267:6230-6680
Integer#times | 5892ms 97.8% | 0.0% || 2ms 0.0% | 0.0% | resource:/truffleruby/core/integer.rb~143-152:4359-4505
block in RayTracer#render | 5890ms 97.8% | 3.7% || 2ms 0.0% | 0.0% | RayTracer.rb~261-262:6300-6359
block (2 levels) in RayTracer#render | 5868ms 97.4% | 77.6% || 182ms 3.0% | 28.0% | RayTracer.rb~262-266:6330-6666
RayTracer#traceRay | 4820ms 80.0% | 76.9% || 48ms 0.8% | 25.0% | RayTracer.rb~211-214:4125-4276
RayTracer#shade | 4243ms 70.4% | 82.7% || 94ms 1.6% | 59.6% | RayTracer.rb~216-224:4279-4831
RayTracer#getNaturalColor | 2780ms 46.2% | 79.2% || 61ms 1.0% | 78.7% | RayTracer.rb~230-234:5001-5196
block in RayTracer#getNaturalColor | 2719ms 45.1% | 87.1% || 144ms 2.4% | 88.2% | RayTracer.rb~232:5083-5180
RayTracer#addLight | 2575ms 42.8% | 13.9% || 1513ms 25.1% | 6.9% | RayTracer.rb~236-251:5199-5897
RayTracer#testRay | 658ms 10.9% | 65.0% || 49ms 0.8% | 55.1% | RayTracer.rb~206-209:4025-4122
RayTracer#intersections | 609ms 10.1% | 59.8% || 156ms 2.6% | 61.5% | RayTracer.rb~193-204:3735-4022
chrisseaton@Chriss-MacBook-Pro ruby % ruby -v
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-darwin20]
chrisseaton@Chriss-MacBook-Pro ruby % bundle exec ruby RayTracer.rb
Completed in 15496.454 ms
chrisseaton@Chriss-MacBook-Pro ruby % ruby -v
truffleruby 21.2.0-dev-ca823dac, like ruby 2.7.2, GraalVM CE JVM [x86_64-darwin]
chrisseaton@Chriss-MacBook-Pro ruby % bundle exec ruby RayTracer.rb
Completed in 5714.599 ms
Intuitively I can understand what this performance problem is, but I'll dig into it a bit more and hope to provide a deep-dive answer and fix.
Mmh, so the image is represented as a long String and heavily mutated by String#[]= (used like a buffer)?
https://github.com/tario/imageruby/blob/ba38c3e4a5c8c068566501f0e73357fd219535e4/lib/imageruby/bitmap/rbbitmap.rb#L65
Also that code could probably use setbyte for clarity and efficiency.
One thing that would be worth trying is to use a native String there, that should handle being used as a buffer much better.
It seems a strange design honestly, and it wouldn't even be possible in languages with immutable strings. I would think an Array of Integer, like used in OptCarrot and many other places is way more efficient, both for CRuby and TruffleRuby. Also for a benchmark it seems best to avoid dependencies anyway, maybe we should open an issue/PR on that benchmark repo?
@chrisseaton Could you share the time for CRuby 2.7.2 without your change? I guess it's actually slower than with your change, and so it seems clear that a String image representation is just inefficient for all Ruby implementations.
Without the change:
Completed in 17894.449 ms
At some point TruffleRuby needs to find a way to automatically figure out a string is being used as a buffer and optimise for it.
So that's 1.15x (17894ms -> 15496ms) faster on CRuby when using an intuitive representation (Array) vs a String used like a buffer.
So even on CRuby it seems problematic to use a String as a buffer / use String#[]=.
But the effect is much worse on TruffleRuby. On the upside it makes it really easy to see the problem in the benchmark here.
automatically figure out a string is being used as a buffer and optimise for it.
Yes, that could be nice, probably the first []= or setbyte should migrate to a buffer-like representation (whether it's native or managed). A managed buffer-like String representation seems a high effort, so I would say it's not worth it just for this benchmark.
It seems well-written Ruby code doesn't hit this issue often, so I'm not sure how much of an issue it is for actual production code.
FWIW the Crystal version seems to use an array (of RGBA structs), so https://www.ruby-forum.com/t/ruby-raytracer-in-crystal/261230 is just making an unfair comparison, and the only way to fix that part is to fix the Ruby version of the benchmark.
We actually have a similar issue with setbyte: https://github.com/oracle/truffleruby/issues/2044#issuecomment-668524452
And some discussion there on possible fixes.
FWIW the Crystal version seems to use an array (of RGBA structs), so https://www.ruby-forum.com/t/ruby-raytracer-in-crystal/261230 is just making an unfair comparison, and the only way to fix that part is to fix the Ruby version of the benchmark.
@eregon this looks like a WIP to replace imagery gem for Ruby implementation of benchmark https://github.com/edin/raytracer/blob/master/ruby/Image.rb#L77
I've got some preliminary fixes in https://github.com/oracle/truffleruby/pull/2341.
I'm then going to focus on making this benchmark faster.
require 'benchmark/ips'
puts RUBY_DESCRIPTION
image = '0' * 10_000
Benchmark.ips do |x|
x.report('alloc') do
image = '0' * 10_000
image.size.times do |n|
image[n] = (n % 100).chr
end
end
x.report('no-alloc') do
image.size.times do |n|
image[n] = (n % 100).chr
end
end
end
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-darwin20]
Warming up --------------------------------------
alloc 48.000 i/100ms
no-alloc 47.000 i/100ms
Calculating -------------------------------------
alloc 510.498 (± 4.3%) i/s - 2.592k in 5.088010s
no-alloc 535.121 (± 1.3%) i/s - 2.679k in 5.007249s
truffleruby 21.2.0-dev-13f2d44f, like ruby 2.7.2, GraalVM CE JVM [x86_64-darwin]
Warming up --------------------------------------
alloc 5.000 i/100ms
no-alloc 5.000 i/100ms
Calculating -------------------------------------
alloc 104.189 (±15.4%) i/s - 500.000 in 5.008380s
no-alloc 106.172 (±14.1%) i/s - 510.000 in 4.998794s
TruffleRuby does actually do some work to try to flatten big concat-trees that works well for this case... but it still always allocates a new node for each write which is not what we want.