Psych.load (for YAML loading) slow
Psych.load is slower in TruffleRuby than in MRI, especially when cold. Below is code to produce a YAML file with a similar format and size, the test script (it takes ~40s to get hot for me), and the output. The actual YAML file takes ~8s to load and has ~1M characters.
|  | Cold | Hot |
|---|---|---|
| MRI | 0.198234s | 5.306 i/s |
| TruffleRuby | 6.483627s | 3.237 i/s |
letters = "qwertyuiopasdfghjklzxcvbnm"
file_name = "YAML.txt"
fruits = ["apple", "banana", "kiwi", "orange", "pear"]
File.write(file_name, "# This is a very large yml file\n")
File.write(file_name, "---\n", mode: "a")
1835.times do
  length = rand(4..15)
  object_name = ""
  length.times { object_name += letters[rand(26)] }
  modifier = letters[rand(26)] + rand(10).to_s
  File.write(file_name, object_name + "_" + modifier + ":\n", mode: "a")
  File.write(file_name, " reference: " + object_name + "_" + modifier + "\n", mode: "a")
  File.write(file_name, " name: " + object_name.capitalize + "\n", mode: "a")
  File.write(file_name, " name_again: " + object_name.capitalize + "\n", mode: "a")
  File.write(file_name, " modifier: " + modifier + "\n", mode: "a")
  File.write(file_name, " random_float: " + rand(0..0.2).to_s + "\n", mode: "a")
  File.write(file_name, " another_field: another_value\n", mode: "a")
  File.write(file_name, " list:\n", mode: "a")
  fruit = fruits[rand(5)]
  File.write(file_name, " - " + fruit + "\n", mode: "a")
  File.write(file_name, " fruit: " + fruit + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " random_strings:\n", mode: "a")
  File.write(file_name, " string1: " + object_name + "/" + object_name + "_" + modifier + "." + random_string + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " string2: " + object_name + "/" + object_name + "_" + modifier + "." + random_string + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " random_strings_again:\n", mode: "a")
  File.write(file_name, " string1: " + object_name + "/" + object_name + "_" + modifier + "_again." + random_string + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " string2: " + object_name + "/" + object_name + "_" + modifier + "_again." + random_string + "\n", mode: "a")
end
require 'psych'
require 'benchmark'
require 'benchmark/ips'
file_name = "YAML.txt"
puts Benchmark.measure { Psych.load(File.read(file_name)) }
Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 50)
  x.report('Psych.load') { Psych.load(File.read(file_name)) }
end
puts Benchmark.measure { Psych.load(File.read(file_name)) }
ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin19]
0.191097 0.005769 0.196866 ( 0.198234)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 5.306 (±18.8%) i/s - 27.000 in 5.140672s
0.189628 0.001635 0.191263 ( 0.191423)
truffleruby (Shopify) 20.3.0-dev-151cfcb0, like ruby 2.6.6, GraalVM CE JVM [x86_64-darwin]
25.490235 0.578915 26.069150 ( 6.483627)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 3.237 (± 0.0%) i/s - 17.000 in 5.268170s
0.529269 0.001507 0.530776 ( 0.310655)
Thanks for the report (internal issue: GR-29306).
Some numbers on my local machine, for
require 'psych'
require 'benchmark'
file_name = "YAML.txt"
puts Benchmark.measure {
  Psych.load(File.read(file_name))
}
with
$ for i in 1 2 3; do someruby bench_yaml.rb; done
reporting the median result and the full process time.
oracle/truffleruby@3c7e1715e37d95c8f79e1e1c7228e66f35f77881 oracle/graal@c616f346ae4552c04be74fe927c81f6f45fd6e6e
- MRI 2.6: 0.13s / 0.206
- Native CE: 5.76s / 6.653
- Native CE --engine.MultiTier: 2.66s / 3.528
- Native CE --engine.MultiTier --engine.Splitting=false: 2.91s / 3.637
- Native CE --engine.Mode=latency: 2.04s / 2.761
$ jt -u native ruby --experimental-options --engine.MultiTier --cpusampler --cpusampler.Mode=roots --cpusampler.SampleInternal bench_yaml.rb
12.066808 0.142028 12.208836 ( 2.827103)
----------------------------------------------------------------------------------------------------------------------------------------------------
Sampling Histogram. Recorded 3211 samples with period 1ms.
Self Time: Time spent on the top of the stack.
Total Time: Time spent somewhere on the stack.
Opt %: Percent of time spent in compiled and therefore non-interpreted code.
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
yaml_parser_scan_plain_scalar | 874ms 27.2% | 4.5% || 733ms 22.8% | 2.7% | src/main/c/psych/yaml/scanner.c~3389:92048-92120
parse | 1880ms 58.5% | 0.0% || 305ms 9.5% | 0.0% | src/main/c/psych/psych_parser.c~251:5989-6041
yaml_parser_update_buffer | 185ms 5.8% | 7.6% || 184ms 5.7% | 7.1% | src/main/c/psych/yaml/reader.c~142:3763-3825
rb_ary_store | 198ms 6.2% | 89.4% || 181ms 5.6% | 90.6% | src/main/c/cext/array.c~69:1516-1572
Kernel#gem_original_require | 751ms 23.4% | 0.0% || 63ms 2.0% | 0.0% | resource:/truffleruby/core/kernel.rb~226-238:7090-7470
Kernel#require | 876ms 27.3% | 0.0% || 62ms 1.9% | 0.0% | resource:/truffleruby/core/kernel.rb~243-277:7679-8982
Psych::Visitors::ToRuby#deserialize | 217ms 6.8% | 19.4% || 46ms 1.4% | 8.7% | mxbuild/truffleruby-native/jre/languages/ruby/lib/mri/psych/visitors/to_ruby.rb~46-119:1103-3562
Psych::ScalarScanner#tokenize | 149ms 4.6% | 6.0% || 45ms 1.4% | 2.2% | mxbuild/truffleruby-native/jre/languages/ruby/lib/mri/psych/scalar_scanner.rb~32-111:1029-3272
yaml_parser_fetch_next_token | 1035ms 32.2% | 70.4% || 40ms 1.2% | 12.5% | src/main/c/psych/yaml/scanner.c~860:25970-26020
yaml_parser_fetch_more_tokens | 1078ms 33.6% | 79.1% || 37ms 1.2% | 24.3% | src/main/c/psych/yaml/scanner.c~800:24651-24702
rb_funcallv_public | 329ms 10.2% | 80.2% || 34ms 1.1% | 55.9% | src/main/c/cext/call.c~26:885-968
...
~733ms / 22.8% is spent in yaml_parser_scan_plain_scalar, and it is only running compiled code 2.7% of the time, which seems suspicious.
$ jt -u native ruby --experimental-options --engine.MultiTier --engine.CompilationStatistics bench_yaml.rb
12.870033 0.140932 13.010965 ( 2.838401)
[To redirect Truffle log output to a file use one of the following options:
* '--log.file=<path>' if the option is passed using a guest language launcher.
* '-Dpolyglot.log.file=<path>' if the option is passed using the host Java launcher.
* Configure logging using the polyglot embedding API.]
[engine] Truffle runtime statistics for engine 2
Compilations : 948
Success : 913
Temporary Bailouts : 20
org.graalvm.compiler.core.common.RetryableBailoutException: Assumption invalidated while compiling code: Object: 12
org.graalvm.compiler.core.common.CancellationBailoutException: Compilation cancelled.: 8
Permanent Bailouts : 0
Failed : 0
Interrupted : 15
Invalidated : 259
null : 92
constants are unmodified Object : 54
methods are unmodified Object : 42
methods are unmodified Kernel : 34
Profiled Argument Types : 13
methods are unmodified #<Class:Object> : 6
methods are unmodified #<Class:#<Module:0x398>>: 3
frame version : 1
Queues : 1265
Dequeues : 186
Split call node : 38
Target inlined into only caller : 36
constants are unmodified Object : 31
null : 28
methods are unmodified Kernel : 16
nodeRewritingAssumption : 14
methods are unmodified Object : 7
: 4
constants are unmodified IO : 2
constants are unmodified Kernel : 1
Splits : 6150
Compilation Accuracy : 0.726793
Queue Accuracy : 0.852964
Compilation Utilization : 5.023521
Remaining Compilation Queue : 170
Time to queue : count= 968, sum= 426565, min= 0, average= 440.67, max= 3783 (milliseconds), maxTarget=BasicObject#__id__
Time waiting in queue : count= 948, sum= 278340, min= 0, average= 293.61, max= 1589 (milliseconds), maxTarget=Psych::Nodes::Scalar#initialize
Time for compilation : count= 933, sum= 19035, min= 0, average= 20.40, max= 1230 (milliseconds), maxTarget=yaml_parser_parse
Truffle Tier : count= 913, sum= 7182, min= 0, average= 7.87, max= 464 (milliseconds), maxTarget=yaml_parser_parse
Graal Tier : count= 913, sum= 8617, min= 0, average= 9.44, max= 594 (milliseconds), maxTarget=yaml_parser_parse
Code Installation : count= 913, sum= 1407, min= 0, average= 1.54, max= 176 (milliseconds), maxTarget=yaml_parser_scan_plain_scalar
Graal node count :
After Truffle Tier : count= 937, sum= 290542, min= 25, average= 310.08, max= 7299, maxTarget=yaml_parser_parse
After Graal Tier : count= 925, sum= 820131, min= 9, average= 886.63, max= 15400, maxTarget=LLVM:protected_scalar
Graal compilation result :
Code size : count= 913, sum= 2876095, min= 36, average= 3150.16, max= 62472, maxTarget=yaml_parser_scan_plain_scalar
Total frame size : count= 913, sum= 237856, min= 32, average= 260.52, max= 5024, maxTarget=yaml_parser_scan_plain_scalar
Exception handlers : count= 913, sum= 3406, min= 0, average= 3.73, max= 38, maxTarget=yaml_parser_parse
Infopoints : count= 913, sum= 27245, min= 1, average= 29.84, max= 424, maxTarget=yaml_parser_scan_plain_scalar
CALL : count= 913, sum= 27245, min= 1, average= 29.84, max= 424, maxTarget=yaml_parser_scan_plain_scalar
Marks : count= 913, sum= 9326, min= 2, average= 10.21, max= 50, maxTarget=Hash#_get_or_undefined
Data references : count= 913, sum= 16560, min= 1, average= 18.14, max= 341, maxTarget=rb_funcallv_public
yaml_parser_scan_plain_scalar is a very large function (AST 1528, IR 3481/11495, CodeSize 64169); it seems Sulong takes a long time to compile it (~500ms), it is compiled rather late, and it also deopts twice.
Here is the function after expanding macros with cc -E scanner.c.
$ jt -u native ruby --experimental-options --engine.MultiTier --engine.TraceCompilation --engine.CompilationStatistics bench_yaml.rb |& grep yaml_parser_scan_plain_scalar
[engine] opt done yaml_parser_scan_plain_scalar |AST 1528|Time 409( 130+279 )ms|Tier 1|Inlined 0Y 9N|IR 3481/11495|CodeSize 64169|Addr 0x7f48e8012000|Src scanner.c:3389
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1528|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428439/ 100|Src scanner.c:3389
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428444/ 100|Src scanner.c:3389
[engine] opt done yaml_parser_scan_plain_scalar |AST 1545|Time 573( 133+440 )ms|Tier 1|Inlined 0Y 10N|IR 3537/11741|CodeSize 65025|Addr 0x7f48e863e000|Src scanner.c:3389
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004336/ 100|Src scanner.c:3389|Reason nodeRewritingAssumption
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004402/ 100|Src scanner.c:3389
Code size : count= 867, sum= 2793123, min= 36, average= 3221.60, max= 65025, maxTarget=yaml_parser_scan_plain_scalar
Total frame size : count= 867, sum= 226976, min= 32, average= 261.79, max= 5056, maxTarget=yaml_parser_scan_plain_scalar
Infopoints : count= 867, sum= 26199, min= 1, average= 30.22, max= 428, maxTarget=yaml_parser_scan_plain_scalar
CALL : count= 867, sum= 26199, min= 1, average= 30.22, max= 428, maxTarget=yaml_parser_scan_plain_scalar
To get an idea of timing, using time.rb:
$ ruby time.rb 0.1 tool/jt.rb -u native ruby --experimental-options --engine.MultiTier --engine.TraceCompilation bench_yaml.rb |& grep -E 'yaml_parser_scan_plain_scalar|\[time\]'
[time] 0.1
[time] 0.2
[time] 0.3
[time] 0.4
[time] 0.5
[time] 0.6
[time] 0.7
[time] 0.8
[time] 0.9
[time] 1.0
[time] 1.1
[time] 1.2
[time] 1.3
[time] 1.4
[time] 1.5
[time] 1.6
[time] 1.7
[time] 1.8
[engine] opt done yaml_parser_scan_plain_scalar |AST 1528|Time 520( 127+393 )ms|Tier 1|Inlined 0Y 9N|IR 3485/11500|CodeSize 63522|Addr 0x7fd00d379000|Src scanner.c:3389
[time] 1.9
[time] 2.0
[time] 2.1
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1528|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428493/ 100|Src scanner.c:3389
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428498/ 100|Src scanner.c:3389
[time] 2.2
[time] 2.3
[time] 2.4
[time] 2.5
[time] 2.6
[engine] opt done yaml_parser_scan_plain_scalar |AST 1545|Time 514( 274+240 )ms|Tier 1|Inlined 0Y 10N|IR 3541/11746|CodeSize 64414|Addr 0x7fd00c9cc000|Src scanner.c:3389
[time] 2.7
[time] 2.8
[time] 2.9
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004318/ 100|Src scanner.c:3389|Reason nodeRewritingAssumption
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004384/ 100|Src scanner.c:3389
[time] 3.0
[time] 3.1
[time] 3.2
[time] 3.3
[time] 3.4
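time.rb itself is not shown in the thread; a minimal sketch of such a helper (assumed behavior: run the rest of the command line as a subprocess and print an elapsed-time marker at the given interval until it exits) could look like:
interval = Float(ARGV.shift)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
pid = Process.spawn(*ARGV)
ticker = Thread.new do
  loop do
    sleep interval
    puts format('[time] %.1f', Process.clock_gettime(Process::CLOCK_MONOTONIC) - start)
  end
end
Process.wait(pid)
ticker.kill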
A flamegraph using:
jt -u native profile --engine.MultiTier --metrics-profile-require=detail bench_yaml.rb
shows (total) 60% in parse, 17% in Psych::Nodes::Node#to_ruby, 23% in require psych.
(RubyGems is required but it shouldn't need to be, due to https://github.com/oracle/truffleruby/blob/3c7e1715e37d95c8f79e1e1c7228e66f35f77881/lib/mri/psych.rb#L14, fixed on master)
I'm rerunning this with:
require 'psych'
require 'benchmark'
require 'benchmark/ips'
file_name = "YAML.txt"
puts Benchmark.measure {Psych.load(File.read(file_name))}
Benchmark.ips do |x|
  x.iterations = 2
  x.report('Psych.load') { Psych.load(File.read(file_name)) }
end
puts Benchmark.measure {Psych.load(File.read(file_name))}
CRuby 2.7.3:
0.146990 0.003974 0.150964 ( 0.151312)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 6.517 (± 0.0%) i/s - 33.000 in 5.067239s
Psych.load 6.495 (± 0.0%) i/s - 33.000 in 5.084755s
0.147100 0.000000 0.147100 ( 0.147404)
Native CE:
10.239601 0.207475 10.447076 ( 2.152692)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 3.117 (± 0.0%) i/s - 16.000 in 5.143648s
Psych.load 3.198 (± 0.0%) i/s - 16.000 in 5.017373s
0.387159 0.002082 0.389241 ( 0.296447)
JVM CE:
32.300689 0.376786 32.677475 ( 3.223755)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 4.614 (± 0.0%) i/s - 23.000 in 5.037760s
Psych.load 5.133 (± 0.0%) i/s - 26.000 in 5.083813s (more iterations have the same perf)
0.490983 0.000058 0.491041 ( 0.201737)
From what I understood from @chrisseaton, the actual YAML loading in SR is much larger though, because it takes way more than 2 seconds to load the file during SR startup. Could you share the size of that file, or what differs from this one? Optimizing something that takes under 2 seconds (largely a warmup problem) vs. something that takes 20s or so (mostly a peak-performance problem) is probably quite different.
JVM CE profile (I used x.iterations = 3); rb_ary_store seems way too high:
$ ruby --cpusampler --cpusampler.Mode=roots bench_yaml.rb
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
rb_ary_store | 7126ms 29.0% | 99.6% || 6450ms 26.3% | 99.6% | src/main/c/cext/array.c~82:1943-1999
yaml_parser_scan_plain_scalar | 3203ms 13.0% | 63.8% || 1621ms 6.6% | 41.5% | src/main/c/psych/yaml/scanner.c~3389:92053-92125
Truffle::RegexpOperations.match_in_region | 1594ms 6.5% | 92.6% || 1387ms 5.6% | 92.3% | resource:/truffleruby/core/truffle/regexp_operations.rb~67-75:2236-2687
parse | 17718ms 72.1% | 80.8% || 1063ms 4.3% | 56.8% | src/main/c/psych/psych_parser.c~253:6066-6118
yaml_free | 779ms 3.2% | 99.7% || 779ms 3.2% | 99.7% | src/main/c/psych/yaml/api.c~51:751-770
Array#[]= | 676ms 2.8% | 99.7% || 676ms 2.8% | 99.7% | (core)~1:0
CExt#rb_funcallv_public | 1214ms 4.9% | 96.5% || 610ms 2.5% | 97.9% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/truffle/truffle/cext.rb~872-874:18582-18696
yaml_parser_update_buffer | 614ms 2.5% | 36.8% || 599ms 2.4% | 35.9% | src/main/c/psych/yaml/reader.c~142:3763-3825
Truffle::Interop.execute_without_conversion | 9894ms 40.3% | 99.1% || 477ms 1.9% | 97.3% | (core)~1:0
Kernel#require | 1110ms 4.5% | 0.0% || 469ms 1.9% | 0.0% | resource:/truffleruby/core/kernel.rb~243-277:7681-8984
yaml_malloc | 422ms 1.7% | 99.5% || 422ms 1.7% | 99.5% | src/main/c/psych/yaml/api.c~31:443-466
Array#<< | 353ms 1.4% | 98.3% || 353ms 1.4% | 98.3% | (core)~1:0
rb_funcallv_public | 9070ms 36.9% | 98.2% || 334ms 1.4% | 84.1% | src/main/c/cext/call.c~35:1291-1374
rb_ary_new_from_values | 7525ms 30.6% | 99.7% || 303ms 1.2% | 99.0% | src/main/c/cext/array.c~53:1347-1405
Psych::ScalarScanner#tokenize | 2865ms 11.7% | 86.4% || 287ms 1.2% | 77.7% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/mri/psych/scalar_scanner.rb~31-99:1008-3185
String#-@ | 430ms 1.8% | 89.3% || 278ms 1.1% | 87.4% | resource:/truffleruby/core/string.rb~1510-1519:39739-39946
yaml_event_delete | 599ms 2.4% | 97.8% || 269ms 1.1% | 95.5% | src/main/c/psych/yaml/api.c~980:24537-24574
Psych::Visitors::Visitor#visit | 5267ms 21.4% | 43.7% || 248ms 1.0% | 80.2% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/mri/psych/visitors/visitor.rb~15-17:264-339
All the rb_ary_store calls come from call stacks like:
rb_ary_store
rb_ary_new_from_values
rb_funcallv_public (aliased rb_funcall3)
Looks like we need to optimize rb_ary_new_from_values() (notably, rb_ary_new_capa currently ignores the requested capacity).
After optimizing rb_funcallv_public() to not call rb_ary_new_from_values but directly unwrap the passed VALUE*, JVM CE is significantly faster on peak than CRuby!
41.520555 0.445382 41.965937 ( 3.642403)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 9.779 (±10.2%) i/s - 49.000 in 5.073748s
Psych.load 9.717 (±10.3%) i/s - 48.000 in 5.018427s
Psych.load 10.038 (±10.0%) i/s - 50.000 in 5.029198s
0.502782 0.002007 0.504789 ( 0.109505)
Profile after that:
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
yaml_parser_scan_plain_scalar | 4578ms 19.0% | 66.0% || 2422ms 10.1% | 47.7% | src/main/c/psych/yaml/scanner.c~3389:92053-92125
Truffle::RegexpOperations.match_in_region | 2784ms 11.6% | 93.0% || 2421ms 10.1% | 92.4% | src/main/ruby/truffleruby/core/truffle/regexp_operations.rb~67-75:2236-2687
parse | 14135ms 58.8% | 72.3% || 1425ms 5.9% | 54.1% | src/main/c/psych/psych_parser.c~253:6066-6118
yaml_free | 1119ms 4.7% | 99.7% || 1119ms 4.7% | 99.7% | src/main/c/psych/yaml/api.c~51:751-770
yaml_parser_update_buffer | 752ms 3.1% | 37.4% || 740ms 3.1% | 36.8% | src/main/c/psych/yaml/reader.c~142:3763-3825
Truffle::Interop.execute_without_conversion | 3163ms 13.2% | 98.5% || 686ms 2.9% | 99.0% | (core)~1:0
CExt#rb_funcallv_public | 1410ms 5.9% | 96.6% || 660ms 2.7% | 97.3% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/truffle/truffle/cext.rb~872-874:18582-18696
yaml_malloc | 623ms 2.6% | 99.7% || 623ms 2.6% | 99.7% | src/main/c/psych/yaml/api.c~31:443-466
Array#<< | 535ms 2.2% | 98.3% || 535ms 2.2% | 98.3% | (core)~1:0
rb_funcallv_public | 1929ms 8.0% | 94.0% || 519ms 2.2% | 89.6% | src/main/c/cext/call.c~35:1291-1374
Improvements to rb_funcallv*() are now merged: https://github.com/oracle/truffleruby/commit/a50c8e527b92f48c8ac096e8172fbd2dcd181575
I have tried with a native libyaml instead of the bundled one, by removing --enable-bundled-libyaml in src/main/c/Makefile.
Here is a branch with that: https://github.com/oracle/truffleruby/compare/master...eregon:truffleruby:use-system-yaml
ruby -ryaml -e 'p Psych.libyaml_version' prints the libyaml version and it's 0.2.5 in both cases.
This is on Native CE.
CRuby 3.0.3:
$ for i in 1 2 3; do ruby bench_yaml.rb; done
0.133155 0.005777 0.138932 ( 0.140035)
0.137491 0.004078 0.141569 ( 0.141988)
0.138491 0.005113 0.143604 ( 0.144235)
$ ruby bench_yaml_peak.rb
0.133569 0.004986 0.138555 ( 0.138941)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 7.334 (± 0.0%) i/s - 37.000 in 5.051702s
Psych.load 7.439 (± 0.0%) i/s - 38.000 in 5.112715s
Psych.load 7.553 (± 0.0%) i/s - 38.000 in 5.034971s
0.124949 0.000001 0.124950 ( 0.125305)
TruffleRuby master 7314b6769245c5e0cce5940e27b57ec2d0116cad:
25.969647 0.788488 26.758135 ( 6.979160)
26.047104 0.750634 26.797738 ( 7.001821)
26.819784 0.818494 27.638278 ( 7.287955)
25.457242 0.881856 26.339098 ( 7.018586)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 3.777 (± 0.0%) i/s - 19.000 in 5.107471s
Psych.load 3.208 (± 0.0%) i/s - 17.000 in 5.335320s
Psych.load 3.327 (± 0.0%) i/s - 17.000 in 5.123493s
0.658149 0.021080 0.679229 ( 0.288992)
TruffleRuby master 7314b6769245c5e0cce5940e27b57ec2d0116cad without --enable-bundled-libyaml:
5.165586 0.074782 5.240368 ( 1.317167)
5.254264 0.069016 5.323280 ( 1.340259)
5.534712 0.099797 5.634509 ( 1.346993)
4.920097 0.076287 4.996384 ( 1.251470)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 4.093 (± 0.0%) i/s - 21.000 in 5.170312s
Psych.load 4.122 (± 0.0%) i/s - 21.000 in 5.149288s
Psych.load 4.193 (± 0.0%) i/s - 21.000 in 5.045918s
0.306169 0.014968 0.321137 ( 0.257934)
So using a native libyaml seems significantly faster on startup/warmup, and seems at least as fast on peak.
bench_yaml.rb:
require 'psych'
require 'benchmark'
file_name = "YAML.txt"
puts Benchmark.measure {
  Psych.load(File.read(file_name))
}
bench_yaml_peak.rb:
require 'psych'
require 'benchmark'
require 'benchmark/ips'
file_name = "YAML.txt"
puts Benchmark.measure {Psych.load(File.read(file_name))}
Benchmark.ips do |x|
  x.warmup = 5
  x.iterations = 3
  x.report('Psych.load') { Psych.load(File.read(file_name)) }
end
puts Benchmark.measure {Psych.load(File.read(file_name))}
Rerunning this on Native CE: TruffleRuby master 2031318c8309bc0874df7f5b037c3281ee64a7dd:
25.194098 0.933028 26.127126 ( 7.197953)
25.284336 0.923600 26.207936 ( 7.119082)
25.757382 0.892540 26.649922 ( 7.150440)
23.867417 0.921593 24.789010 ( 6.775556)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 5.300 (±37.7%) i/s - 25.000 in 5.082388s
Psych.load 5.813 (±34.4%) i/s - 28.000 in 5.127598s
Psych.load 5.518 (±18.1%) i/s - 28.000 in 5.223663s
0.303520 0.024993 0.328513 ( 0.262015)
TruffleRuby master 2031318c8309bc0874df7f5b037c3281ee64a7dd with __attribute__((noinline)) on yaml_parser_fetch_next_token:
This is because of an apparent LLVM issue (GR-39490, https://github.com/llvm/llvm-project/issues/57006) where the caller inlines yaml_parser_fetch_next_token, which blows up the bitcode and the number of local variables far more than it should, leading to bad warmup. The noinline attribute prevents that and makes the first iteration almost twice as fast.
14.544912 0.385808 14.930720 ( 3.754364)
14.786649 0.423228 15.209877 ( 3.900702)
12.700065 0.355561 13.055626 ( 3.286032)
14.659294 0.449365 15.108659 ( 3.892674)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 5.447 (±18.4%) i/s - 27.000 in 5.075687s
Psych.load 4.376 (±22.9%) i/s - 22.000 in 5.099205s
Psych.load 4.257 (±23.5%) i/s - 21.000 in 5.003525s
0.263249 0.017929 0.281178 ( 0.262726)
TruffleRuby master 2031318c8309bc0874df7f5b037c3281ee64a7dd without --enable-bundled-libyaml (so native system libyaml):
5.118278 0.083548 5.201826 ( 1.295927)
4.953188 0.086894 5.040082 ( 1.227787)
5.286293 0.083495 5.369788 ( 1.325290)
5.157928 0.070645 5.228573 ( 1.294500)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 4.251 (± 0.0%) i/s - 22.000 in 5.231084s
Psych.load 4.129 (±24.2%) i/s - 21.000 in 5.175657s
Psych.load 4.135 (±24.2%) i/s - 21.000 in 5.181485s
0.309581 0.020035 0.329616 ( 0.262114)
Given that upstream is removing the bundling of libyaml (https://bugs.ruby-lang.org/issues/18571, in Ruby 3.2), I think the simplest option here is just to add libyaml as a dependency and then use the system libyaml (numbers in https://github.com/oracle/truffleruby/issues/2089#issuecomment-1167329301).
Psych 5 no longer bundles libyaml, so installing psych 5 will use the system libyaml. Other than that, we can either wait until TruffleRuby updates to Ruby 3.2, or make TruffleRuby explicitly disable the bundled libyaml and add libyaml as a required dependency.
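For example, an application that wants this without waiting could depend on psych 5 in its Gemfile (the version constraint below is just illustrative):
# Gemfile
gem 'psych', '>= 5.0' # psych 5+ builds against the system libyaml instead of a bundled copy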
As a heads up, Homebrew on M1 doesn't add header files to a standard location, so the patch to use libyaml needs some extra configuration. I'm using the following for now:
diff --git a/src/main/c/Makefile b/src/main/c/Makefile
index 5fe0930245..a4e327addf 100644
--- a/src/main/c/Makefile
+++ b/src/main/c/Makefile
@@ -161,7 +161,7 @@ rbconfig-sizeof/sizeof.$(DLEXT): rbconfig-sizeof/Makefile rbconfig-sizeof/sizes.
# psych
# Always use the bundled libyaml, as we want it as bitcode and avoid extra handles
psych/Makefile: psych/extconf.rb $(EXTCONF_DEPS)
- $(Q) cd psych && $(RUBY) extconf.rb --enable-bundled-libyaml || $(IF_EXTCONF_FAIL)
+ $(Q) cd psych && $(RUBY) extconf.rb --with-libyaml-dir=/opt/homebrew/Cellar/libyaml/0.2.5 || $(IF_EXTCONF_FAIL)
psych/psych.$(DLEXT): psych/Makefile psych/*.c psych/*.h psych/yaml/*.c psych/yaml/*.h
$(Q) cd psych && $(MAKE) $(MKMF_MAKEFILE_SUBPROCESS_FLAGS)
We already handle that for OpenSSL.
While using libyaml does speed things up, our YAML processing is still quite a bit slower than MRI. I think a common and simple example is using the faker gem. Running the following script with Faker 3.1.0 (the latest release as of today):
require 'benchmark'
require 'faker'
puts Benchmark.measure { Faker::Config.locale }
I see the following timings:
TruffleRuby 22.3.0:
> ruby -v faker.rb
truffleruby 22.3.0, like ruby 3.0.3, GraalVM CE Native [aarch64-darwin]
34.324998 1.046779 35.371777 ( 12.052866)
TruffleRuby 23.0.0-dev with libyaml:
> ruby -v faker.rb
truffleruby 23.0.0-dev-71f07786, like ruby 3.1.3, GraalVM CE Native [aarch64-darwin]
11.324725 0.419365 11.744090 ( 4.936416)
MRI 3.1.3:
> ruby -v faker.rb
ruby 3.1.3p185 (2022-11-24 revision 1a6b16756e) [arm64-darwin21]
0.462839 0.031005 0.493844 ( 0.498847)
MRI 3.2.0:
> ruby -v faker.rb
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin21.5.0]
0.486561 0.026892 0.513453 ( 0.528007)
While there's a huge improvement using libyaml, it's still incredibly slow relative to MRI.
Ultimately, all this Faker line is doing is loading a bunch of YAML files for I18n, and that's the real common case I'm seeing. Faker just happens to load a bunch of localization files, but it is commonly used in unit tests for generating dummy data. Even without Faker, it's common to have localized data in Rails, and I'm seeing other tools slow down quite a bit loading localization files with I18n.
We discussed using the system libyaml on Slack.
The ABI of libyaml seems quite stable since soname 0.2, even more than zlib actually:
https://abi-laboratory.pro/index.php?view=timeline&l=zlib
https://abi-laboratory.pro/index.php?view=timeline&l=openssl
https://abi-laboratory.pro/index.php?view=timeline&l=yaml
So we might not need to recompile psych on every system, and we may be able to reuse the precompiled one from the truffleruby build (as for zlib, unlike openssl).
In terms of packages (from https://github.com/rbenv/ruby-build/wiki):
- RHEL-based: libyaml-devel (or libyaml for non-dev)
- Debian-based: libyaml-dev (or libyaml-0-2 for non-dev)
- Homebrew: libyaml

We pick the -dev versions so that if we need to recompile in the future there won't be a need to change dependencies, and also because CRuby requires them.
With https://github.com/oracle/truffleruby/pull/2899 TruffleRuby now uses the system libyaml, like CRuby 3.2+ and Psych 5+. That improves warmup of YAML.parse quite a bit, as shown above.
We need to look at the case of Faker and see what we can optimize better there.
It would be good to have peak numbers for the Faker example, to see what the difference is after warmup.
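One possible way to measure that (a sketch, assuming I18n.reload! is enough to force Faker's locale YAML files to be re-read on each iteration, since after the first call the translations stay loaded):
require 'benchmark/ips'
require 'faker'
Benchmark.ips do |x|
  x.report('Faker locale load') do
    I18n.reload! # drop the loaded translations so the YAML files are parsed again
    Faker::Config.locale
  end
end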
If I modify bootsnap to enable the YAML compile cache, it speeds up Faker::Config.locale by 50% for me.
faker: 29.136432 0.396282 29.532714 ( 4.765890)
faker: 12.030619 0.246628 12.277247 ( 2.349102)
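For reference, a minimal sketch of enabling just the YAML compile cache through bootsnap's standard Bootsnap.setup API (this is not the actual modification used above, which isn't shown):
require 'bootsnap'
Bootsnap.setup(
  cache_dir: 'tmp/cache',    # where the cached artifacts are stored
  load_path_cache: false,
  compile_cache_iseq: false, # iseq caching relies on RubyVM::InstructionSequence (CRuby-only)
  compile_cache_yaml: true   # cache parsed YAML (serialized with MessagePack)
)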
When I run Kevin's script locally I'm seeing the same results: TruffleRuby is taking 4s longer than CRuby for Faker::Config.locale.
One candidate for improvement that I see is Truffle::RegexpOperations.match_in_region.
With the cpusampler, I see Regexp.match? costing:
- ~620ms regular
- ~230ms with --use-truffle-regex=false
- ~80ms if I precompute the results, store them in a hash, and use that (sketched below)

So there is a potential saving of up to 400-500ms, which could account for as much as 12% of the 4s difference that I get.
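The precompute-into-a-hash experiment was roughly of this shape (hypothetical sketch, not the actual patch): memoize Regexp#match? results per regexp/string pair so repeated scalar scanning does not hit the regexp engine again.
MATCH_CACHE = Hash.new { |cache, (regexp, string)| cache[[regexp, string]] = regexp.match?(string) }

def cached_match?(regexp, string)
  MATCH_CACHE[[regexp, string]]
end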
> If I modify bootsnap to enable the YAML compile cache, it speeds up Faker::Config.locale by 50% for me.
So that caches the YAML with MessagePack? That's a C extension that needs some warmup, so it's surprising it's that much faster.
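For reference, the general idea of a YAML compile cache is to parse each YAML file once, store a MessagePack serialization of the result, and deserialize that on later loads. A simplified illustration of the concept (not bootsnap's actual code; the cache file naming and staleness check are just for the sketch):
require 'yaml'
require 'msgpack'

def cached_yaml_load(path, cache_path = "#{path}.msgpack")
  if File.exist?(cache_path) && File.mtime(cache_path) >= File.mtime(path)
    # Fast path: skip YAML parsing entirely and just deserialize MessagePack.
    MessagePack.unpack(File.binread(cache_path))
  else
    data = YAML.safe_load(File.read(path))
    File.binwrite(cache_path, MessagePack.pack(data))
    data
  end
end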
@djoooooe Any idea what we could improve regarding https://github.com/oracle/truffleruby/issues/2089#issuecomment-1613936967 ?