Psych.load (for YAML loading) slow
Psych.load is slower in TruffleRuby than in MRI, especially when cold. Below is code to produce a YAML file with a similar format and size, the test script (it takes ~40s to get hot for me), and the output. The actual YAML file takes ~8s to load and has ~1M characters.
|  | Cold | Hot |
|---|---|---|
| MRI | 0.198234s | 5.306 i/s |
| TruffleRuby | 6.483627s | 3.237 i/s |
letters = "qwertyuiopasdfghjklzxcvbnm"
file_name = "YAML.txt"
fruits = ["apple", "banana", "kiwi", "orange", "pear"]
File.write(file_name, "# This is a very large yml file\n")
File.write(file_name, "---\n", mode: "a")
1835.times do
  length = rand(4..15)
  object_name = ""
  length.times { object_name += letters[rand(26)] }
  modifier = letters[rand(26)] + rand(10).to_s
  File.write(file_name, object_name + "_" + modifier + ":\n", mode: "a")
  File.write(file_name, " reference: " + object_name + "_" + modifier + "\n", mode: "a")
  File.write(file_name, " name: " + object_name.capitalize + "\n", mode: "a")
  File.write(file_name, " name_again: " + object_name.capitalize + "\n", mode: "a")
  File.write(file_name, " modifier: " + modifier + "\n", mode: "a")
  File.write(file_name, " random_float: " + rand(0..0.2).to_s + "\n", mode: "a")
  File.write(file_name, " another_field: another_value\n", mode: "a")
  File.write(file_name, " list:\n", mode: "a")
  fruit = fruits[rand(5)]
  File.write(file_name, " - " + fruit + "\n", mode: "a")
  File.write(file_name, " fruit: " + fruit + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " random_strings:\n", mode: "a")
  File.write(file_name, " string1: " + object_name + "/" + object_name + "_" + modifier + "." + random_string + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " string2: " + object_name + "/" + object_name + "_" + modifier + "." + random_string + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " random_strings_again:\n", mode: "a")
  File.write(file_name, " string1: " + object_name + "/" + object_name + "_" + modifier + "_again." + random_string + "\n", mode: "a")
  random_string = ""
  40.times { random_string += letters[rand(26)] }
  File.write(file_name, " string2: " + object_name + "/" + object_name + "_" + modifier + "_again." + random_string + "\n", mode: "a")
end
require 'psych'
require 'benchmark'
require 'benchmark/ips'
file_name = "YAML.txt"
puts Benchmark.measure { Psych.load(File.read(file_name)) }
Benchmark.ips do |x|
  x.config(:time => 5, :warmup => 50)
  x.report('Psych.load') { Psych.load(File.read(file_name)) }
end
puts Benchmark.measure { Psych.load(File.read(file_name)) }
ruby 2.6.5p114 (2019-10-01 revision 67812) [x86_64-darwin19]
0.191097 0.005769 0.196866 ( 0.198234)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 5.306 (±18.8%) i/s - 27.000 in 5.140672s
0.189628 0.001635 0.191263 ( 0.191423)
truffleruby (Shopify) 20.3.0-dev-151cfcb0, like ruby 2.6.6, GraalVM CE JVM [x86_64-darwin]
25.490235 0.578915 26.069150 ( 6.483627)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 3.237 (± 0.0%) i/s - 17.000 in 5.268170s
0.529269 0.001507 0.530776 ( 0.310655)
Thanks for the report (internal issue: GR-29306).
Some numbers on my local machine, for
require 'psych'
require 'benchmark'
file_name = "YAML.txt"
puts Benchmark.measure {
  Psych.load(File.read(file_name))
}
with
$ for i in 1 2 3; do someruby bench_yaml.rb; done
reporting the median result and the full process time.
oracle/truffleruby@3c7e1715e37d95c8f79e1e1c7228e66f35f77881 oracle/graal@c616f346ae4552c04be74fe927c81f6f45fd6e6e
- MRI 2.6: 0.13s / 0.206
- Native CE: 5.76s / 6.653
- Native CE --engine.MultiTier: 2.66s / 3.528
- Native CE --engine.MultiTier --engine.Splitting=false: 2.91s / 3.637
- Native CE --engine.Mode=latency: 2.04s / 2.761
$ jt -u native ruby --experimental-options --engine.MultiTier --cpusampler --cpusampler.Mode=roots --cpusampler.SampleInternal bench_yaml.rb
12.066808 0.142028 12.208836 ( 2.827103)
----------------------------------------------------------------------------------------------------------------------------------------------------
Sampling Histogram. Recorded 3211 samples with period 1ms.
Self Time: Time spent on the top of the stack.
Total Time: Time spent somewhere on the stack.
Opt %: Percent of time spent in compiled and therefore non-interpreted code.
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
yaml_parser_scan_plain_scalar | 874ms 27.2% | 4.5% || 733ms 22.8% | 2.7% | src/main/c/psych/yaml/scanner.c~3389:92048-92120
parse | 1880ms 58.5% | 0.0% || 305ms 9.5% | 0.0% | src/main/c/psych/psych_parser.c~251:5989-6041
yaml_parser_update_buffer | 185ms 5.8% | 7.6% || 184ms 5.7% | 7.1% | src/main/c/psych/yaml/reader.c~142:3763-3825
rb_ary_store | 198ms 6.2% | 89.4% || 181ms 5.6% | 90.6% | src/main/c/cext/array.c~69:1516-1572
Kernel#gem_original_require | 751ms 23.4% | 0.0% || 63ms 2.0% | 0.0% | resource:/truffleruby/core/kernel.rb~226-238:7090-7470
Kernel#require | 876ms 27.3% | 0.0% || 62ms 1.9% | 0.0% | resource:/truffleruby/core/kernel.rb~243-277:7679-8982
Psych::Visitors::ToRuby#deserialize | 217ms 6.8% | 19.4% || 46ms 1.4% | 8.7% | mxbuild/truffleruby-native/jre/languages/ruby/lib/mri/psych/visitors/to_ruby.rb~46-119:1103-3562
Psych::ScalarScanner#tokenize | 149ms 4.6% | 6.0% || 45ms 1.4% | 2.2% | mxbuild/truffleruby-native/jre/languages/ruby/lib/mri/psych/scalar_scanner.rb~32-111:1029-3272
yaml_parser_fetch_next_token | 1035ms 32.2% | 70.4% || 40ms 1.2% | 12.5% | src/main/c/psych/yaml/scanner.c~860:25970-26020
yaml_parser_fetch_more_tokens | 1078ms 33.6% | 79.1% || 37ms 1.2% | 24.3% | src/main/c/psych/yaml/scanner.c~800:24651-24702
rb_funcallv_public | 329ms 10.2% | 80.2% || 34ms 1.1% | 55.9% | src/main/c/cext/call.c~26:885-968
...
~733ms / 22.8% is spent in yaml_parser_scan_plain_scalar, and it is only running compiled code 2.7% of the time, which seems suspicious.
$ jt -u native ruby --experimental-options --engine.MultiTier --engine.CompilationStatistics bench_yaml.rb
12.870033 0.140932 13.010965 ( 2.838401)
[To redirect Truffle log output to a file use one of the following options:
* '--log.file=<path>' if the option is passed using a guest language launcher.
* '-Dpolyglot.log.file=<path>' if the option is passed using the host Java launcher.
* Configure logging using the polyglot embedding API.]
[engine] Truffle runtime statistics for engine 2
Compilations : 948
Success : 913
Temporary Bailouts : 20
org.graalvm.compiler.core.common.RetryableBailoutException: Assumption invalidated while compiling code: Object: 12
org.graalvm.compiler.core.common.CancellationBailoutException: Compilation cancelled.: 8
Permanent Bailouts : 0
Failed : 0
Interrupted : 15
Invalidated : 259
null : 92
constants are unmodified Object : 54
methods are unmodified Object : 42
methods are unmodified Kernel : 34
Profiled Argument Types : 13
methods are unmodified #<Class:Object> : 6
methods are unmodified #<Class:#<Module:0x398>>: 3
frame version : 1
Queues : 1265
Dequeues : 186
Split call node : 38
Target inlined into only caller : 36
constants are unmodified Object : 31
null : 28
methods are unmodified Kernel : 16
nodeRewritingAssumption : 14
methods are unmodified Object : 7
: 4
constants are unmodified IO : 2
constants are unmodified Kernel : 1
Splits : 6150
Compilation Accuracy : 0.726793
Queue Accuracy : 0.852964
Compilation Utilization : 5.023521
Remaining Compilation Queue : 170
Time to queue : count= 968, sum= 426565, min= 0, average= 440.67, max= 3783 (milliseconds), maxTarget=BasicObject#__id__
Time waiting in queue : count= 948, sum= 278340, min= 0, average= 293.61, max= 1589 (milliseconds), maxTarget=Psych::Nodes::Scalar#initialize
Time for compilation : count= 933, sum= 19035, min= 0, average= 20.40, max= 1230 (milliseconds), maxTarget=yaml_parser_parse
Truffle Tier : count= 913, sum= 7182, min= 0, average= 7.87, max= 464 (milliseconds), maxTarget=yaml_parser_parse
Graal Tier : count= 913, sum= 8617, min= 0, average= 9.44, max= 594 (milliseconds), maxTarget=yaml_parser_parse
Code Installation : count= 913, sum= 1407, min= 0, average= 1.54, max= 176 (milliseconds), maxTarget=yaml_parser_scan_plain_scalar
Graal node count :
After Truffle Tier : count= 937, sum= 290542, min= 25, average= 310.08, max= 7299, maxTarget=yaml_parser_parse
After Graal Tier : count= 925, sum= 820131, min= 9, average= 886.63, max= 15400, maxTarget=LLVM:protected_scalar
Graal compilation result :
Code size : count= 913, sum= 2876095, min= 36, average= 3150.16, max= 62472, maxTarget=yaml_parser_scan_plain_scalar
Total frame size : count= 913, sum= 237856, min= 32, average= 260.52, max= 5024, maxTarget=yaml_parser_scan_plain_scalar
Exception handlers : count= 913, sum= 3406, min= 0, average= 3.73, max= 38, maxTarget=yaml_parser_parse
Infopoints : count= 913, sum= 27245, min= 1, average= 29.84, max= 424, maxTarget=yaml_parser_scan_plain_scalar
CALL : count= 913, sum= 27245, min= 1, average= 29.84, max= 424, maxTarget=yaml_parser_scan_plain_scalar
Marks : count= 913, sum= 9326, min= 2, average= 10.21, max= 50, maxTarget=Hash#_get_or_undefined
Data references : count= 913, sum= 16560, min= 1, average= 18.14, max= 341, maxTarget=rb_funcallv_public
yaml_parser_scan_plain_scalar is a very large function (AST 1528, IR 3481/11495, CodeSize 64169); it seems Sulong takes a long time to compile it (~500ms), it is compiled rather late, and it also deopts twice.
Here is the function after expanding macros with cc -E scanner.c.
$ jt -u native ruby --experimental-options --engine.MultiTier --engine.TraceCompilation --engine.CompilationStatistics bench_yaml.rb |& grep yaml_parser_scan_plain_scalar
[engine] opt done yaml_parser_scan_plain_scalar |AST 1528|Time 409( 130+279 )ms|Tier 1|Inlined 0Y 9N|IR 3481/11495|CodeSize 64169|Addr 0x7f48e8012000|Src scanner.c:3389
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1528|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428439/ 100|Src scanner.c:3389
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428444/ 100|Src scanner.c:3389
[engine] opt done yaml_parser_scan_plain_scalar |AST 1545|Time 573( 133+440 )ms|Tier 1|Inlined 0Y 10N|IR 3537/11741|CodeSize 65025|Addr 0x7f48e863e000|Src scanner.c:3389
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004336/ 100|Src scanner.c:3389|Reason nodeRewritingAssumption
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004402/ 100|Src scanner.c:3389
Code size : count= 867, sum= 2793123, min= 36, average= 3221.60, max= 65025, maxTarget=yaml_parser_scan_plain_scalar
Total frame size : count= 867, sum= 226976, min= 32, average= 261.79, max= 5056, maxTarget=yaml_parser_scan_plain_scalar
Infopoints : count= 867, sum= 26199, min= 1, average= 30.22, max= 428, maxTarget=yaml_parser_scan_plain_scalar
CALL : count= 867, sum= 26199, min= 1, average= 30.22, max= 428, maxTarget=yaml_parser_scan_plain_scalar
To get an idea of timing, using time.rb:
$ ruby time.rb 0.1 tool/jt.rb -u native ruby --experimental-options --engine.MultiTier --engine.TraceCompilation bench_yaml.rb |& grep -E 'yaml_parser_scan_plain_scalar|\[time\]'
[time] 0.1
[time] 0.2
[time] 0.3
[time] 0.4
[time] 0.5
[time] 0.6
[time] 0.7
[time] 0.8
[time] 0.9
[time] 1.0
[time] 1.1
[time] 1.2
[time] 1.3
[time] 1.4
[time] 1.5
[time] 1.6
[time] 1.7
[time] 1.8
[engine] opt done yaml_parser_scan_plain_scalar |AST 1528|Time 520( 127+393 )ms|Tier 1|Inlined 0Y 9N|IR 3485/11500|CodeSize 63522|Addr 0x7fd00d379000|Src scanner.c:3389
[time] 1.9
[time] 2.0
[time] 2.1
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1528|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428493/ 100|Src scanner.c:3389
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 21193/ 1|CallsAndLoop/Thres 428498/ 100|Src scanner.c:3389
[time] 2.2
[time] 2.3
[time] 2.4
[time] 2.5
[time] 2.6
[engine] opt done yaml_parser_scan_plain_scalar |AST 1545|Time 514( 274+240 )ms|Tier 1|Inlined 0Y 10N|IR 3541/11746|CodeSize 64414|Addr 0x7fd00c9cc000|Src scanner.c:3389
[time] 2.7
[time] 2.8
[time] 2.9
[engine] opt inv. yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004318/ 100|Src scanner.c:3389|Reason nodeRewritingAssumption
[engine] opt deopt yaml_parser_scan_plain_scalar |AST 1545|Tier First|Calls/Thres 49545/ 1|CallsAndLoop/Thres 1004384/ 100|Src scanner.c:3389
[time] 3.0
[time] 3.1
[time] 3.2
[time] 3.3
[time] 3.4
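time.rb itself is not shown in the thread; a minimal sketch of such a helper (assumed behavior: run the rest of the command line as a subprocess and print an elapsed-time marker at the given interval until it exits) could look like:
interval = Float(ARGV.shift)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
pid = Process.spawn(*ARGV)
ticker = Thread.new do
  loop do
    sleep interval
    puts format('[time] %.1f', Process.clock_gettime(Process::CLOCK_MONOTONIC) - start)
  end
end
Process.wait(pid)
ticker.kill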
A flamegraph using:
jt -u native profile --engine.MultiTier --metrics-profile-require=detail bench_yaml.rb
shows (total) 60% in parse, 17% in Psych::Nodes::Node#to_ruby, 23% in require psych.
(RubyGems is required but it shouldn't need to be, due to https://github.com/oracle/truffleruby/blob/3c7e1715e37d95c8f79e1e1c7228e66f35f77881/lib/mri/psych.rb#L14, fixed on master)
I'm rerunning this with:
require 'psych'
require 'benchmark'
require 'benchmark/ips'
file_name = "YAML.txt"
puts Benchmark.measure {Psych.load(File.read(file_name))}
Benchmark.ips do |x|
  x.iterations = 2
  x.report('Psych.load') { Psych.load(File.read(file_name)) }
end
puts Benchmark.measure {Psych.load(File.read(file_name))}
CRuby 2.7.3:
0.146990 0.003974 0.150964 ( 0.151312)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 6.517 (± 0.0%) i/s - 33.000 in 5.067239s
Psych.load 6.495 (± 0.0%) i/s - 33.000 in 5.084755s
0.147100 0.000000 0.147100 ( 0.147404)
Native CE:
10.239601 0.207475 10.447076 ( 2.152692)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 3.117 (± 0.0%) i/s - 16.000 in 5.143648s
Psych.load 3.198 (± 0.0%) i/s - 16.000 in 5.017373s
0.387159 0.002082 0.389241 ( 0.296447)
JVM CE:
32.300689 0.376786 32.677475 ( 3.223755)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 4.614 (± 0.0%) i/s - 23.000 in 5.037760s
Psych.load 5.133 (± 0.0%) i/s - 26.000 in 5.083813s (more iterations have the same perf)
0.490983 0.000058 0.491041 ( 0.201737)
From what I understood from @chrisseaton, the actual YAML loading in SR is much larger though, because it takes way more than 2 seconds to load the file during SR startup. Could you share the size of that file, or what differs from this one? Optimizing something that takes under 2 seconds (largely a warmup problem) vs. something that takes 20s or so (mostly a peak-performance problem) is probably quite different.
JVM CE profile (I used x.iterations = 3); rb_ary_store seems way too high:
$ ruby --cpusampler --cpusampler.Mode=roots bench_yaml.rb
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
rb_ary_store | 7126ms 29.0% | 99.6% || 6450ms 26.3% | 99.6% | src/main/c/cext/array.c~82:1943-1999
yaml_parser_scan_plain_scalar | 3203ms 13.0% | 63.8% || 1621ms 6.6% | 41.5% | src/main/c/psych/yaml/scanner.c~3389:92053-92125
Truffle::RegexpOperations.match_in_region | 1594ms 6.5% | 92.6% || 1387ms 5.6% | 92.3% | resource:/truffleruby/core/truffle/regexp_operations.rb~67-75:2236-2687
parse | 17718ms 72.1% | 80.8% || 1063ms 4.3% | 56.8% | src/main/c/psych/psych_parser.c~253:6066-6118
yaml_free | 779ms 3.2% | 99.7% || 779ms 3.2% | 99.7% | src/main/c/psych/yaml/api.c~51:751-770
Array#[]= | 676ms 2.8% | 99.7% || 676ms 2.8% | 99.7% | (core)~1:0
CExt#rb_funcallv_public | 1214ms 4.9% | 96.5% || 610ms 2.5% | 97.9% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/truffle/truffle/cext.rb~872-874:18582-18696
yaml_parser_update_buffer | 614ms 2.5% | 36.8% || 599ms 2.4% | 35.9% | src/main/c/psych/yaml/reader.c~142:3763-3825
Truffle::Interop.execute_without_conversion | 9894ms 40.3% | 99.1% || 477ms 1.9% | 97.3% | (core)~1:0
Kernel#require | 1110ms 4.5% | 0.0% || 469ms 1.9% | 0.0% | resource:/truffleruby/core/kernel.rb~243-277:7681-8984
yaml_malloc | 422ms 1.7% | 99.5% || 422ms 1.7% | 99.5% | src/main/c/psych/yaml/api.c~31:443-466
Array#<< | 353ms 1.4% | 98.3% || 353ms 1.4% | 98.3% | (core)~1:0
rb_funcallv_public | 9070ms 36.9% | 98.2% || 334ms 1.4% | 84.1% | src/main/c/cext/call.c~35:1291-1374
rb_ary_new_from_values | 7525ms 30.6% | 99.7% || 303ms 1.2% | 99.0% | src/main/c/cext/array.c~53:1347-1405
Psych::ScalarScanner#tokenize | 2865ms 11.7% | 86.4% || 287ms 1.2% | 77.7% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/mri/psych/scalar_scanner.rb~31-99:1008-3185
String#-@ | 430ms 1.8% | 89.3% || 278ms 1.1% | 87.4% | resource:/truffleruby/core/string.rb~1510-1519:39739-39946
yaml_event_delete | 599ms 2.4% | 97.8% || 269ms 1.1% | 95.5% | src/main/c/psych/yaml/api.c~980:24537-24574
Psych::Visitors::Visitor#visit | 5267ms 21.4% | 43.7% || 248ms 1.0% | 80.2% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/mri/psych/visitors/visitor.rb~15-17:264-339
All the rb_ary_store calls come from call stacks like:
rb_ary_store
rb_ary_new_from_values
rb_funcallv_public (aliased rb_funcall3)
Looks like we need to optimize rb_ary_new_from_values() (notably, rb_ary_new_capa currently ignores the requested capacity).
After optimizing rb_funcallv_public() to not call rb_ary_new_from_values but directly unwrap the passed VALUE*, JVM CE is significantly faster on peak than CRuby!
41.520555 0.445382 41.965937 ( 3.642403)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 9.779 (±10.2%) i/s - 49.000 in 5.073748s
Psych.load 9.717 (±10.3%) i/s - 48.000 in 5.018427s
Psych.load 10.038 (±10.0%) i/s - 50.000 in 5.029198s
0.502782 0.002007 0.504789 ( 0.109505)
Profile after that:
----------------------------------------------------------------------------------------------------------------------------------------------------
Thread: Thread[main,5,main]
Name | Total Time | Opt % || Self Time | Opt % | Location
----------------------------------------------------------------------------------------------------------------------------------------------------
yaml_parser_scan_plain_scalar | 4578ms 19.0% | 66.0% || 2422ms 10.1% | 47.7% | src/main/c/psych/yaml/scanner.c~3389:92053-92125
Truffle::RegexpOperations.match_in_region | 2784ms 11.6% | 93.0% || 2421ms 10.1% | 92.4% | src/main/ruby/truffleruby/core/truffle/regexp_operations.rb~67-75:2236-2687
parse | 14135ms 58.8% | 72.3% || 1425ms 5.9% | 54.1% | src/main/c/psych/psych_parser.c~253:6066-6118
yaml_free | 1119ms 4.7% | 99.7% || 1119ms 4.7% | 99.7% | src/main/c/psych/yaml/api.c~51:751-770
yaml_parser_update_buffer | 752ms 3.1% | 37.4% || 740ms 3.1% | 36.8% | src/main/c/psych/yaml/reader.c~142:3763-3825
Truffle::Interop.execute_without_conversion | 3163ms 13.2% | 98.5% || 686ms 2.9% | 99.0% | (core)~1:0
CExt#rb_funcallv_public | 1410ms 5.9% | 96.6% || 660ms 2.7% | 97.3% | ../graal/sdk/mxbuild/linux-amd64/GRAALVM_92F985B265_JAVA11/graalvm-92f985b265-java11-21.2.0-dev/languages/ruby/lib/truffle/truffle/cext.rb~872-874:18582-18696
yaml_malloc | 623ms 2.6% | 99.7% || 623ms 2.6% | 99.7% | src/main/c/psych/yaml/api.c~31:443-466
Array#<< | 535ms 2.2% | 98.3% || 535ms 2.2% | 98.3% | (core)~1:0
rb_funcallv_public | 1929ms 8.0% | 94.0% || 519ms 2.2% | 89.6% | src/main/c/cext/call.c~35:1291-1374
Improvements to rb_funcallv*() are now merged: https://github.com/oracle/truffleruby/commit/a50c8e527b92f48c8ac096e8172fbd2dcd181575
I have tried with a native libyaml instead of the bundled one, by removing --enable-bundled-libyaml in src/main/c/Makefile.
Here is a branch with that: https://github.com/oracle/truffleruby/compare/master...eregon:truffleruby:use-system-yaml
ruby -ryaml -e 'p Psych.libyaml_version' prints the libyaml version and it's 0.2.5 in both cases.
This is on Native CE.
CRuby 3.0.3:
$ for i in 1 2 3; do ruby bench_yaml.rb; done
0.133155 0.005777 0.138932 ( 0.140035)
0.137491 0.004078 0.141569 ( 0.141988)
0.138491 0.005113 0.143604 ( 0.144235)
$ ruby bench_yaml_peak.rb
0.133569 0.004986 0.138555 ( 0.138941)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 7.334 (± 0.0%) i/s - 37.000 in 5.051702s
Psych.load 7.439 (± 0.0%) i/s - 38.000 in 5.112715s
Psych.load 7.553 (± 0.0%) i/s - 38.000 in 5.034971s
0.124949 0.000001 0.124950 ( 0.125305)
TruffleRuby master 7314b6769245c5e0cce5940e27b57ec2d0116cad:
25.969647 0.788488 26.758135 ( 6.979160)
26.047104 0.750634 26.797738 ( 7.001821)
26.819784 0.818494 27.638278 ( 7.287955)
25.457242 0.881856 26.339098 ( 7.018586)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 3.777 (± 0.0%) i/s - 19.000 in 5.107471s
Psych.load 3.208 (± 0.0%) i/s - 17.000 in 5.335320s
Psych.load 3.327 (± 0.0%) i/s - 17.000 in 5.123493s
0.658149 0.021080 0.679229 ( 0.288992)
TruffleRuby master 7314b6769245c5e0cce5940e27b57ec2d0116cad without --enable-bundled-libyaml:
5.165586 0.074782 5.240368 ( 1.317167)
5.254264 0.069016 5.323280 ( 1.340259)
5.534712 0.099797 5.634509 ( 1.346993)
4.920097 0.076287 4.996384 ( 1.251470)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 4.093 (± 0.0%) i/s - 21.000 in 5.170312s
Psych.load 4.122 (± 0.0%) i/s - 21.000 in 5.149288s
Psych.load 4.193 (± 0.0%) i/s - 21.000 in 5.045918s
0.306169 0.014968 0.321137 ( 0.257934)
So using a native libyaml seems significantly faster on startup/warmup, and seems at least as fast on peak.
bench_yaml.rb:
require 'psych'
require 'benchmark'
file_name = "YAML.txt"
puts Benchmark.measure {
  Psych.load(File.read(file_name))
}
bench_yaml_peak.rb:
require 'psych'
require 'benchmark'
require 'benchmark/ips'
file_name = "YAML.txt"
puts Benchmark.measure {Psych.load(File.read(file_name))}
Benchmark.ips do |x|
  x.warmup = 5
  x.iterations = 3
  x.report('Psych.load') { Psych.load(File.read(file_name)) }
end
puts Benchmark.measure {Psych.load(File.read(file_name))}
Rerunning this on Native CE: TruffleRuby master 2031318c8309bc0874df7f5b037c3281ee64a7dd:
25.194098 0.933028 26.127126 ( 7.197953)
25.284336 0.923600 26.207936 ( 7.119082)
25.757382 0.892540 26.649922 ( 7.150440)
23.867417 0.921593 24.789010 ( 6.775556)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 5.300 (±37.7%) i/s - 25.000 in 5.082388s
Psych.load 5.813 (±34.4%) i/s - 28.000 in 5.127598s
Psych.load 5.518 (±18.1%) i/s - 28.000 in 5.223663s
0.303520 0.024993 0.328513 ( 0.262015)
TruffleRuby master 2031318c8309bc0874df7f5b037c3281ee64a7dd with __attribute__((noinline)) on yaml_parser_fetch_next_token:
This is because of an apparent LLVM issue (GR-39490, https://github.com/llvm/llvm-project/issues/57006) where the caller inlines yaml_parser_fetch_next_token, which blows up the bitcode and the number of local variables far more than it should, leading to bad warmup. The noinline attribute prevents that and makes the first iteration almost twice as fast.
14.544912 0.385808 14.930720 ( 3.754364)
14.786649 0.423228 15.209877 ( 3.900702)
12.700065 0.355561 13.055626 ( 3.286032)
14.659294 0.449365 15.108659 ( 3.892674)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 5.447 (±18.4%) i/s - 27.000 in 5.075687s
Psych.load 4.376 (±22.9%) i/s - 22.000 in 5.099205s
Psych.load 4.257 (±23.5%) i/s - 21.000 in 5.003525s
0.263249 0.017929 0.281178 ( 0.262726)
TruffleRuby master 2031318c8309bc0874df7f5b037c3281ee64a7dd without --enable-bundled-libyaml (so native system libyaml):
5.118278 0.083548 5.201826 ( 1.295927)
4.953188 0.086894 5.040082 ( 1.227787)
5.286293 0.083495 5.369788 ( 1.325290)
5.157928 0.070645 5.228573 ( 1.294500)
Warming up --------------------------------------
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Psych.load 1.000 i/100ms
Calculating -------------------------------------
Psych.load 4.251 (± 0.0%) i/s - 22.000 in 5.231084s
Psych.load 4.129 (±24.2%) i/s - 21.000 in 5.175657s
Psych.load 4.135 (±24.2%) i/s - 21.000 in 5.181485s
0.309581 0.020035 0.329616 ( 0.262114)
Given that upstream is removing the bundling of libyaml (https://bugs.ruby-lang.org/issues/18571, in Ruby 3.2), I think the simplest option here is just to add libyaml as a dependency and then use the system libyaml (numbers in https://github.com/oracle/truffleruby/issues/2089#issuecomment-1167329301).
Psych 5 no longer bundles libyaml, so installing psych 5 will use the system libyaml. Other than that, we can either wait until TruffleRuby updates to Ruby 3.2, or make TruffleRuby explicitly disable the bundled libyaml and add libyaml as a required dependency.
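For example, an application that wants this without waiting could depend on psych 5 in its Gemfile (the version constraint below is just illustrative):
# Gemfile
gem 'psych', '>= 5.0' # psych 5+ builds against the system libyaml instead of a bundled copy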
As a heads up, Homebrew on M1 doesn't add header files to a standard location, so the patch to use libyaml needs some extra configuration. I'm using the following for now:
diff --git a/src/main/c/Makefile b/src/main/c/Makefile
index 5fe0930245..a4e327addf 100644
--- a/src/main/c/Makefile
+++ b/src/main/c/Makefile
@@ -161,7 +161,7 @@ rbconfig-sizeof/sizeof.$(DLEXT): rbconfig-sizeof/Makefile rbconfig-sizeof/sizes.
# psych
# Always use the bundled libyaml, as we want it as bitcode and avoid extra handles
psych/Makefile: psych/extconf.rb $(EXTCONF_DEPS)
- $(Q) cd psych && $(RUBY) extconf.rb --enable-bundled-libyaml || $(IF_EXTCONF_FAIL)
+ $(Q) cd psych && $(RUBY) extconf.rb --with-libyaml-dir=/opt/homebrew/Cellar/libyaml/0.2.5 || $(IF_EXTCONF_FAIL)
psych/psych.$(DLEXT): psych/Makefile psych/*.c psych/*.h psych/yaml/*.c psych/yaml/*.h
$(Q) cd psych && $(MAKE) $(MKMF_MAKEFILE_SUBPROCESS_FLAGS)
We already handle that for OpenSSL.
While using libyaml does speed things up, our YAML processing is still quite a bit slower than MRI. I think a common and simple example is using the faker gem. Running the following script with Faker 3.1.0 (the latest release as of today):
require 'benchmark'
require 'faker'
puts Benchmark.measure { Faker::Config.locale }
I see the following timings:
TruffleRuby 22.3.0:
> ruby -v faker.rb
truffleruby 22.3.0, like ruby 3.0.3, GraalVM CE Native [aarch64-darwin]
34.324998 1.046779 35.371777 ( 12.052866)
TruffleRuby 23.0.0-dev with libyaml:
> ruby -v faker.rb
truffleruby 23.0.0-dev-71f07786, like ruby 3.1.3, GraalVM CE Native [aarch64-darwin]
11.324725 0.419365 11.744090 ( 4.936416)
MRI 3.1.3:
> ruby -v faker.rb
ruby 3.1.3p185 (2022-11-24 revision 1a6b16756e) [arm64-darwin21]
0.462839 0.031005 0.493844 ( 0.498847)
MRI 3.2.0:
> ruby -v faker.rb
ruby 3.2.0 (2022-12-25 revision a528908271) [arm64-darwin21.5.0]
0.486561 0.026892 0.513453 ( 0.528007)
While there's a huge improvement using libyaml, it's still incredibly slow relative to MRI.
Ultimately, all this Faker line is doing is loading a bunch of YAML files for I18n, and that's the real common case I'm seeing. Faker just happens to load a bunch of localization files, but it is commonly used in unit tests for generating dummy data. Even without Faker, it's common to have localized data in Rails, and I'm seeing other tools slow down quite a bit loading localization files with I18n.
We discussed using the system libyaml on Slack.
The ABI of libyaml seems quite stable since soname 0.2, even more than zlib actually:
https://abi-laboratory.pro/index.php?view=timeline&l=zlib
https://abi-laboratory.pro/index.php?view=timeline&l=openssl
https://abi-laboratory.pro/index.php?view=timeline&l=yaml
So we might not need to recompile psych on every system, and we may be able to reuse the precompiled one from the truffleruby build (as for zlib, unlike openssl).
In terms of packages (from https://github.com/rbenv/ruby-build/wiki):
- RHEL-based: libyaml-devel (or libyaml for non-dev)
- Debian-based: libyaml-dev (or libyaml-0-2 for non-dev)
- Homebrew: libyaml

We pick the -dev versions so that if we need to recompile in the future there won't be a need to change dependencies, and also because CRuby requires them.
With https://github.com/oracle/truffleruby/pull/2899 TruffleRuby now uses the system libyaml, like CRuby 3.2+ and Psych 5+. That improves warmup of YAML.parse quite a bit, as shown above.
We need to look at the case of Faker and see what we can optimize better there.
It would be good to have peak numbers for the Faker example, to see what the difference is after warmup.
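One possible way to measure that (a sketch, assuming I18n.reload! is enough to force Faker's locale YAML files to be re-read on each iteration, since after the first call the translations stay loaded):
require 'benchmark/ips'
require 'faker'
Benchmark.ips do |x|
  x.report('Faker locale load') do
    I18n.reload! # drop the loaded translations so the YAML files are parsed again
    Faker::Config.locale
  end
end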
If I modify bootsnap to enable the YAML compile cache, it speeds up Faker::Config.locale by 50% for me.
faker: 29.136432 0.396282 29.532714 ( 4.765890)
faker: 12.030619 0.246628 12.277247 ( 2.349102)
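For reference, a minimal sketch of enabling just the YAML compile cache through bootsnap's standard Bootsnap.setup API (this is not the actual modification used above, which isn't shown):
require 'bootsnap'
Bootsnap.setup(
  cache_dir: 'tmp/cache',    # where the cached artifacts are stored
  load_path_cache: false,
  compile_cache_iseq: false, # iseq caching relies on RubyVM::InstructionSequence (CRuby-only)
  compile_cache_yaml: true   # cache parsed YAML (serialized with MessagePack)
)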
When I run Kevin's script locally I'm seeing the same results: TruffleRuby is taking 4s longer than CRuby for Faker::Config.locale.
One candidate for improvement that I see is Truffle::RegexpOperations.match_in_region.
With the cpusampler, I see Regexp.match? costing:
- ~620ms regular
- ~230ms with --use-truffle-regex=false
- ~80ms if I precompute the results, store them in a hash, and use that (sketched below)

So there is a potential saving of up to 400-500ms, which could account for as much as 12% of the 4s difference that I get.
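The precompute-into-a-hash experiment was roughly of this shape (hypothetical sketch, not the actual patch): memoize Regexp#match? results per regexp/string pair so repeated scalar scanning does not hit the regexp engine again.
MATCH_CACHE = Hash.new { |cache, (regexp, string)| cache[[regexp, string]] = regexp.match?(string) }

def cached_match?(regexp, string)
  MATCH_CACHE[[regexp, string]]
end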
> If I modify bootsnap to enable the YAML compile cache, it speeds up Faker::Config.locale by 50% for me.
So that caches the YAML with MessagePack? That's a C extension that needs some warmup, so it's surprising it's that much faster.
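For reference, the general idea of a YAML compile cache is to parse each YAML file once, store a MessagePack serialization of the result, and deserialize that on later loads. A simplified illustration of the concept (not bootsnap's actual code; the cache file naming and staleness check are just for the sketch):
require 'yaml'
require 'msgpack'

def cached_yaml_load(path, cache_path = "#{path}.msgpack")
  if File.exist?(cache_path) && File.mtime(cache_path) >= File.mtime(path)
    # Fast path: skip YAML parsing entirely and just deserialize MessagePack.
    MessagePack.unpack(File.binread(cache_path))
  else
    data = YAML.safe_load(File.read(path))
    File.binwrite(cache_path, MessagePack.pack(data))
    data
  end
end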
@djoooooe Any idea what we could improve regarding https://github.com/oracle/truffleruby/issues/2089#issuecomment-1613936967 ?