
Data read with Zlib::GzipReader (but not copy of same data) slow to parse with JSON.parse

Open billdueber opened this issue 4 years ago • 11 comments

(first issue opened here; if I'm off the mark or in the wrong spot, my apologies and please let me know)

JSON-parsing data read from a gzipped file with Zlib::GzipReader is very slow. Parsing the same byte-for-byte data read from a non-gzipped file is not slow. Parsing a copy of the data read from a gzipped file is also not slow.

This gist gives a small program that creates a bunch of arrays-of-arrays and dumps json output into two files, one plain and one gzipped. It then reads those data back in via File.read or Zlib::GzipReader, additionally making a copy of the formerly-gzipped-data.

Note that I'm not parsing every time -- I'm just getting a copy of the data (as seen in this snippet from the gist):

plain_data = File.read(PLAIN)

previously_gzipped_data = Zlib::GzipReader.open(GZIP) do |f|
   f.read
end

forced_copy = previously_gzipped_data + " "

Oj doesn't show this same issue. I know benchmarking this stuff is treacherous, but the pattern seems pretty consistent.
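For concreteness, a minimal self-contained sketch along the lines of the gist might look like this (Tempfile paths stand in for the gist's PLAIN and GZIP constants, and stdlib Benchmark stands in for the gist's exact harness):

```ruby
require "json"
require "zlib"
require "benchmark"
require "tempfile"

# Build a bunch of arrays-of-arrays and dump the JSON into two files,
# one plain and one gzipped.
data = Array.new(1000) { Array.new(20) { rand(10_000) } }
json = JSON.generate(data)

plain_file = Tempfile.new(["plain", ".json"])
gzip_file  = Tempfile.new(["data", ".json.gz"])
File.write(plain_file.path, json)
Zlib::GzipWriter.open(gzip_file.path) { |gz| gz.write(json) }

# Read the data back: plain, via GzipReader, and a forced copy of the
# GzipReader result (the trailing space is ignored by JSON.parse).
plain_data = File.read(plain_file.path)
previously_gzipped_data = Zlib::GzipReader.open(gzip_file.path) { |f| f.read }
forced_copy = previously_gzipped_data + " "

Benchmark.bm(20) do |b|
  b.report("Plain")               { JSON.parse(plain_data) }
  b.report("Previously gzipped")  { JSON.parse(previously_gzipped_data) }
  b.report("Gzipped/forced copy") { JSON.parse(forced_copy) }
end
```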


truffleruby 20.3.0, like ruby 2.6.6, GraalVM CE JVM [x86_64-darwin]
Testing with an array of 1000 20-element arrays


BEGIN STDLIB JSON
                           user     system      total        real

Plain                      6.761722   0.218166   6.979888 (  1.597190)
Previously gzipped        15.290809   1.759818  17.050627 (  5.812477)
Gzipped/forced copy        0.112019   0.052004   0.164023 (  0.069019)

Plain                      0.060666   0.040497   0.101163 (  0.061761)
Previously gzipped         4.385712   1.001380   5.387092 (  4.677956)
Gzipped/forced copy        0.135513   0.002502   0.138015 (  0.062301)

Plain                      0.126698   0.004074   0.130772 (  0.067788)
Previously gzipped         4.375435   0.766456   5.141891 (  4.427382)
Gzipped/forced copy        0.055923   0.000771   0.056694 (  0.057266)


BEGIN Oj
                           user     system      total        real
Plain                      8.310956   0.346334   8.657290 (  1.903706)
Previously gzipped         0.849619   0.016991   0.866610 (  0.225177)
Gzipped/forced copy        0.753260   0.026104   0.779364 (  0.228311)
Plain                      0.540689   0.011607   0.552296 (  0.206072)
Previously gzipped         0.417603   0.008346   0.425949 (  0.195774)
Gzipped/forced copy        0.396726   0.013893   0.410619 (  0.206867)
Plain                      0.400436   0.010877   0.411313 (  0.218068)
Previously gzipped         0.421178   0.009404   0.430582 (  0.204174)
Gzipped/forced copy        0.34264

My original data was just a file on disk, gzipped with the gzip command-line tool, so I don't think this has anything to do with Zlib::GzipWriter -- that's only in the script to make it a self-contained bug reproduction. I'm assuming that GzipReader is hanging on to the string in some way that makes things nasty?

-Bill-

billdueber avatar Dec 17 '20 19:12 billdueber

Can you try putting Truffle::Ropes.flatten_rope(data) on your data after you get it from gzip and before you use it?

chrisseaton avatar Dec 17 '20 20:12 chrisseaton

Adding previously_gzipped_data = Truffle::Ropes.flatten_rope(previously_gzipped_data) does the trick (although it doesn't help me for the moment since the calls are locked up in a gem). Hopefully it'll help point you in the right direction.

billdueber avatar Dec 17 '20 22:12 billdueber

Thank you for the great report, it looks like pure-Ruby JSON parsing is slower with the String from Zlib::GzipReader.

Truffle::Ropes.flatten_rope() seems a good hint that this is an issue with Ropes. Of course, that's just helpful for debugging; it should work fine without any explicit call (Truffle::Ropes.flatten_rope is a private debug-like API, just like everything else under Truffle::).

Maybe it has many ConcatRopes and the read accesses in JSON don't force materializing the byte[]? Oj likely uses RSTRING_PTR(), which forces flattening the Rope. @norswap Could you investigate?

@billdueber To make sure the pure-Ruby version of json, from stdlib, is used, could you show gem list | grep json on TruffleRuby? (it should be json (default: 2.3.0)) If the json gem is installed, then it will be used as a C extension and preferred to the pure-Ruby version, which is unfortunately rather confusing but that's the semantics of default gems.

eregon avatar Dec 18 '20 12:12 eregon

It's simpler than that, actually:

p Truffle::CExt.native_string?(previously_gzipped_data) # => true

forced_copy = previously_gzipped_data + " "
p Truffle::CExt.native_string?(forced_copy) # => false

json = JSON.parse(previously_gzipped_data)
p Truffle::CExt.native_string?(previously_gzipped_data) # => true

So the issue is that when we use previously_gzipped_data, we use a NativeRope instead of a ManagedRope. I would expect the String to migrate from NativeRope to ManagedRope only once, but that doesn't seem to be the case, since it's still native even after JSON.parse(previously_gzipped_data).

eregon avatar Dec 18 '20 12:12 eregon

Most likely involved, this force_encoding in JSON: https://github.com/oracle/truffleruby/blob/e020c89a8368ef2e1b996904f7becf65b375a47e/lib/json/lib/json/pure/parser.rb#L140 previously_gzipped_data.encoding is UTF-8 (maybe a bit surprising? But MRI behaves the same).

And force_encoding actually returns a NativeRope when the source is a NativeRope. The .encode above the force_encoding also returns a different String, so the original String is never modified and stays native.

Then this String is passed to StringScanner, and StringScanner#scan is used, which runs Regexps against it. Regexps use a RopeNodes.BytesNode, which for a NativeRope will allocate a byte[] and copy from native memory every time, which seems very likely to be the performance issue. Maybe BytesNode should take a String and not a Rope, so it could transition a native String to managed just once and store the ManagedRope back in the String. Also, BytesNode seems a good indication of a read-heavy workload coming, so it seems a good place to do that.
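Until this is addressed, a portable workaround is the same trick as the report's forced_copy: copy the decompressed string before handing it to JSON.parse. A sketch (the helper name is mine; per the checks above, the concatenation result is no longer native, and JSON.parse ignores trailing whitespace):

```ruby
require "json"
require "zlib"

# Read and parse gzipped JSON, forcing a copy of the decompressed
# string first. `data + " "` yields a fresh string (not backed by a
# NativeRope on TruffleRuby); on CRuby it's just a cheap copy.
def read_gzipped_json(path)
  data = Zlib::GzipReader.open(path) { |f| f.read }
  JSON.parse(data + " ")
end
```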

eregon avatar Dec 18 '20 12:12 eregon

JSON gem (almost) just as you expected -- I apparently have 2.1.0

✦ ❯ gem list | grep json
json (default: 2.1.0)

billdueber avatar Dec 18 '20 15:12 billdueber

Right, that's expected for TruffleRuby 20.3.0, thanks for confirming.

eregon avatar Dec 18 '20 16:12 eregon

Just pinging on this to see if it's on anyone's radar. I've updated the benchmarking code to explicitly assert that calling Truffle::Debug.flatten_string (née Truffle::Ropes.flatten_rope) addresses the issue (it does, as expected).

Removed Oj because it made the "x slower" comparison hard to read, but it's about 10x faster than JSON and doesn't suffer from the gzip slowdown.

tl;dr is that on my hardware gzdata is about 30x slower on the smaller (array-of-100-arrays) json string, and about 400x slower on the larger json string.
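Since Truffle::Debug.flatten_string only exists on TruffleRuby, the flatten call has to be guarded to keep the script runnable elsewhere. A sketch of such a guard (the IS_TRUFFLE definition via RUBY_ENGINE is my assumption, since the gist's definition isn't shown here):

```ruby
# RUBY_ENGINE is "truffleruby" on TruffleRuby, "ruby" on CRuby, etc.
IS_TRUFFLE = RUBY_ENGINE == "truffleruby"

gzdata = "string that came from Zlib::GzipReader"
# Flatten only on TruffleRuby: the call returns an equal string backed
# by a flat representation; on other rubies this line is a no-op.
gzdata = Truffle::Debug.flatten_string(gzdata) if IS_TRUFFLE
```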


truffleruby 22.3.0, like ruby 3.0.3, GraalVM CE Native [x86_64-darwin]
Base unit is an array of 20 integers

JSON-decode a string encoding an array of 100/1000 of those base units.

Warming up --------------------------------------
      JSON 100 plain    19.000  i/100ms
     JSON 100 gzdata     1.000  i/100ms
     JSON 1000 plain     2.000  i/100ms
    JSON 1000 gzdata     1.000  i/100ms
Calculating -------------------------------------
      JSON 100 plain    722.230  (±15.2%) i/s -      7.030k in  10.011312s
     JSON 100 gzdata     26.395  (±34.1%) i/s -    202.000  in  10.023596s
     JSON 1000 plain     73.287  (±13.6%) i/s -    716.000  in  10.003276s
    JSON 1000 gzdata      0.166  (± 0.0%) i/s -      2.000  in  12.601048s

Comparison:
      JSON 100 plain:      722.2 i/s
     JSON 1000 plain:       73.3 i/s - 9.85x  (± 0.00) slower
     JSON 100 gzdata:       26.4 i/s - 27.36x  (± 0.00) slower
    JSON 1000 gzdata:        0.2 i/s - 4355.93x  (± 0.00) slower


billdueber avatar Dec 01 '22 20:12 billdueber

Thank you, we should indeed look at this again, especially after the migration to TruffleString, which should help with this. Could you share your new benchmark file? The gist seems to contain the older benchmark (e.g. it uses oj and not benchmark/ips), but the newer benchmark seems much simpler and uses benchmark/ips, so I'd gladly use that to investigate. The older benchmark file might still be needed to generate the files, though.

eregon avatar Dec 05 '22 13:12 eregon

I threw up another gist with a less-organically-written benchmark script that strictly compares one set of plain data vs one set of gzipped data. https://gist.github.com/billdueber/87b4f100c4a2d5d1c756470e09615857

Sample output:

✦ ❯ ruby slow_gzip_bench.rb 1000 10 10

truffleruby 22.3.0, like ruby 3.0.3, GraalVM CE Native [x86_64-darwin]
Base unit is an array of 20 integers
Rows: 1000; Warmup seconds: 10; run test for 10 seconds

Warming up --------------------------------------
               plain     7.000  i/100ms
              gzdata     1.000  i/100ms
Calculating -------------------------------------
               plain     92.525  (±10.8%) i/s -    924.000  in  10.134733s
              gzdata      0.176  (± 0.0%) i/s -      2.000  in  11.384485s

Comparison:
               plain:       92.5 i/s
              gzdata:        0.2 i/s - 526.66x  (± 0.00) slower

billdueber avatar Dec 06 '22 03:12 billdueber

Thank you, I can reproduce it with truffleruby-dev. And adding:

if IS_TRUFFLE
  FILEDATA["gzdata"] = Truffle::Debug.flatten_string(FILEDATA["gzdata"])
end

makes them both the same speed.

eregon avatar Dec 06 '22 17:12 eregon