
8332689: RISC-V: Use load instead of trampolines


Hi all, please consider!

Today we emit a JAL directly to dest if dest is in reach (+/- 1 MB). With a very small application, or over a very short run, this gives us fast patchable calls. But any normal application running for longer will grow the code size and increase code churn/fragmentation, so whether or not you get hot fast calls comes down to luck.

To stay patchable and still reach the whole code cache, we also emit a trampoline stub which the JAL can be pointed to. This is the common case for a patchable call.

Code stream:
JAL <trampo>
Stubs:
AUIPC
LD
JALR
<DEST>
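
For reference, a minimal sketch (my illustration, not HotSpot code) of the reach check that decides whether a direct JAL is possible: JAL encodes a 21-bit signed, 2-byte-aligned immediate, which gives the +/- 1 MB range mentioned above. The helper name is hypothetical.

#include <cstdint>

// Hypothetical helper: true if 'target' can be reached from 'call_site'
// with a single JAL, i.e. the offset is even and fits in a signed
// 21-bit immediate: [-2^20, 2^20 - 2].
static bool jal_reachable(uint64_t call_site, uint64_t target) {
  int64_t offset = (int64_t)target - (int64_t)call_site;
  return (offset & 1) == 0 &&
         offset >= -(int64_t(1) << 20) &&
         offset <   (int64_t(1) << 20);
}

// If jal_reachable(...) holds we could emit "JAL dest" directly; otherwise
// we emit "JAL <trampoline>" plus the AUIPC/LD/JALR stub shown above.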

On some CPUs L1D and L1I can't contain the same cache line, which means the trampoline stub can bounce L1I -> L1D -> L1I, which is expensive. Even without that problem, a call to a jump is not the fastest path. Loading the address avoids the pitfalls of CMODX (concurrent modification and execution of instructions).

This patch proposes to solve the problems with trampolines: we take a small penalty in the naive case where JAL can reach dest directly, and instead emit by default:

Code stream:
AUIPC
LD
JALR
Stubs:
<DEST>
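
The key property of this layout is that the call target is now data: the AUIPC+LD reads a 64-bit destination slot from the stub area, so re-binding the call is a single aligned 8-byte store rather than rewriting instruction bytes. A minimal sketch under that assumption (names are hypothetical, not the actual HotSpot code):

#include <atomic>
#include <cstdint>

struct PatchableCall {
  // Emitted in the code stream: AUIPC + LD + JALR, where the LD reads
  // 'destination' through a pc-relative offset formed by the AUIPC.
  std::atomic<uint64_t> destination;   // 8-byte slot in the stub area
};

// Re-point the call, e.g. after (re)compilation of the callee. Concurrent
// executions of the LD see either the old or the new value, never a torn
// one, as long as the slot is naturally aligned.
static void patch_call(PatchableCall* call, uint64_t new_entry) {
  call->destination.store(new_entry, std::memory_order_release);
}

No instruction bytes change at the call site when the target is re-bound, which is why the CMODX concerns above do not apply here.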

An experimental option for turning trampolines back on exists.

It should be possible to enhance this with the WIP Zjid extension by changing the JALR to a JAL and nop'ing out the AUIPC+LD when the destination is in reach, and vice versa. The current Zjid proposal forces the instruction fetcher to fetch instructions in order, meaning we avoid a lot of the issues that Arm has.
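
To make that idea concrete, here is a rough sketch (my illustration only, not part of this patch) of the two encodings such patching would toggle between: NOPs over the AUIPC+LD plus a direct JAL when in reach, versus the load-based sequence when not.

#include <cstdint>

const uint32_t kNop = 0x00000013;          // addi x0, x0, 0

// Encode "jal ra, offset" (J-type); 'offset' is relative to the JAL
// itself, must be even and fit in a signed 21-bit immediate.
static uint32_t encode_jal_ra(int32_t offset) {
  uint32_t imm20    = (offset >> 20) & 0x1;
  uint32_t imm10_1  = (offset >> 1)  & 0x3ff;
  uint32_t imm11    = (offset >> 11) & 0x1;
  uint32_t imm19_12 = (offset >> 12) & 0xff;
  uint32_t rd = 1;                         // x1 / ra, as the call's JALR uses
  return (imm20 << 31) | (imm10_1 << 21) | (imm11 << 20) |
         (imm19_12 << 12) | (rd << 7) | 0x6f;
}

// In-reach patch: write kNop over the AUIPC and LD slots and
// encode_jal_ra(dest - jal_pc) over the JALR; the reverse patch restores
// the AUIPC/LD/JALR sequence when the destination moves out of reach.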

Numbers from a VisionFive 2 (VF2); I have run the benchmarks a few times and they are always overall in favor of this patch. The first number is the baseline, the second is this patch, and the value after "=" is their ratio (lower is better):

fop                                        (msec)    2239       |  2128       =  0.950424
h2                                         (msec)    18660      |  16594      =  0.889282
jython                                     (msec)    22022      |  21925      =  0.995595
luindex                                    (msec)    2866       |  2842       =  0.991626
lusearch                                   (msec)    4108       |  4311       =  1.04942
lusearch-fix                               (msec)    4406       |  4116       =  0.934181
pmd                                        (msec)    5976       |  5897       =  0.98678
jython                                     (msec)    22022      |  21925      =  0.995595
Avg:                                       0.974112                              
fop(xcomp)                                 (msec)    2721       |  2714       =  0.997427
h2(xcomp)                                  (msec)    37719      |  38004      =  1.00756
jython(xcomp)                              (msec)    28563      |  29470      =  1.03175
luindex(xcomp)                             (msec)    5303       |  5512       =  1.03941
lusearch(xcomp)                            (msec)    6702       |  6271       =  0.935691
lusearch-fix(xcomp)                        (msec)    6721       |  6217       =  0.925011
pmd(xcomp)                                 (msec)    6835       |  6587       =  0.963716
jython(xcomp)                              (msec)    28563      |  29470      =  1.03175
Avg:                                       0.99154                               
o.r.actors.JmhAkkaUct.run                  (ms/op)   8585.440   |  7548.347   =  0.879203
o.r.actors.JmhReactors.run                 (ms/op)   65004.694  |  64448.824  =  0.991449
o.r.jdk.concurrent.JmhFjKmeans.run         (ms/op)   47751.653  |  45747.490  =  0.958029
o.r.jdk.concurrent.JmhFutureGenetic.run    (ms/op)   12083.628  |  11427.650  =  0.945713
o.r.jdk.streams.JmhMnemonics.run           (ms/op)   32691.025  |  31002.088  =  0.948336
o.r.jdk.streams.JmhParMnemonics.run        (ms/op)   27500.431  |  23747.117  =  0.863518
o.r.jdk.streams.JmhScrabble.run            (ms/op)   3688.182   |  3528.943   =  0.956825
o.r.neo4j.JmhNeo4jAnalytics.run            (ms/op)   20153.371  |  21704.731  =  1.07698
o.r.rx.JmhRxScrabble.run                   (ms/op)   1197.749   |  1160.465   =  0.968872
o.r.scala.dotty.JmhDotty.run               (ms/op)   18385.552  |  18561.341  =  1.00956
o.r.scala.sat.JmhScalaDoku.run             (ms/op)   25243.887  |  22112.289  =  0.875946
o.r.scala.stdlib.JmhScalaKmeans.run        (ms/op)   2610.509   |  2498.539   =  0.957108
o.r.scala.stm.JmhPhilosophers.run          (ms/op)   5875.997   |  6101.689   =  1.03841
o.r.scala.stm.JmhScalaStmBench7.run        (ms/op)   8723.122   |  8760.115   =  1.00424
o.r.twitter.finagle.JmhFinagleChirper.run  (ms/op)   21209.541  |  21732.213  =  1.02464
o.r.twitter.finagle.JmhFinagleHttp.run     (ms/op)   20782.221  |  20390.960  =  0.981173
Avg:                                       0.9675            

It has been through a couple of rounds of tier1-tier3 testing, but I need to re-run the tests after the latest merge.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8332689: RISC-V: Use load instead of trampolines (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19453/head:pull/19453
$ git checkout pull/19453

Update a local copy of the PR:
$ git checkout pull/19453
$ git pull https://git.openjdk.org/jdk.git pull/19453/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19453

View PR using the GUI difftool:
$ git pr show -t 19453

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19453.diff

Webrev

Link to Webrev Comment

robehn, May 29 '24 12:05