jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8287385: Suppress superficial unstable_if traps

Open navyxliu opened this issue 3 years ago • 10 comments
trafficstars

An unstable if trap is superficial if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. If the condition changed, HotSpot has to destroy the established nmethod and compile it again. Our analysis shows that rough 20% unstable_if traps are superficial.

The algorithm which can identify and suppress superficial unstable if traps derives from its definition. A non-superficial unstable_if trap must prune some code. Parser skips parsing dead basic blocks(BBs). A trap is superficial if and only if its target BB is not dead! Or, it will be skipped(contradict from definition). As a result, we can suppress an unstable_if trap when c2 parse the target BB. This algorithm leaves alone those uncommon_traps do prune code.

For example, C2 generates an uncommon_trap for the else if cond is very likely true.

    public static int foo(boolean cond, int i) {
        Value x = new Value(0);
        Value y = new Value(1);
        Value z = new Value(i);

        if (cond) {
            i++;
        }
        return x._value + y._value + z._value + i;
    }

If we suppress this superficial unstable_if, the nmethod reduces from 608 bytes to 520 bytes, or -14.5%. Most of them come from "scopes data/pcs". It's because superficial unstable_if generates a trap like this

037     call,static  wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0')
        # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0
        # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 }
        # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 }
        # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 }
        # OopMap {off=60/0x3c}
03c     stop    # ShouldNotReachHere

Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly'

<-XX:-OptimizeUnstableIf>
Compiled method (c2)     346   17       4       SuperficialIfTrap::foo (53 bytes)
 total in heap  [0x00007f50f4970910,0x00007f50f4970b70] = 608
 relocation     [0x00007f50f4970a70,0x00007f50f4970a80] = 16
 main code      [0x00007f50f4970a80,0x00007f50f4970ad8] = 88
 stub code      [0x00007f50f4970ad8,0x00007f50f4970af0] = 24
 oops           [0x00007f50f4970af0,0x00007f50f4970b00] = 16
 metadata       [0x00007f50f4970b00,0x00007f50f4970b08] = 8
 scopes data    [0x00007f50f4970b08,0x00007f50f4970b38] = 48
 scopes pcs     [0x00007f50f4970b38,0x00007f50f4970b68] = 48
 dependencies   [0x00007f50f4970b68,0x00007f50f4970b70] = 8

<-XX:+OptimizeUnstableIf>
Compiled method (c2)     309   17       4       SuperficialIfTrap::foo (53 bytes)
 total in heap  [0x00007f4090970910,0x00007f4090970b18] = 520
 relocation     [0x00007f4090970a70,0x00007f4090970a80] = 16
 main code      [0x00007f4090970a80,0x00007f4090970ac8] = 72
 stub code      [0x00007f4090970ac8,0x00007f4090970ae0] = 24
 oops           [0x00007f4090970ae0,0x00007f4090970ae8] = 8
 scopes data    [0x00007f4090970ae8,0x00007f4090970af0] = 8
 scopes pcs     [0x00007f4090970af0,0x00007f4090970b10] = 32
 dependencies   [0x00007f4090970b10,0x00007f4090970b18] = 8

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9601/head:pull/9601
$ git checkout pull/9601

Update a local copy of the PR:
$ git checkout pull/9601
$ git pull https://git.openjdk.org/jdk pull/9601/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9601

View PR using the GUI difftool:
$ git pr show -t 9601

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9601.diff

navyxliu avatar Jul 21 '22 19:07 navyxliu

:wave: Welcome back xliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Jul 21 '22 19:07 bridgekeeper[bot]

@navyxliu The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Jul 21 '22 19:07 openjdk[bot]

Webrevs

mlbridge[bot] avatar Jul 21 '22 20:07 mlbridge[bot]

hi, @vnkozlov and @merykitty

Did you address @merykitty comment in RFE? You said: it looks like this JBS does have this downsize, I will investigate this problem

I think about this. First of all, I admit that this does impact "peak" performance. In a nutshell, that is where tracing JIT is superior than method-based compilation. But c2 is a method-based with heroic optimizations. This change makes Java execution more predictable and less unstable_if traps.

Secondly, c2 is adaptive. too_many_traps() and PerBytecodeTrapLimit both lead the final revision of nmethod to include both paths like this patch does, unless the real execution never take another path. Code like that is doubtful. I ran both SpecJVM2008 and Renaissance. I haven't seen difference in peak performance.

When it comes to constant prorogation, I think it will have 2 positive outcomes. A constant can simplify control flow and reduce the strength of arithmetic computation. In the first case, actually, it won't change too much because c2 profiles branches and we still prune non-superficial paths on the basis of possibilities even though values are not constant. For the second one, I think it should make different. To mitigate it, I came up an improvement. I can't guarantee to detect all cases where merging hinders constant folding. It's because Constant Propagation happens in optimizer but this patch is working in parsing time. I think I can detect the simple case like this in 'UnstableIfTrap::suppress'.

        int i = x;
        if (cond) {
            i = 0;
        }

When we attempt to create a phi node for i and we realize that the previous value is a constant, or ConI#0 in this case we give up suppressing. How about it?

navyxliu avatar Jul 28 '22 01:07 navyxliu

I thought about this change more. You are trading performance lost with saving of some space in CodeCache. I don't think we should do this. C2's one of main optimization is class propagation based on profiling (checkcast). It allows significantly reduce following code if profiling shows only one class was observed. I am not sure if removal of "superficial" uncommon trap will not obstruct this and other similar optimizations.

vnkozlov avatar Jul 28 '22 22:07 vnkozlov

hi, @vnkozlov

I thought about this change more. You are trading performance lost with saving of some space in CodeCache. I don't think we should do this.

Thanks you for taking look this. you're right.

I can provide some datapoints from my experiments. I do see that it can reduce some compilations. I ran Renaissance with -Xlog:deoptimization=debug and piped logs to grep "level=4.*unstable_if" | cut -c 29- | sort -h | uniq | wc -l. It counts the deoptimization events due to unstable_if. eg. [debug][deoptimization] cid=1849 level=4 java.util.concurrent.ForkJoinPool.scan(Ljava/util/concurrent/ForkJoinPool$WorkQueue;II)I trap_bci=137 unstable_if reinterpret pc=0x00007fd6f0a5b790 relative_pc=0x0000000000000430

I found that this reduces 11%(median) deoptimzation of unstable_if. Unfortunately, those events are rare and they won't make much difference. Given the fact that the JIT compilers are both multi-threaded and concurrent, the overhead of JIT is super low. Secondly, I am surprised that hotspot is very responsive. it quickly recompiles deopt'ed methods with new information and replaces the old nmethod with a new revision which avoids uncommon_trap. I hardly observe codecache savings. To put them together, I have to concede this feature doesn't make sense.

benchmark Before After diff
scrabble 18 16 -11.11%
page-rank 193 181 -6.22%
future-genetic 38 38 0.00%
akka-uct 118 99 -16.10%
movie-lens 198 185 -6.57%
scala-doku 44 44 0.00%
chi-square 125 104 -16.80%
fj-kmeans 26 19 -26.92%
rx-scrabble 35 34 -2.86%
finagle-http 173 126 -27.17%
reactors 81 72 -11.11%
dec-tree 200 170 -15.00%
scala-stm-bench7 70 66 -5.71%
naive-bayes 171 144 -15.79%
als 214 186 -13.08%
par-mnemonics 19 19 0.00%
scala-kmeans 15 13 -13.33%
philosophers 35 31 -11.43%
log-regression 184 148 -19.57%
gauss-mix 145 119 -17.93%
mnemonics 13 13 0.00%
dotty 393 338 -13.99%
finagle-chirper 268 264 -1.49%

Speaking of "class propagation", do you mean UseTypeSpeculation? I will take a closer look at this feature.

thanks, --lx

navyxliu avatar Jul 29 '22 19:07 navyxliu

When I said "performance" I meant performance of compiled code. Uncommon trap (vs Phi) allows to narrow a value/type/class after check in the following code.

j = 0;
if (i = 3) {
    j = 5;
}
return j;

with uncommon trap we generate:

if (i != 3) {
    uncommon_trap;
}
return 5;

Witch class check (checkcast) in the following code we can inline/calls/access fields only for checked class without additional checks.

I am talking about generating class check with uncommon trap with GraphKit::gen_checkcast

vnkozlov avatar Jul 29 '22 22:07 vnkozlov

Note, instead of just return 5; in my example there could be following code which use j extensively which could be replaced with constant 5.

vnkozlov avatar Jul 29 '22 22:07 vnkozlov

Note, instead of just return 5; in my example there could be following code which use j extensively which could be replaced with constant 5.

I understand your concern. I added a test about it. I added a look-ahead function for this case. unstable_if_merge returns false when it sees that parser is going to merge a constant node. That will prevent from clobbering constant propagation. Like I mentioned before, this is a workaround but not a solution because real constant propagation (PhaseCCP) happens after parser. I can't detect all cases.

GraphKit::gen_checkcast is about bytecode checkcast and aastore. The reason of uncommon_trap that it generates is either Reason_array_check or Reason_class_check. I am still trying to understand the connection between my change and it.

navyxliu avatar Aug 01 '22 18:08 navyxliu

I would need to test performance with these changes. Did you ran any benchmarks yourself?

I ran Renaissance and SpecJBB. I didn't see regression. My understanding is that the concerning code shapes are not in hot paths in those benchmarks.

You are right about this EA case! We only see one allocation in speculative compilation. If we consider both paths, we end up with a phi node to merge them. It can be addressed in this effort

Regarding to this idea, I am thinking about the debate between predicable slow vs unpredictable fast. If we leave many runtime traps, we certainly maximize performance, but that is unpredictable fast. Essentially, we treat a method more like a trace rather than a CFG. If external conditions change, the program will have a performance hit and eventually go to predicable slow. Nowadays, micro-service architectures is popular and it's also very common that dozens of threads are sharing with a nmethod across multi-cores. The runtime trap is expensive and it may trigger a cascading failure. eg. A load-balancer will assign traffics to other servers when it realizes that the response time of a server is long. That behavior itself may cause a bigger problem. On the other side, predictable slow isn't necessarily slow. We still can work on performance improvement.

This effort is extensible. Currently, we have to trigger an unstable_if trap for a counted loop whose trip count is big. it's because method data indicates that it's likely to take. if we considered the trivial block which only contain 'break', we would avoid that too.

do (int i=0; ; ++i) {
 ...
 //
 if (i < 1_000_000) goto header
 else break;
}

navyxliu avatar Aug 15 '22 18:08 navyxliu

@navyxliu This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Sep 12 '22 18:09 bridgekeeper[bot]