jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8320649: C2: Optimize scoped values

Open rwestrel opened this issue 1 year ago • 60 comments

This change implements C2 optimizations for calls to ScopedValue.get(). Indeed, in:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

v2 can be replaced by v1 and the second call to get() can be optimized out. That's true whatever is between the 2 calls unless a new mapping for scopedValue is created in between (when that happens no optimizations is performed for the method being compiled). Hoisting a get() call out of loop for a loop invariant scopedValue should also be legal in most cases.

ScopedValue.get() is implemented in java code as a 2 step process. A cache is attached to the current thread object. If the ScopedValue object is in the cache then the result from get() is read from there. Otherwise a slow call is performed that also inserts the mapping in the cache. The cache itself is lazily allocated. One ScopedValue can be hashed to 2 different indexes in the cache. On a cache probe, both indexes are checked. As a consequence, the process of probing the cache is a multi step process (check if the cache is present, check first index, check second index if first index failed). If the cache is populated early on, then when the method that calls ScopedValue.get() is compiled, profile reports the slow path as never taken and only the read from the cache is compiled.

To perform the optimizations, I added 3 new node types to C2:

  • the pair ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for the cache probe

  • a cfg node ScopedValueGetResultNode to help locate the result of the get() call in the IR graph.

In pseudo code, once the nodes are inserted, the code of a get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  res = ..; //slow call possibly inlined. Subgraph can be arbitray complex
}
res = ScopedValueGetResult(res)

In the snippet:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

Replacing v2 by v1 is then done by starting from the ScopedValueGetResult node for the second get() and looking for a dominating ScopedValueGetResult for the same ScopedValue object. When one is found, it is used as a replacement. Eliminating the second get() call is achieved by making ScopedValueGetHitsInCache always successful if there's a dominating ScopedValueGetResult and replacing its companion ScopedValueGetLoadFromCache by the dominating ScopedValueGetResult.

Hoisting a get() out of loop is achieved by peeling one iteration of the loop. The optimization above then finds a dominating get() and removed the get() from the loop body.

An important case, I think, is when profile predicts the slow case to never taken. Then the code of get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  trap();
}
res = ScopedValueGetResult(res)

The ScopedValueGetResult doesn't help and is removed early one. The optimization process then looks for a pair of ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache that dominates the current pair of ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache and can replace them. In that case, hoisting a ScopedValue.get() can be done by predication and I added special logic in predication for that.

Adding the new nodes to the graph when a ScopedValue.get() call is encountered is done in several steps:

1- inlining of ScopedValue.get() is delayed and the call is enqueued for late inlining.

2- Once the graph is fully constructed, for each call to ScopedValue.get(), a ScopedValueGetResult is added between the result of the call and its uses.

3- the call is then inlined by parsing the ScopedValue.get() method

4- finally the subgraph that results is pattern matched and the pieces required to perform the cache probe are extracted and attached to new ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache nodes

There are a couple of reasons for steps 3 and 4:

  • As mentioned above probing the cache is a multi step process. Having only 2 nodes in a simple graph shape to represent it makes it easier to write robust optimizations

  • the subgraph for the method after parsing contains valuable pieces of information: profile data that captures which of the 2 locations in the cache is the most likely to causee a hit. Profile data is attached to the nodes.

Removal of redundant nodes is done during loop opts. The ScopedValue nodes are then expanded. That also happens during loop opts because once expansion is over, there are opportunities for further optimizations/clean up that can only happens during loop opts. During expansion, ScopedValueGetResult nodes are removed and ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache are expanded to the multi step process of probing the cache. Profile data attached to the nodes are used to assign correct frequencies/counts to the If nodes. Of the 2 locations in the cache that are tested, the one that's the most likely to see a hit (from profile data) is done first.

/cc hotspot-compiler


Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8320649: C2: Optimize scoped values (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
$ git checkout pull/16966

Update a local copy of the PR:
$ git checkout pull/16966
$ git pull https://git.openjdk.org/jdk.git pull/16966/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16966

View PR using the GUI difftool:
$ git pr show -t 16966

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16966.diff

Webrev

Link to Webrev Comment

rwestrel avatar Dec 05 '23 08:12 rwestrel