jdk
jdk copied to clipboard
8320649: C2: Optimize scoped values
This change implements C2 optimizations for calls to ScopedValue.get(). Indeed, in:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
v2
can be replaced by v1
and the second call to get()
can be
optimized out. That's true whatever is between the 2 calls unless a
new mapping for scopedValue
is created in between (when that happens
no optimizations is performed for the method being compiled). Hoisting
a get()
call out of loop for a loop invariant scopedValue
should
also be legal in most cases.
ScopedValue.get()
is implemented in java code as a 2 step process. A
cache is attached to the current thread object. If the ScopedValue
object is in the cache then the result from get()
is read from
there. Otherwise a slow call is performed that also inserts the
mapping in the cache. The cache itself is lazily allocated. One
ScopedValue
can be hashed to 2 different indexes in the cache. On a
cache probe, both indexes are checked. As a consequence, the process
of probing the cache is a multi step process (check if the cache is
present, check first index, check second index if first index
failed). If the cache is populated early on, then when the method that
calls ScopedValue.get()
is compiled, profile reports the slow path
as never taken and only the read from the cache is compiled.
To perform the optimizations, I added 3 new node types to C2:
-
the pair ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for the cache probe
-
a cfg node ScopedValueGetResultNode to help locate the result of the
get()
call in the IR graph.
In pseudo code, once the nodes are inserted, the code of a get()
is:
hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
res = ..; //slow call possibly inlined. Subgraph can be arbitray complex
}
res = ScopedValueGetResult(res)
In the snippet:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
Replacing v2
by v1
is then done by starting from the
ScopedValueGetResult
node for the second get()
and looking for a
dominating ScopedValueGetResult
for the same ScopedValue
object. When one is found, it is used as a replacement. Eliminating
the second get()
call is achieved by making
ScopedValueGetHitsInCache
always successful if there's a dominating
ScopedValueGetResult
and replacing its companion
ScopedValueGetLoadFromCache
by the dominating
ScopedValueGetResult
.
Hoisting a get()
out of loop is achieved by peeling one iteration of
the loop. The optimization above then finds a dominating get()
and
removed the get()
from the loop body.
An important case, I think, is when profile predicts the slow case to
never taken. Then the code of get()
is:
hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
trap();
}
res = ScopedValueGetResult(res)
The ScopedValueGetResult
doesn't help and is removed early one. The
optimization process then looks for a pair of
ScopedValueGetHitsInCache
/ScopedValueGetLoadFromCache
that
dominates the current pair of
ScopedValueGetHitsInCache
/ScopedValueGetLoadFromCache
and can
replace them. In that case, hoisting a ScopedValue.get()
can be
done by predication and I added special logic in predication for that.
Adding the new nodes to the graph when a ScopedValue.get()
call is
encountered is done in several steps:
1- inlining of ScopedValue.get()
is delayed and the call is enqueued
for late inlining.
2- Once the graph is fully constructed, for each call to
ScopedValue.get()
, a ScopedValueGetResult
is added between the
result of the call and its uses.
3- the call is then inlined by parsing the ScopedValue.get()
method
4- finally the subgraph that results is pattern matched and the pieces
required to perform the cache probe are extracted and attached to new
ScopedValueGetHitsInCache
/ScopedValueGetLoadFromCache
nodes
There are a couple of reasons for steps 3 and 4:
-
As mentioned above probing the cache is a multi step process. Having only 2 nodes in a simple graph shape to represent it makes it easier to write robust optimizations
-
the subgraph for the method after parsing contains valuable pieces of information: profile data that captures which of the 2 locations in the cache is the most likely to causee a hit. Profile data is attached to the nodes.
Removal of redundant nodes is done during loop opts. The ScopedValue
nodes are then expanded. That also happens during loop opts because
once expansion is over, there are opportunities for further
optimizations/clean up that can only happens during loop opts. During
expansion, ScopedValueGetResult
nodes are removed and
ScopedValueGetHitsInCache
/ScopedValueGetLoadFromCache
are expanded
to the multi step process of probing the cache. Profile data attached
to the nodes are used to assign correct frequencies/counts to the If
nodes. Of the 2 locations in the cache that are tested, the one that's
the most likely to see a hit (from profile data) is done first.
/cc hotspot-compiler
Progress
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
- [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)
Issue
- JDK-8320649: C2: Optimize scoped values (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
$ git checkout pull/16966
Update a local copy of the PR:
$ git checkout pull/16966
$ git pull https://git.openjdk.org/jdk.git pull/16966/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16966
View PR using the GUI difftool:
$ git pr show -t 16966
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16966.diff