8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded

Open dhanalla opened this issue 1 year ago • 13 comments

In the debug build, the assert is triggered during parsing (before Code_Gen). In the release build, however, the compilation bails out at Compile::check_node_count() during the code generation phase and completes execution without any issues.

When I commented out the assert(C->live_nodes() <= C->max_node_limit()), both the debug and release builds exhibited the same behavior: the compilation bailed out during Code_Gen after building an ideal graph with more than 80K nodes.

The proposed fix checks the live node count while building the graph for scalarization of the array elements and bails out of the compilation as soon as the count crosses the 80K limit, instead of unnecessarily building the entire graph and only bailing out during Code_Gen.
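
At a high level, the added guard is shaped like the following (a minimal sketch; the exact placement inside the scalar replacement code in macro.cpp and the failure message are illustrative, not the verbatim patch):

// Minimal sketch of the proposed guard. NodeLimitFudgeFactor keeps a
// safety margin below max_node_limit(); placement and message text are
// illustrative.
if (C->live_nodes() > C->max_node_limit() - NodeLimitFudgeFactor) {
  // Abort the compilation now instead of finishing the transformation
  // and failing much later in Code_Gen.
  C->record_method_not_compilable("out of nodes during scalarization");
  return false; // caller unwinds the scalar replacement attempt
}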


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded (Bug - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20504/head:pull/20504
$ git checkout pull/20504

Update a local copy of the PR:
$ git checkout pull/20504
$ git pull https://git.openjdk.org/jdk.git pull/20504/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20504

View PR using the GUI difftool:
$ git pr show -t 20504

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20504.diff

Webrev

Link to Webrev Comment

dhanalla avatar Aug 08 '24 00:08 dhanalla

:wave: Welcome back dhanalla! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Aug 08 '24 00:08 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Aug 08 '24 00:08 openjdk[bot]

@dhanalla The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Aug 08 '24 00:08 openjdk[bot]

Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in JDK-8256934, we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert.

For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories:

  1. We have a real bug and by fixing it, we no longer create this many nodes.
  2. It is a false positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false positive; more often than not, there is still a bug somewhere that was missed at first).
  3. We have a real bug but the fix is too hard, risky, or just not worth the complexity, especially for a real edge case (this also needs to be explained and justified well).

Note that for categories 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should implement a bailout from the current optimization only, rather than from the entire compilation, to reduce the performance impact. This was done, for example, in JDK-8256934, where a fix was too risky at that point in the release and a proper fix was delayed. The fix was to bail out from Partial Peeling when hitting a critically high number of live nodes (an estimate ensures we never hit the node limit).

You should then describe your analysis in the PR and explain your proposed solution. You should also add the reproducer as a test case to your patch.

chhagedorn avatar Aug 08 '24 06:08 chhagedorn

Thanks @chhagedorn for reviewing this PR. This scenario corresponds to case 2 above: more than 80,000 nodes are expected to be created. As an alternative solution, could we consider limiting the JVM option EliminateAllocationArraySizeLimit (in c2_globals.hpp) to a range of 0 to 1024 instead of the current 0 to max_jint, since an upper limit of max_jint is hardly practical?
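
For illustration, the declaration in c2_globals.hpp currently looks roughly like this (paraphrased, not a verbatim copy); the suggestion would only tighten the range:

// Sketch of the suggested change in src/hotspot/share/opto/c2_globals.hpp;
// the description string is paraphrased, only the range would change.
product(intx, EliminateAllocationArraySizeLimit, 64,                      \
        "Array size (number of elements) limit for scalar replacement")   \
        range(0, 1024)  /* suggested; currently range(0, max_jint) */     \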

dhanalla avatar Aug 16 '24 21:08 dhanalla

Hi @dhanalla, can you elaborate on why this is expected and not an actual bug where we unnecessarily create too many nodes?

chhagedorn avatar Aug 19 '24 10:08 chhagedorn

The test case (ReductionPerf.java) involves multiple arrays, each with 8k elements. Using the JVM option -XX:EliminateAllocationArraySizeLimit=10240 (which is larger than the 8k array size) enables scalar replacement for all of the array elements. This, in turn, can result in a graph with over 80k live nodes.

dhanalla avatar Aug 20 '24 23:08 dhanalla

I see, thanks for explaining the test behavior.

As for your alternative solution of limiting EliminateAllocationArraySizeLimit to a range of 0 to 1024:

I think that is just a mitigation which makes it less likely. You could probably still come up with a test with many more arrays of size 1024 and hit the node limit again.

I suggest first extracting a simpler, minimal test case that isolates the problem. Then you can also play around with different values for EliminateAllocationArraySizeLimit. I could imagine that you can also trigger this problem with just one huge array when you set the limit large enough. This could make it easier to understand and explain where exactly the nodes are created, what kind of nodes they are, etc. Once we know that, we can try to implement a bailout right there which is independent of how big EliminateAllocationArraySizeLimit is.

chhagedorn avatar Aug 21 '24 07:08 chhagedorn

@dhanalla This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Sep 18 '24 08:09 bridgekeeper[bot]

Adding a comment to keep the PR active.

dhanalla avatar Sep 18 '24 16:09 dhanalla

@dhanalla This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Oct 16 '24 19:10 bridgekeeper[bot]

Thanks @chhagedorn, the graph grows to more than 80K nodes because two nodes (a set of phi nodes) are added for each element in the array during the scalarization process. I minimized the test case to create a single array of size 48K. The proposed fix checks the live node count while building the graph for scalarization of the array elements and bails out of the compilation when the count crosses the 80K limit, instead of unnecessarily building the entire graph and bailing out in code_gen.
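
For reference, a reproducer shaped like that minimized test might look as follows (a hypothetical sketch, not the actual TestScalarizeBailout.java from this patch; flag values are illustrative). With 48K elements and two nodes per element, scalarization alone needs roughly 96K nodes, well past the 80K limit.

/*
 * Hypothetical sketch of the minimized reproducer. A single non-escaping
 * 48K-element array plus a control-flow merge forces scalar replacement
 * to build phis per element once the size limit allows the array.
 * Example flags: -Xbatch -XX:EliminateAllocationArraySizeLimit=65536
 */
public class ScalarizeBailoutSketch {
    static int sink;

    static int test(boolean flag) {
        int[] a = new int[48 * 1024]; // non-escaping: scalar replacement candidate
        if (flag) {
            a[0] = 42; // the merge below needs per-element phis to scalarize
        }
        return a[0];
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20_000; i++) { // warm up so C2 compiles test()
            sink = test((i & 1) == 0);
        }
    }
}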

dhanalla avatar Oct 21 '24 17:10 dhanalla

Thanks @chhagedorn, I have addressed your feedback. Could you please review the latest changes?

dhanalla avatar Nov 25 '24 21:11 dhanalla

@dhanalla This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Dec 25 '24 14:12 bridgekeeper[bot]

@dhanalla This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

bridgekeeper[bot] avatar Jan 22 '25 19:01 bridgekeeper[bot]

/open According to the latest comments on bug JDK-8315916, two more people have reported the issue.

dhanalla avatar Apr 08 '25 05:04 dhanalla

@dhanalla This pull request is now open

openjdk[bot] avatar Apr 08 '25 05:04 openjdk[bot]

@dhanalla Are you still making changes or is this ready to review? (if not ready just make it a draft ;) )

eme64 avatar Apr 08 '25 16:04 eme64

@eme64, this is ready for review.

dhanalla avatar Apr 08 '25 16:04 dhanalla

@dhanalla Do you want us to continue reviewing? It is usually good to ping people again after making changes. Otherwise, we don't know if you are still working on it and we should wait.

eme64 avatar Apr 22 '25 08:04 eme64

@eme64 yes, please review.

dhanalla avatar Apr 22 '25 14:04 dhanalla

@dhanalla I see that you have had a conversation with @chhagedorn here, where you explained more details about what exactly goes wrong. Can you please update the PR description with these details? Generally, that makes it much easier to review: the reviewers then don't need to read through the whole conversation and figure out what is stale (things you already applied) and what is still an active conversation. While you are at it, you can also update the description in JIRA.

eme64 avatar Apr 22 '25 15:04 eme64

FYI: GitHub Actions tells me that your added regression test, TestScalarizeBailout.java, is failing. Are you aware of that?

https://github.com/dhanalla/jdk/actions/runs/14364146583/job/40274556990

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/runner/work/jdk/jdk/src/hotspot/share/opto/node.cpp:78), pid=8112, tid=8129
#  assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded limit
#
# JRE version: OpenJDK Runtime Environment (24.0) (fastdebug build 24-internal-dhanalla-a8cb47d6f392524d5b96528e398ebfc15847b693)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 24-internal-dhanalla-a8cb47d6f392524d5b96528e398ebfc15847b693, compiled mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x1490e77]  Node::verify_construction()+0x1a7
#
# CreateCoredumpOnCrash turned off, no core file dumped
#
# An error report file with more information is saved as:
# /home/runner/work/jdk/jdk/build/run-test-prebuilt/test-support/jtreg_test_hotspot_jtreg_tier1_compiler_2/scratch/0/hs_err_pid8112.log
#
# Compiler replay data is saved as:
# /home/runner/work/jdk/jdk/build/run-test-prebuilt/test-support/jtreg_test_hotspot_jtreg_tier1_compiler_2/scratch/0/replay_pid8112.log

eme64 avatar Apr 22 '25 15:04 eme64

@dhanalla You now first perform the transformation, and then fail because there are too many nodes. At that point, the compilation is basically in a bad state and cannot be finished.

Is there no alternative where we could first check whether we would exceed the node limit, and then simply avoid scalarizing such very large arrays but continue with the compilation? It seems a shame to give up on escape analysis entirely if it was maybe just a single array. But maybe @vnkozlov should say more about this; I only have a very rudimentary understanding of escape analysis.
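
Schematically, I am thinking of a per-allocation check like this (the two-nodes-per-element factor comes from the analysis above; the helper name and shape are just an illustration, not a concrete patch):

// Illustration only: before scalarizing one array allocation, check a
// conservative estimate (~2 new nodes per element, per the analysis
// above) against the limit, and skip just this allocation if it does
// not fit, instead of giving up on EA for the whole method.
static bool worth_scalarizing(Compile* C, jint array_length) {
  uint estimated_new_nodes = 2 * (uint)array_length;
  return C->live_nodes() + estimated_new_nodes
         < C->max_node_limit() - NodeLimitFudgeFactor;
}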

eme64 avatar Apr 22 '25 15:04 eme64

You're right that in the current implementation we begin the scalarization process and only bail out once the live node count has already exceeded the limit. At that point the graph is indeed partially transformed, which is why we fall back to recompilation without EA to ensure a safe and consistent compilation state. Accurately predicting the number of nodes before the transformation is difficult because of the variety of types and structures involved: each element can lead to multiple nodes (e.g., phi nodes, loads/stores), and the graph can grow non-linearly depending on how the array is used. However, I agree that giving up entirely on EA just because of one large array seems like an overly conservative fallback, especially if the rest of the method would still benefit from EA.

dhanalla avatar Apr 23 '25 17:04 dhanalla

@eme64 If this answers your question, this PR is ready for review.

dhanalla avatar Apr 23 '25 17:04 dhanalla

@dhanalla I see. @chhagedorn and I quickly looked through the code, and it seems there are other bailouts that use the FudgeFactor.

It also seems that you need an unreasonably high EliminateAllocationArraySizeLimit, and so this failure should never actually happen normally, right? Or is it possible to reproduce the same bug with a lower EliminateAllocationArraySizeLimit but just more allocations? If so, it would be good if you added such test cases.

Is it possible to exceed the node limit with the default EliminateAllocationArraySizeLimit, i.e., so that we would hit the assert before your changes and bail out after your changes?

I have two worries, and maybe @vnkozlov can say something here:

  • By the time we check the condition and bail out, we may have allocated a lot of nodes, and possibly be far over the node limit. That means we already used a lot of memory and time. How bad can this get?
  • And as discussed above: we could have done EA partially, until getting close to the node limit, and then not do allocation elimination on the remaining allocations. That would be a partial benefit, which we do not have if we recompile without EA.

eme64 avatar Apr 24 '25 08:04 eme64

@eme64 Yes, that’s generally accurate. Under typical usage conditions and with the default EliminateAllocationArraySizeLimit value of 64, this assertion failure is not expected to occur. It appears that the bug is challenging to reproduce with the default values.

dhanalla avatar Apr 30 '25 22:04 dhanalla

@dhanalla Would the bug still trigger with two arrays of half the size?

eme64 avatar May 06 '25 16:05 eme64