Accelerate StringCoding.hasNegatives for JDK 11, 17, StringCoding.countPositives for JDK 21+ on x86
This PR accelerates intrinsic candidates StringCoding.hasNegatives and StringCoding.countPositives on x86, the former on JDK 9-18 and the latter on JDK 19+.
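For anyone not already familiar with these methods, here is a minimal sketch of the Java-level semantics being intrinsified (simplified reference implementations written for this description, not the actual JCL source):

```java
final class StringCodingReference {
    // true iff any byte in ba[off .. off+len-1] is negative
    static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    // Number of leading non-negative bytes in ba[off .. off+len-1]. The intrinsic is
    // allowed to return any smaller value when a negative byte is present (it just can't
    // overshoot the first negative), and returns len when every byte is non-negative.
    static int countPositives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }
}
```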
This PR is incremental in a few ways.
- The acceleration currently only delivers the desired performance boost for arrays of up to 31 elements. For arrays of 32 elements and longer, the acceleration still outperforms default OpenJ9, but can no longer keep up with HotSpot. Closing that gap will require taking advantage of wider SIMD instructions.
- I have discovered a strange performance anomaly in OpenJ9 that causes it to perform significantly faster than expected for hasNegatives for arrays of 0-8 elements on JDK 19+. It performs so well for these short arrays that implementing my 'acceleration' would actually cause a performance regression there. While this anomaly is investigated, I will not be accelerating hasNegatives on JDK 19+.
In the interest of taking advantage of the performance boost as it currently stands for the 0.51 release, this PR will deliver these changes in their incremental state, with plans for another PR or two down the road to close the aforementioned gaps.
Paging @vijaysun-omr and @0xdaryl for review, and @r30shah and @dchopra001 as requested for comparison to the acceleration on Z
Looks fine to me from a quick review. I'll probably defer to @hzongaro for the review since he has much more direct awareness of this work than I do, and can easily review the inliner parts as well (that I skimmed over too).
It appears that the crash is due to a recent OMR commit and is therefore unrelated to my changes. I can build successfully with the openj9 branch of the openj9-omr repo, and can now confirm all of the performance testing checks out.
@0xdaryl @hzongaro I apparently still have a bug in here somewhere because I'm getting a failure in the JCL during the build, but I thought I'd open up for a second round of review in the meantime while I work on squashing it.
I fixed the bug! I squashed all of my review comment changes into one commit; just doing some more performance testing now.
@0xdaryl I believe I am ready for your review whenever you are!
Can you attach two logs with traceCG: one that shows your countPositives sequence and another for hasNegatives? I'd like to review the sequences when it is all laid out.
Is there a unit test that exercises the various lengths being specialized for countPositives and hasNegatives? In particular, I want to confirm that you checked each residue length (0-15) for correctness.
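Something along these lines would do it (a hypothetical sketch, not an existing test in the repo; it reaches the package-private java.lang.StringCoding.countPositives reflectively, which assumes the test JVM is started with --add-opens java.base/java.lang=ALL-UNNAMED):

```java
import java.lang.reflect.Method;

public class CountPositivesResidueCheck {
    // Scalar reference: index of the first negative byte in the window, or len if none.
    static int firstNegative(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

    public static void main(String[] args) throws Exception {
        Method m = Class.forName("java.lang.StringCoding")
                .getDeclaredMethod("countPositives", byte[].class, int.class, int.class);
        m.setAccessible(true);

        // Lengths 0-47 cover every 16-byte residue (0-15) both with and without full
        // chunks; the negative byte (if any) is placed at every index, and a few nonzero
        // offsets are tried as well. In practice the sweep would be repeated enough times
        // for the JIT-compiled path, not just the interpreter, to be exercised.
        for (int len = 0; len < 48; len++) {
            for (int negAt = -1; negAt < len; negAt++) {
                for (int off = 0; off <= 3; off++) {
                    byte[] ba = new byte[off + len];
                    if (negAt >= 0) {
                        ba[off + negAt] = (byte) 0x80;
                    }
                    int firstNeg = firstNegative(ba, off, len);
                    int actual = (int) m.invoke(null, ba, off, len);
                    boolean ok = (firstNeg == len) ? (actual == len)
                                                   : (actual >= 0 && actual <= firstNeg);
                    if (!ok) {
                        throw new AssertionError("len=" + len + " off=" + off + " negAt="
                                + negAt + " firstNeg=" + firstNeg + " actual=" + actual);
                    }
                }
            }
        }
        System.out.println("all length/offset/position combinations passed");
    }
}
```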
This intrinsification assumes the arrays are contiguous. You should prevent this optimization if the GC mode makes discontiguous arrays possible.
@0xdaryl Does that mean to disable the optimization if comp->om.canGenerateArraylets()?
It definitely would be smart to write a unit test that verifies the results in all of the different cases. I've spent so much time using the benchmarks the last few months that I've just been running those and printing the output and checking all of the cases manually, which is admittedly a bit silly. Perhaps I'll write a quick unit test that can do that work for me.
Does that mean to disable the optimization if comp->om.canGenerateArraylets()?
Yes
I included all of the tracelogs generated to cover all of the compilations: log_CG_countPositives.txt log_CG_countPositives_2.txt log_CG_hasNegatives.txt log_CG_hasNegatives_2.txt
Apologies for piling on the comments asynchronously, but another thing occurs to me.
In the JCL implementation, it is possible for the ba array to be NULL or for accesses to ba to produce an ArrayIndexOutOfBounds exception. You don't handle those cases in your intrinsic. To guard against that, you'd have to wrap the intrinsification with a check for a NULL ba and a check that the specified range falls within the bounds of ba. If either check fails, just call the helper. This is similar to what Konno-san did for the StringLatin1.inflate intrinsic.
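Roughly this shape, expressed at the Java level (illustrative only: the actual check would be generated by the JIT around the intrinsified sequence, and both callees here are hypothetical stand-ins):

```java
final class GuardedIntrinsicShape {
    // Stand-in for the ordinary JCL path; if the arguments are bad it fails the same way
    // the Java code would.
    static int helper(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

    static int guardedCountPositives(byte[] ba, int off, int len) {
        // If the array could be null or the window could fall outside the array, skip the
        // accelerated sequence entirely and let the helper do whatever the JCL code does.
        if (ba == null || off < 0 || len < 0 || off > ba.length - len) {
            return helper(ba, off, len);
        }
        return helper(ba, off, len); // stand-in for the accelerated (intrinsified) path
    }
}
```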
In the JCL implementation, it is possible for the ba array to be NULL
It's true, but I think it's not needed for known caller classes like String. Most (probably all) calls to countPositives in the String class are protected by a call to checkBoundsOffCount(offset, length, bytes.length), and accessing bytes.length already implies a null check.
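For reference, the helper I'm referring to looks roughly like this (a simplified stand-in written from memory, not copied from the JDK source):

```java
final class CallerSideBounds {
    // Reading bytes.length at the call site already implies a null check, and this guard
    // rejects any out-of-range window before countPositives is ever reached.
    static void checkBoundsOffCount(int offset, int count, int length) {
        if (offset < 0 || count < 0 || offset > length - count) {
            throw new StringIndexOutOfBoundsException(
                    "offset " + offset + ", count " + count + ", length " + length);
        }
    }
}
```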
I have implemented a simple iterative routine for handling the residual bits. Now that I have a unit test that checks all of the cases, I can confirm that it is correct. I will be doing some performance testing to see how this change affects our standing.
After doing some performance testing, the iterative approach is not going to cut it. I will try my xmm register + masking garbage bytes approach instead.
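For anyone following along, the masking idea can be pictured at the Java level roughly like this (an illustrative sketch only; the actual change emits x86 SIMD in the code generator): collect the sign bit of each byte in the 16-byte chunk into a bitmask (what PMOVMSKB does), clear the bits belonging to bytes beyond the residue, and read the answer off the mask.

```java
final class ResidueMaskSketch {
    // off points at the start of the residue; residueLen is 0-15.
    static int countPositivesResidue(byte[] ba, int off, int residueLen) {
        int signBits = 0;
        // Stand-in for a 16-byte load + PMOVMSKB: bit i holds the sign bit of byte i.
        // (The generated code fills the whole chunk, which is exactly why the masking
        // step below is needed; this scalar stand-in just stops at the array bound.)
        for (int i = 0; i < 16 && off + i < ba.length; i++) {
            if (ba[off + i] < 0) {
                signBits |= 1 << i;
            }
        }
        // Mask off the "garbage" lanes beyond the residue so they cannot affect the result.
        signBits &= (1 << residueLen) - 1;
        // No sign bits left: every residue byte is non-negative. Otherwise the lowest set
        // bit gives the exact index of the first negative byte within the residue.
        return signBits == 0 ? residueLen : Integer.numberOfTrailingZeros(signBits);
    }
}
```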
I have implemented a residue handling algorithm based on the original implementation, but with added logic to find the exact index. The code is getting pretty big and unwieldy, and I still need to do performance testing tomorrow, but it does work and it should be noticeably faster than the iterative method, which is a good place to build from at least.
Adding the exact index logic slows us down a little, but the performance is certainly still satisfactory, far outperforming OpenJ9. @hzongaro could I get a review on this new version?
@dylanjtuttle, sorry for the late request, but may I ask you to attach updated versions of the log files that you posted in an earlier comment?
@hzongaro Done! Here they are:
log_countPositives_exactIndex_1.txt log_countPositives_exactIndex_2.txt log_hasNegatives_exactIndex.txt
Updated logs for the shift-based residual processing algorithm:
log_countPositives_shift.txt log_hasNegatives_shift.txt
(logs for last week's longer but faster algorithm discussed in scrum this morning can be found in this comment)
Reverted to non-exact residual handling. Logs:
log_countPositives_nonExact.txt log_hasNegatives_nonExact.txt
I seem to have a couple of problems at the moment. For some reason, in cases with negative bytes, my implementation seems to be unable to return an index larger than 128. I checked, and this behaviour appeared in my previous commit as well, so it wasn't introduced by my latest changes. There don't appear to be any relevant instructions constraining the size of the index register, so I'm not sure what could be causing this. It still satisfies the requirement of returning a value less than or equal to the index of the first negative, so it's not strictly incorrect, but obviously we should be able to do better than that.
I'm also getting a rather baffling JCL crash relating to an invalid CEN header during the build, which usually indicates I have a bug somewhere, but my unit tests pass correctly (aside from the 128 byte bug mentioned above that still satisfies the spec) and nothing seems out of the ordinary when running the debugger. I've got a Jenkins build running right now to try and eliminate any external factors that may be affecting things in my dev environment.
I fixed the JCL crash I mentioned in my previous comment. It turns out I neglected to add a jump to the end label in the no-negatives case, which caused the offset to be subtracted from the length. In my unit tests I was always using an offset of 0 (whoops), which is why they kept passing even though there was a bug.
Thanks to some help from @hzongaro, we figured out that the max index of 128 was also due to a bug in the unit test, which means we are all clear! In that case, @0xdaryl could you give this another review when you have a moment?
See the line-endings CI failures
At @BradleyWood's suggestion, I've added SIMD encoding for PMOVMSKB to specify that we want the 16-byte version. If that encoding isn't supported on the architecture, the code falls back to a version that runs the loop with an 8-byte chunk instead of a 16-byte chunk.
@dylanjtuttle, may I ask you to squash your commits to a reasonable set, and then I will run PR testing?