Accelerate StringCoding.hasNegatives for JDK 11, 17, StringCoding.countPositives for JDK 21+ on x86
This PR accelerates intrinsic candidates StringCoding.hasNegatives and StringCoding.countPositives on x86, the former on JDK 9-18 and the latter on JDK 19+.
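For anyone not already familiar with these methods, here is a minimal sketch of the Java-level semantics being intrinsified (simplified reference implementations written for this description, not the actual JCL source):

```java
final class StringCodingReference {
    // true iff any byte in ba[off .. off+len-1] is negative
    static boolean hasNegatives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return true;
            }
        }
        return false;
    }

    // Number of leading non-negative bytes in ba[off .. off+len-1]. The intrinsic is
    // allowed to return any smaller value when a negative byte is present (it just can't
    // overshoot the first negative), and returns len when every byte is non-negative.
    static int countPositives(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }
}
```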
This PR is incremental in a few ways.
- The acceleration currently only delivers the desired performance boost for arrays of up to 31 elements. For arrays of 32 elements and longer, the acceleration still outperforms default OpenJ9, but can no longer keep up with HotSpot. Closing that gap will require taking advantage of wider SIMD instructions.
- I have discovered a strange performance anomaly in OpenJ9 that causes it to perform significantly faster than expected for hasNegatives for arrays of 0-8 elements on JDK 19+. It performs so well for these short arrays that implementing my 'acceleration' would actually cause a performance regression there. While this anomaly is investigated, I will not be accelerating hasNegatives on JDK 19+.
In the interest of taking advantage of the performance boost as it currently stands for the 0.51 release, this PR will deliver these changes in their incremental state, with plans for another PR or two down the road to close the aforementioned gaps.
Paging @vijaysun-omr and @0xdaryl for review, and @r30shah and @dchopra001 as requested for comparison to the acceleration on Z
Looks fine to me from a quick review. I'll probably defer to @hzongaro for the review since he has much more direct awareness of this work than I do, and can easily review the inliner parts as well (that I skimmed over too).
It appears that the crash is due to a recent OMR commit and is therefore unrelated to my changes. I can build successfully with the openj9 branch of the openj9-omr repo, and can now confirm all of the performance testing checks out.
@0xdaryl @hzongaro I apparently still have a bug in here somewhere because I'm getting a failure in the JCL during the build, but I thought I'd open up for a second round of review in the meantime while I work on squashing it.
I fixed the bug! I squashed all of my review comment changes into one commit; just doing some more performance testing now.
@0xdaryl I believe I am ready for your review whenever you are!
Can you attach two logs with traceCG: one that shows your countPositives sequence and another for hasNegatives? I'd like to review the sequences when it is all laid out.
Is there a unit test that exercises the various lengths being specialized for countPositives and hasNegatives? In particular, I want to confirm that you checked each residue length (0-15) for correctness.
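Something along these lines would do it (a hypothetical sketch, not an existing test in the repo; it reaches the package-private java.lang.StringCoding.countPositives reflectively, which assumes the test JVM is started with --add-opens java.base/java.lang=ALL-UNNAMED):

```java
import java.lang.reflect.Method;

public class CountPositivesResidueCheck {
    // Scalar reference: index of the first negative byte in the window, or len if none.
    static int firstNegative(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

    public static void main(String[] args) throws Exception {
        Method m = Class.forName("java.lang.StringCoding")
                .getDeclaredMethod("countPositives", byte[].class, int.class, int.class);
        m.setAccessible(true);

        // Lengths 0-47 cover every 16-byte residue (0-15) both with and without full
        // chunks; the negative byte (if any) is placed at every index, and a few nonzero
        // offsets are tried as well. In practice the sweep would be repeated enough times
        // for the JIT-compiled path, not just the interpreter, to be exercised.
        for (int len = 0; len < 48; len++) {
            for (int negAt = -1; negAt < len; negAt++) {
                for (int off = 0; off <= 3; off++) {
                    byte[] ba = new byte[off + len];
                    if (negAt >= 0) {
                        ba[off + negAt] = (byte) 0x80;
                    }
                    int firstNeg = firstNegative(ba, off, len);
                    int actual = (int) m.invoke(null, ba, off, len);
                    boolean ok = (firstNeg == len) ? (actual == len)
                                                   : (actual >= 0 && actual <= firstNeg);
                    if (!ok) {
                        throw new AssertionError("len=" + len + " off=" + off + " negAt="
                                + negAt + " firstNeg=" + firstNeg + " actual=" + actual);
                    }
                }
            }
        }
        System.out.println("all length/offset/position combinations passed");
    }
}
```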
This intrinsification assumes the arrays are contiguous. You should prevent this optimization if the GC mode makes discontiguous arrays possible.
@0xdaryl Does that mean to disable the optimization if comp->om.canGenerateArraylets()?
It definitely would be smart to write a unit test that verifies the results in all of the different cases. I've spent so much time using the benchmarks the last few months that I've just been running those and printing the output and checking all of the cases manually, which is admittedly a bit silly. Perhaps I'll write a quick unit test that can do that work for me.
Does that mean to disable the optimization if comp->om.canGenerateArraylets()?
Yes
I included all of the tracelogs generated to cover all of the compilations: log_CG_countPositives.txt log_CG_countPositives_2.txt log_CG_hasNegatives.txt log_CG_hasNegatives_2.txt
Apologies for piling on the comments asynchronously, but another thing occurs to me.
In the JCL implementation, it is possible for the ba array to be NULL or for accesses to ba to produce an ArrayIndexOutOfBounds exception. You don't handle those cases in your intrinsic. To guard against that, you'd have to wrap the intrinsification with a check for a NULL ba and a check that the specified range falls within the bounds of ba. If either check fails, just call the helper. This is similar to what Konno-san did for the StringLatin1.inflate intrinsic.
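Roughly this shape, expressed at the Java level (illustrative only: the actual check would be generated by the JIT around the intrinsified sequence, and both callees here are hypothetical stand-ins):

```java
final class GuardedIntrinsicShape {
    // Stand-in for the ordinary JCL path; if the arguments are bad it fails the same way
    // the Java code would.
    static int helper(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

    static int guardedCountPositives(byte[] ba, int off, int len) {
        // If the array could be null or the window could fall outside the array, skip the
        // accelerated sequence entirely and let the helper do whatever the JCL code does.
        if (ba == null || off < 0 || len < 0 || off > ba.length - len) {
            return helper(ba, off, len);
        }
        return helper(ba, off, len); // stand-in for the accelerated (intrinsified) path
    }
}
```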
In the JCL implementation, it is possible for the ba array to be NULL
It's true, but I think it's not needed for known caller classes like String. Most (probably all) calls to countPositives in the String class are protected by a call to checkBoundsOffCount(offset, length, bytes.length), and accessing bytes.length already implies a null check.
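For reference, the helper I'm referring to looks roughly like this (a simplified stand-in written from memory, not copied from the JDK source):

```java
final class CallerSideBounds {
    // Reading bytes.length at the call site already implies a null check, and this guard
    // rejects any out-of-range window before countPositives is ever reached.
    static void checkBoundsOffCount(int offset, int count, int length) {
        if (offset < 0 || count < 0 || offset > length - count) {
            throw new StringIndexOutOfBoundsException(
                    "offset " + offset + ", count " + count + ", length " + length);
        }
    }
}
```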
I have implemented a simple iterative routine for handling the residual bits. Now that I have a unit test that checks all of the cases, I can confirm that it is correct. I will be doing some performance testing to see how this change affects our standing.
After doing some performance testing, the iterative approach is not going to cut it. I will try my xmm register + masking garbage bytes approach instead.
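For anyone following along, the masking idea can be pictured at the Java level roughly like this (an illustrative sketch only; the actual change emits x86 SIMD in the code generator): collect the sign bit of each byte in the 16-byte chunk into a bitmask (what PMOVMSKB does), clear the bits belonging to bytes beyond the residue, and read the answer off the mask.

```java
final class ResidueMaskSketch {
    // off points at the start of the residue; residueLen is 0-15.
    static int countPositivesResidue(byte[] ba, int off, int residueLen) {
        int signBits = 0;
        // Stand-in for a 16-byte load + PMOVMSKB: bit i holds the sign bit of byte i.
        // (The generated code fills the whole chunk, which is exactly why the masking
        // step below is needed; this scalar stand-in just stops at the array bound.)
        for (int i = 0; i < 16 && off + i < ba.length; i++) {
            if (ba[off + i] < 0) {
                signBits |= 1 << i;
            }
        }
        // Mask off the "garbage" lanes beyond the residue so they cannot affect the result.
        signBits &= (1 << residueLen) - 1;
        // No sign bits left: every residue byte is non-negative. Otherwise the lowest set
        // bit gives the exact index of the first negative byte within the residue.
        return signBits == 0 ? residueLen : Integer.numberOfTrailingZeros(signBits);
    }
}
```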
I have implemented a residue handling algorithm based on the original implementation, but with added logic to find the exact index. The code is getting pretty big and unwieldy, and I still need to do performance testing tomorrow, but it does work and it should be noticeably faster than the iterative method, which is a good place to build from at least.
Adding the exact index logic slows us down a little, but the performance is certainly still satisfactory, far outperforming OpenJ9. @hzongaro could I get a review on this new version?
@dylanjtuttle, sorry for the late request, but may I ask you to attach updated versions of the log files that you posted in an earlier comment?
@hzongaro Done! Here they are:
log_countPositives_exactIndex_1.txt log_countPositives_exactIndex_2.txt log_hasNegatives_exactIndex.txt
Updated logs for the shift-based residual processing algorithm:
log_countPositives_shift.txt log_hasNegatives_shift.txt
(logs for last week's longer but faster algorithm discussed in scrum this morning can be found in this comment)
Reverted to non-exact residual handling. Logs:
log_countPositives_nonExact.txt log_hasNegatives_nonExact.txt
I seem to have a couple of problems at the moment. For some reason, in cases with negative bytes, my implementation seems to be unable to return an index larger than 128. I checked, and this behaviour appeared in my previous commit as well, so it wasn't introduced by my latest changes. There don't appear to be any relevant instructions constraining the size of the index register, so I'm not sure what could be causing this. It still satisfies the requirement of returning a value less than or equal to the index of the first negative, so it's not strictly incorrect, but obviously we should be able to do better than that.
I'm also getting a rather baffling JCL crash relating to an invalid CEN header during the build, which usually indicates I have a bug somewhere, but my unit tests pass correctly (aside from the 128 byte bug mentioned above that still satisfies the spec) and nothing seems out of the ordinary when running the debugger. I've got a Jenkins build running right now to try and eliminate any external factors that may be affecting things in my dev environment.
I fixed the JCL crash I mentioned in my previous comment. It turns out I neglected to add a jump to the end label in the no-negatives case, which caused the offset to be subtracted from the length. In my unit tests I was always using an offset of 0 (whoops), which is why they kept passing even though there was a bug.
Thanks to some help from @hzongaro, we figured out that the max index of 128 was also due to a bug in the unit test, which means we are all clear! In that case, @0xdaryl could you give this another review when you have a moment?
See the line-endings CI failures
At @BradleyWood's suggestion, I've added SIMD encoding for PMOVMSKB to specify that we want the 16-byte version. If that encoding isn't supported on the architecture, the code falls back to a version that runs the loop with an 8-byte chunk instead of a 16-byte chunk.
@dylanjtuttle, may I ask you to squash your commits to a reasonable set, and then I will run PR testing?