jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8295159: DSO created with -ffast-math breaks Java floating-point arithmetic

Open theRealAph opened this issue 2 years ago • 24 comments

A bug in GCC causes shared libraries linked with -ffast-math to disable denormal arithmetic. This breaks Java's floating-point semantics.

The bug is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522

One solution is to save and restore the floating-point control word around System.loadLibrary(). This isn't perfect, because some shared library might load another shared library at runtime, but it's a lot better than what we do now.

However, this fix is not complete. dlopen() is called from many places in the JDK. I guess the best thing to do is find and wrap them all. I'd like to hear people's opinions.


Progress

  • [x] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8295159: DSO created with -ffast-math breaks Java floating-point arithmetic

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10661/head:pull/10661
$ git checkout pull/10661

Update a local copy of the PR:
$ git checkout pull/10661
$ git pull https://git.openjdk.org/jdk pull/10661/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10661

View PR using the GUI difftool:
$ git pr show -t 10661

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10661.diff

theRealAph avatar Oct 11 '22 16:10 theRealAph

:wave: Welcome back aph! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Oct 11 '22 16:10 bridgekeeper[bot]

@theRealAph The following labels will be automatically applied to this pull request:

  • build
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Oct 11 '22 16:10 openjdk[bot]

Also: this patch is Linux-only. I'll ask for help from build experts to make the tests GCC-only; it's not clear to me how.

theRealAph avatar Oct 11 '22 16:10 theRealAph

It seems strange to me that the native library part is here:

test/hotspot/jtreg/runtime/jni/TestDenormalFloat/libfast-math.c

and the two test files are here:

test/hotspot/jtreg/compiler/floatingpoint/TestDenormalDouble.java test/hotspot/jtreg/compiler/floatingpoint/TestDenormalFloat.java

And the two tests don't have "@run main/native"... Maybe I'm missing something about what you're trying to test here.

dcubed-ojdk avatar Oct 11 '22 18:10 dcubed-ojdk

This appears to be a 10 year old bug in gcc. Have we ever had any issues reported because of this? Inserting a workaround now seems rather late. Any FP-using native code can potentially break Java's FP semantics if it messes with the FPU control word (ref the old Borland compilers).

Shouldn't any workaround only be needed for the internals of System.loadLibrary as other JDK usages of dlopen should know what they are opening and that they are libraries that don't have this problem?

dholmes-ora avatar Oct 12 '22 00:10 dholmes-ora

Also: this patch is Linux-only. I'll ask for help from build experts to make the tests GCC-only; it's not clear to me how.

@theRealAph We currently have a more or less 1-to-1 relationship between OS and compiler. From a portability perspective this is not ideal, but it is also hard to keep some kind of theoretical barrier when it never is tested.

Are you worried that this is a problem that occurs with gcc on other platforms as well, or that it occurs on linux with other compilers than gcc?

We do not support gcc on macos, windows or aix, which leaves linux as the only gcc platform in the mainline (ports in separate repos will have to handle this themselves). We do support using clang on linux instead of gcc (but do not test it regularly). But my impression here is that this is more of a gcc-problem than a linux problem..?

magicus avatar Oct 12 '22 11:10 magicus

@theRealAph This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8295159: DSO created with -ffast-math breaks Java floating-point arithmetic

Reviewed-by: ihse, dholmes, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 8 new commits pushed to the master branch:

  • 1a21c1a783d64ca0930c358c06a43975f96ffac6: 8318736: com/sun/jdi/JdwpOnThrowTest.java failed with "transport error 202: bind failed: Address already in use"
  • 81db1721d4fac954003441ca2c3c29b0998d310d: 8318955: Add ReleaseIntArrayElements in Java_sun_awt_X11_XlibWrapper_SetBitmapShape XlbWrapper.c to early return
  • be01caf30de5a70684fab25b02f2c6ab346a60cb: 8319323: FFM: Harmonize the @throws tags in the javadocs
  • ec79ab4b3cd89c2c0a9c8550cd62433bd6d45266: 8319268: Build failure with GCC8.3.1 after 8313643
  • c788160f8acea7b58b54ad857b601bb7ffb53f8e: 8296240: Augment discussion of test tiers in doc/testing.md
  • ffaecd4aa23ca15e765784858da82b632c72dfc2: 8315364: Assert thread state invariant for JFR stack trace capture
  • 3b65b8797a0798474947d38d3facd17b3e89c602: 8316028: Update FreeType to 2.13.2
  • 9dc40ba48ec15d5775537c4c2224ee5bb563629a: 8319195: Move most tier 1 vector API regression tests to tier 3

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Oct 12 '22 11:10 openjdk[bot]

It seems strange to me that the native library part is here:

test/hotspot/jtreg/runtime/jni/TestDenormalFloat/libfast-math.c

and the two test files are here:

test/hotspot/jtreg/compiler/floatingpoint/TestDenormalDouble.java test/hotspot/jtreg/compiler/floatingpoint/TestDenormalFloat.java

I already moved it.

And the two tests don't have "@run main/native"... Maybe I'm missing something about what you're trying to test here.

Will fix.

theRealAph avatar Oct 12 '22 13:10 theRealAph

This appears to be a 10 year old bug in gcc. Have we ever had any issues reported because of this?

That's what is so insidious about messing with the floating-point control word: all that happens is people get slightly inaccurate results, which they don't notice.

Inserting a workaround now seems rather late. Any FP-using native code can potentially break Java's FP semantics if it messes with the FPU control word (ref the old Borland compilers).

Indeed, but there seem to be a number of libraries out there compiled with -ffast-math. This is the kind of thing people do with highly performance-critical stuff like ray tracing, etc. But we don't want it to break Java.

Shouldn't any workaround only be needed for the internals of System.loadLibrary as other JDK usages of dlopen should know what they are opening and that they are libraries that don't have this problem?

Maybe, but sometimes we use system- or user-provided librries.

theRealAph avatar Oct 12 '22 13:10 theRealAph

Also: this patch is Linux-only. I'll ask for help from build experts to make the tests GCC-only; it's not clear to me how.

@theRealAph We currently have a more or less 1-to-1 relationship between OS and compiler. From a portability perspective this is not ideal, but it is also hard to keep some kind of theoretical barrier when it never is tested.

I see.

Are you worried that this is a problem that occurs with gcc on other platforms as well, or that it occurs on linux with other compilers than gcc?

The former.

We do not support gcc on macos, windows or aix, which leaves linux as the only gcc platform in the mainline (ports in separate repos will have to handle this themselves). We do support using clang on linux instead of gcc (but do not test it regularly). But my impression here is that this is more of a gcc-problem than a linux problem..?

Probably, yes. I'm happy just to fix it on Linux.

theRealAph avatar Oct 12 '22 13:10 theRealAph

We do not support gcc on macos, windows or aix, which leaves linux as the only gcc platform in the mainline (ports in separate repos will have to handle this themselves). We do support using clang on linux instead of gcc (but do not test it regularly). But my impression here is that this is more of a gcc-problem than a linux problem..?

Indeed. The problem is that we have no idea what compiler was used to compile the libraries we call, so I'd like to be conservative. The increase in runtime due to saving and restoring the floating-point environment is very low, given that dlopen() is a pretty expensive operation.

theRealAph avatar Oct 12 '22 14:10 theRealAph

So you want to provide this dlopen wrapper for other platforms as well?

magicus avatar Oct 12 '22 14:10 magicus

So you want to provide this dlopen wrapper for other platforms as well?

I'm not sure. It sounds to me like a Linux-only patch would do it, but I think BSD uses GCC sometimes. (Open/FreeBSD is downstream of OpenJDK, and I'd love to get it into mainline, but getting BSD contributors to sign OCA is hard.)

theRealAph avatar Oct 12 '22 14:10 theRealAph

This appears to be a 10 year old bug in gcc. Have we ever had any issues reported because of this?

That's what is so insidious about messing with the floating-point control word: all that happens is people get slightly inaccurate results, which they don't notice.

e.g. https://github.com/gevent/gevent/pull/1820 is for Python (SciPy) and adversely affects time to converge. I have no idea if other Java users have similar problems, but it's a bug that I'd like to go away.

theRealAph avatar Oct 12 '22 15:10 theRealAph

Isn't it an illustration of a more general problem we have with native code where it can mess with FP environment at any time?

We already have similar problems with MXCSR register and provide verification logic (part of -Xcheck:jni) to catch modifications and support conditional restoration of MXCSR register on x86_64. x86_32 validates x87 control word when -Xcheck:jni is enabled.

Should we do something similar here instead?

iwanowww avatar Oct 12 '22 17:10 iwanowww

For upcalls on non-Windows platforms, we also save MXCSR and restore it after the call, and load a set standard value for the Java code that's about to be executed. (I'm not sure why this isn't done on Windows, tbh)

Relevant code for JNI is here:

  • Downcalls: https://github.com/openjdk/jdk/blob/1961e81e02e707cd0c8241aa3af6ddabf7668589/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L5002
  • Upcalls: https://github.com/openjdk/jdk/blob/1961e81e02e707cd0c8241aa3af6ddabf7668589/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp#L262

FPCSR is only handled on non _LP64 it looks like.

I agree with Vladimir that this seems like a general problem of foreign code potentially messing with control bits (in theory, foreign code could violate its ABI in other ways as well). It seems that both major C/C++ x64 ABIs (1, 2, 3) treat the control bits as non-volatile, so the callee should preserve them. This is in theory a choice of a particular ABI, but I think in general we can assume foreign code does not modify the control bits. Though, we never know for sure of course, and I suppose this is where RestoreMXCSROnJNICalls comes in. There's no equivalent flag for FPCSR atm AFAICS, so the answer there seems to be "just don't do it".

Following that logic: from our perspective dlopen violates its ABI in certain cases. Preserving the control bits across calls to dlopen seems like a pragmatic solution. I'm not sure how important it is to have an opt-in for the current (broken) behavior...

JornVernee avatar Oct 12 '22 20:10 JornVernee

Isn't it an illustration of a more general problem we have with native code where it can mess with FP environment at any time?

We already have similar problems with MXCSR register and provide verification logic (part of -Xcheck:jni) to catch modifications and support conditional restoration of MXCSR register on x86_64. x86_32 validates x87 control word when -Xcheck:jni is enabled.

Should we do something similar here instead?

I tend to agree. As others have observed, a dlopen call (or something with same nefarious behavior) could also happen inside JNI code. But I think the interesting (and perhaps surprising) part here is that, from the perspective of the developer, no native code has executed - only a library has been loaded (via System::loadLibrary).

Note also that this specific problem is triggered by dlopen itself, because certain libraries might have some "bad" (from the perspective of JVM) initialization code. But since we're talking about JNI, JNI_OnLoad is another potential source of problem, as its native code is executed as soon as the library is loaded (and that, too, can leave the JVM in a bad state).

mcimadamore avatar Oct 12 '22 20:10 mcimadamore

Note also that this specific problem is triggered by dlopen itself, because certain libraries might have some "bad" (from the perspective of JVM) initialization code.

... and it leads to another question whether the JVM itself is amenable to such problems. Do we need to sanitize the environment when returning from calls into the JVM?

iwanowww avatar Oct 12 '22 21:10 iwanowww

Isn't it an illustration of a more general problem we have with native code where it can mess with FP environment at any time?

Yes.

We already have similar problems with MXCSR register and provide verification logic (part of -Xcheck:jni) to catch modifications and support conditional restoration of MXCSR register on x86_64. x86_32 validates x87 control word when -Xcheck:jni is enabled.

Should we do something similar here instead?

The problem is that this bug is very insidious: the user probably doesn't know that there's anything wrong, and almost certainly has no idea that it's anything to do with JNI.

Saving and restoring the floating-point environment across dlopen() is a compromise between adding extra code at JNI callouts, which can be expensive, and shifting the burden to the user.

theRealAph avatar Oct 13 '22 07:10 theRealAph

Following that logic: from our perspective dlopen violates its ABI in certain cases. Preserving the control bits across calls to dlopen seems like a pragmatic solution. I'm not sure how important it is to have an opt-in for the current (broken) behavior...

Not at all, I'd have thought.

theRealAph avatar Oct 13 '22 07:10 theRealAph

Here's another possibility: save the FP environment over up- and down-calls, and restore it if the control bits have changed. We can do that in a portable way, overriding it in a more performant way with arch-specific code.

theRealAph avatar Oct 13 '22 08:10 theRealAph

The problem is that this bug is very insidious: the user probably doesn't know that there's anything wrong, and almost certainly has no idea that it's anything to do with JNI.

I'm still trying to grasp why the current problem is something different from what we experienced before.

Some platforms (x86_32 and AArch32) provide a way to restore FP environment, but it is turned off by default. Why x86_64 case is different, so it requires the logic to be turned on by default? Does the same reasoning apply to those platforms as well and we want to have AlwaysRestoreFPU turned on by default?

$ grep -r AlwaysRestoreFPU src/hotspot/
src/hotspot//cpu/x86/templateInterpreterGenerator_x86.cpp:  if (AlwaysRestoreFPU) {
src/hotspot//cpu/x86/sharedRuntime_x86_32.cpp:  if (AlwaysRestoreFPU) {
src/hotspot//cpu/arm/sharedRuntime_arm.cpp:  if (AlwaysRestoreFPU) {
src/hotspot//cpu/arm/templateInterpreterGenerator_arm.cpp:  if (AlwaysRestoreFPU) {
src/hotspot//share/runtime/globals.hpp:  product(bool, AlwaysRestoreFPU, false,                                    \

iwanowww avatar Oct 13 '22 17:10 iwanowww

Mailing list message from Andrew Haley on build-dev:

On 10/13/22 19:13, Vladimir Ivanov wrote:

On Thu, 13 Oct 2022 07:47:20 GMT, Andrew Haley <aph at openjdk.org> wrote:

The problem is that this bug is very insidious: the user probably doesn't know that there's anything wrong, and almost certainly has no idea that it's anything to do with JNI.

I'm still trying to grasp why the current problem is something different from what we experienced before.

I don't know anything about what we experienced before. In any case, I'm not sure that precedent applies here: the problem is that there are shared libraries in use which break IEEE semantics. This is an insidious bug, and my proposed fix costs almost nothing.

Some platforms (x86_32 and AArch32) provide a way to restore FP environment, but it is turned off by default. Why x86_64 case is different, so it requires the logic to be turned on by default? Does the same reasoning apply to those platforms as well and we want to have `AlwaysRestoreFPU` turned on by default?

My patch is not in any way x86_64-specific. At present it is Linux-specific, but I have my doubts that it should be.

x86-32 has a very specific problem: its default floating-point precision is AFAIK extended float, which does not meet Java specs. Also, restoring the FPU after every JNI call is expensive.

-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

mlbridge[bot] avatar Oct 14 '22 12:10 mlbridge[bot]

FTR I did an exercise in source code archeology and here are my findings.

The origin of AlwaysRestoreFPU-related code (both in x86-32 and arm-specific code) can be traced back to JDK-6487931 and JDK-6550813. Though both issues manifested as JVM crashes, the underlying problem was identified as FPU control word corruption by native code.

The regression test does trigger the corruption from a JNI call (using either _FPU_SETCW [1] or _controlfp [2]), but it was deliberately limited to x86-32.

Based on that, I conclude that the problem with FP environment corruption by native code was known before, but an opt-in solution was chosen.

(Frankly speaking, I don't know why an opt-in solution was considered sufficient. Maybe because it was erroneously believed it can only lead to a crash at runtime?)

Considering we are now aware about insidious nature of the problem (silent result corruption), I'm inclined to propose either to turn on AlwaysRestoreFPU by default (and provide implementation on platforms where it is missed) or, at least, catch FPU control word corruption on native->java transitions and crash the JVM advertising -XX:+AlwaysRestoreFPU as a solution.

[1] https://man7.org/linux/man-pages/man3/__setfpucw.3.html [2] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/control87-controlfp-control87-2?view=msvc-170

iwanowww avatar Oct 17 '22 19:10 iwanowww

Correction: I was under the false impression the FP-state was shared at the process level but I've been told it is per-thread. So the scope of the problem is much narrower.

The pre-existing "fixes" in this area were an opt-in band-aid while people migrated away from using problematic native libraries, or else got them fixed.

dholmes-ora avatar Oct 17 '22 22:10 dholmes-ora

For the benefit of the mailing list my previous comment has been edited/corrected:

Correction: I was under the false impression the FP-state was shared at the process level but I've been told it is per-thread. So the scope of the problem is much narrower.

dholmes-ora avatar Oct 18 '22 01:10 dholmes-ora

Some additional info to consider: the library modifies only MXCSR register and -XX:+RestoreMXCSROnJNICalls makes the test to pass.

$ objdump -D libfast-math.so
...
0000000000000560 <set_fast_math>:
 560:   f3 0f 1e fa             endbr64
 564:   0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
 569:   81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
 570:   00
 571:   0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
 576:   c3                      ret

iwanowww avatar Oct 18 '22 17:10 iwanowww

On non-windows x86-64 the JVM saves/restores (and adjusts if needed) MXCSR inside call stub as mandated by the ABI: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp#L262-L273

x87 is not used in x86-64-specific code anymore, so x87-specific part of FP environment (FPSR and FPCS registers) is not relevant to JVM anymore there.

iwanowww avatar Oct 18 '22 17:10 iwanowww

So, IMO the discussion boils down to how we want a misbehaving native library to be handled by the JVM.

The ABI lists MXCSR as a callee-saved register, so there's nothing wrong on JVM side from that perspective.

From a quality of implementation perspective though, JVM could do a better job at catching broken libraries. Of course, there are numerous ways for a native code to break the JVM, but in this particular case, it looks trivial to catch the problem. The question is how much overhead we can afford to introduce for that. Whether it should be an opt-in solution (e.g., -Xcheck:jni or -XX:+AlwaysRestoreFPU/-XX:+RestoreMXCSROnJNICalls), opt-out (unconditionally recover or report an error when FP env is corrupted, optionally providing a way to turn it off), or apply a band-aid fix just to fix the immediate problem with GCC's fast-math mode.

I'd like to dissuade from going with just a band-aid fix (we already went through that multiple times with different level of success) and try to improve the overall experience JVM provides. It feels like just pushing the problem further away and it would be very unfortunate to repeat the very same exercise in the future.

My preferred solution would be to automatically detect the corruption and restore MXCSR register across a JNI call, but if it turns out to be too expensive, JVM could check for MXCSR register corruption after every JNI call and crash issuing a message with diagnostic details about where corruption happened (info about library and entry) offering to turn on -XX:+AlwaysRestoreFPU/-XX:+RestoreMXCSROnJNICalls as a stop-the-gap solution. It would send users a clear signal there's something wrong with their code/environment, but still giving them an option to workaround the problem while fixing the issue.

Saying that, I'd like to stress that I'm perfectly fine with addressing the general issue of misbehaving native libraries separately (if we agree it's worth it) and I trust @dholmes-ora and @theRealAph to choose the most appropriate fix for this particular bug.

iwanowww avatar Oct 18 '22 18:10 iwanowww