Test results depend on `auto_diff` (thus `sympy`) implementation
When dealing with the formatting of `auto_diff`, I relinted the python code as well and generated new `auto_diff_*.f90` files using an up-to-date python environment. This fails 3 test cases, see here. Given that only the generated `auto_diff_*.f90` files changed, the failures must come from different implementations of the functions as generated by `sympy`. I used sympy-1.11.1 from conda-forge, and would like to know what version built the current `auto_diff_*.f90` files.
Subtle changes are e.g. how `sympy` simplifies expressions:

`q0 = powm1(sqrt(x%val - 1))*powm1(sqrt(x%val + 1))`

in my branch (formatting/auto_diff) vs.

`q0 = powm1(sqrt(pow2(x%val) - 1))`

in main.
These are obviously analytically identical, but numerically not necessarily.
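To make that concrete, here is a minimal numpy sketch (illustrative only, not MESA code; `powm1(y)` is treated as `1/y`) comparing the two forms in double precision:

```python
import numpy as np

# Two analytically identical expressions for 1/sqrt(x^2 - 1),
# mirroring the two sympy-generated forms above.
def form_branch(x):
    # powm1(sqrt(x - 1)) * powm1(sqrt(x + 1))
    return 1.0 / np.sqrt(x - 1.0) / np.sqrt(x + 1.0)

def form_main(x):
    # powm1(sqrt(pow2(x) - 1))
    return 1.0 / np.sqrt(x * x - 1.0)

x = np.linspace(1.0001, 10.0, 100001)
rel_diff = np.abs(form_branch(x) - form_main(x)) / np.abs(form_main(x))
# The two forms agree analytically but not bit-for-bit; the relative difference
# grows toward x = 1, where pow2(x) - 1 loses precision to cancellation.
print(rel_diff.max())
```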
I have tested the suite on a separate branch changing only `auto_diff_real_2var_order3` (used by `skye_eos`) here and found the same discrepant results as on the formatting branch. In general it is worrying that our code depends on the implementation of `sympy`, although we do provide the fortran code as-is. I don't think there is an obvious 'fix' to this issue other than freezing the `sympy` version we use.
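For what it's worth, a hypothetical guard along these lines in the generation scripts could enforce such a freeze (a sketch only, not something that exists in MESA today; the pinned version here is just the one I happened to use):

```python
# Hypothetical version guard for the auto_diff generation scripts (not in MESA):
# refuse to regenerate the fortran files with an unvetted sympy version.
import sympy

EXPECTED_SYMPY = "1.11.1"  # assumption: whatever version we decide to freeze on

if sympy.__version__ != EXPECTED_SYMPY:
    raise RuntimeError(
        f"auto_diff generation expects sympy {EXPECTED_SYMPY}, "
        f"found {sympy.__version__}; the generated fortran may not be bit-for-bit identical."
    )
```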
It's not surprising that the results differ as there are a different number of floating point operations being done. What we need to think about is how the round-off error propagates to decide which version is better, and then secondly think about which is faster.
Naively I would say the version in main looks like the better choice. auto_diff has some code to approximate the costs of a function, and it seems like the one with the two powm1's should be a lot more expensive to evaluate than the second one. Thus we might just need to adjust the weights in https://github.com/MESAHub/mesa/blob/main/auto_diff/python/measure.py to recover the second version.
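For illustration only, a toy cost table like the one below (not the actual contents of measure.py, whose weights and structure may differ) shows how weighting `powm1` heavily would make the single-powm1 form win:

```python
# Toy operation-cost model, purely illustrative of the idea; consult measure.py
# for the real weights and bookkeeping.
OP_COST = {"add": 1, "mul": 1, "pow2": 1, "sqrt": 4, "powm1": 8}

def cost(op_counts):
    """Total cost of an expression given how many times each operation appears."""
    return sum(OP_COST[op] * n for op, n in op_counts.items())

# q0 = powm1(sqrt(x - 1))*powm1(sqrt(x + 1))  ->  2 powm1, 2 sqrt, 2 add, 1 mul
branch_form = {"powm1": 2, "sqrt": 2, "add": 2, "mul": 1}
# q0 = powm1(sqrt(pow2(x) - 1))               ->  1 powm1, 1 sqrt, 1 pow2, 1 add
main_form = {"powm1": 1, "sqrt": 1, "pow2": 1, "add": 1}

print(cost(branch_form), cost(main_form))  # 27 14 -> the main-branch form is preferred
```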
Bit-for-bit has always been a statement about consistency between platforms, possibly at the expense of getting the "correct" answer (even if one can define the correct answer for floating point maths). So I'm not that worried about it, and this is not the first time we have had small floating point differences when things get defined slightly differently.
> Bit-for-bit has always been a statement about consistency between platforms, possibly at the expense of getting the "correct" answer (even if one can define the correct answer for floating point maths). So I'm not that worried about it, and this is not the first time we have had small floating point differences when things get defined slightly differently.
I agree with this. The critical thing is bit-for-bit reproducibility within a given version, not necessarily across versions. But you're right that it's alarming that those tests have failed as a result of this small change. The fix is not just to revert the changes to `auto_diff` but to investigate why those tests are so sensitive.
We at some point said we'd run the test suite using the intrinsic math functions from time to time, which might also pick this up. I'll add this to the release checklist.
Agreed. This may be another measure of the fragility of some of our tests. I just pushed a [ci converge] commit on main to see if we end up with something similar. We haven't used the [ci converge] flag in quite a while, so this could be interesting.
Ok, good to know your takes on this. Given the consensus then, should we ship the python files to generate the `auto_diff` files to the users? As we strive for reproducibility and homogeneity within a version, we don't want users to generate their own versions of `auto_diff`. Or at least we should warn them that such discrepancies might pop up if they decide to do so (or when they create a new `auto_diff` module with different vars/orders for particular use cases).
Yes, we should just continue to ship the generated fortran files.
Interestingly, there isn't quite as much overlap between the [ci converge] failures and the failures on the `auto_diff` branch as I would have thought. It looks like `make_o_ne_wd` did fail on both, so that continues to be a fragile test case that is overly sensitive to numerics. But `hb_2M` passed happily on the [ci converge] commit at a couple different resolutions. I do think it's probably worth spending some time figuring out why that test case is failing here.

The particular `auto_diff` file that you touched should only affect Skye as far as I can tell, so it's possible that somehow it introduces an EOS problem. If you aren't already familiar @matthiasfabry, this might be a good opportunity for you to dive into solver debugging for the `hb_2M` failure to see if we can learn anything about what's going on with Skye there. I'm happy to help out too if you'd like to work on that together.
I'm wondering if this issue might be partly related to #589, since they both involve `auto_diff`. `gyre_in_mesa_rsg` failed on the branch being tested here, and also needed to be tweaked to pass with the EOS fix in #591, so it might be worth merging that fix in here and testing again to see how it goes.
Interesting. Fewer failures now, but still two: https://testhub.mesastar.org/auto_diff_tests/commits/ec8461f

(Edit: I guess that wasn't a full test though. I just pushed a [ci optional] commit on that branch to see where we end up.)