
Results differ after restarting from output

Open conradtchan opened this issue 2 years ago • 10 comments

To reproduce this issue:

$PHANTOM_DIR/scripts/writemake.sh shock > Makefile
make; make setup; make diffdumps

Setup with default options:

./phantomsetup shock

Reduce the resolution for quicker testing by setting nx = 32 in shock.setup.

Finish setup:

./phantomsetup shock

Run from the start:

./phantom shock

Copy last output for reference:

cp shock_000020 shock_000020.ref

Restart from an intermediate full output by setting dumpfile = shock_00010 in shock.in.

./phantom shock

Compare the two final outputs:

./diffdumps shock_000020 shock_000020.ref

Files differ significantly:


       particle IDs differ            0  times
          positions differ        16416  times
  smoothing lengths differ         7704  times
         velocities differ        16416  times
   thermal energies differ        16416  times

MAX RMS ERROR: 2.8942E-06

 FILES DIFFER

conradtchan • Aug 19 '21 04:08

How similar do you expect the results to be given machine rounding, etc...?

jameswurster • Aug 19 '21 09:08

Same problem also occurs for HDF5 outputs, so it is not specific to native outputs.

conradtchan • Aug 20 '21 02:08

> How similar do you expect the results to be given machine rounding, etc...?

It should be possible for results to be bitwise identical after restarting from an output. Indeed, when restarting twice from the same output, the results are identical.

When running with MPI, where the order of operations differs and is non-deterministic, the results only differ by 1.e-15. So 1.e-06 for a simple test problem like this without MPI seems concerning.
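As a minimal illustration of why a changed order of operations alone produces differences at the 1.e-15 level: floating-point addition is not associative, so regrouping the same terms changes the last bits of the sum. The values below are arbitrary and are not taken from Phantom.

```python
# Floating-point addition is not associative: regrouping the same three
# terms changes the last bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

difference = abs(left_to_right - right_to_left)
relative = difference / abs(right_to_left)

# The two sums are not bitwise identical, but they agree to roughly
# machine precision for double-precision numbers.
print(left_to_right == right_to_left)
print(f"relative difference: {relative:.1e}")
```

A discrepancy of this kind stays near double-precision round-off, which is why a difference of 1.e-06 points to something other than summation order.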

conradtchan • Aug 20 '21 02:08

About 10^-15 is the typical tolerance for diffdumps; this is something not being done right in the MPI code.

danieljprice • Aug 20 '21 02:08

> About 10^-15 is the typical tolerance for diffdumps; this is something not being done right in the MPI code.

Just to clarify, this has nothing to do with MPI. The reproducer above is compiled completely without MPI.

I only mentioned MPI to illustrate that even MPI's non-deterministic order of operations preserves results better than restarting. But of course this isn't really a meaningful comparison.

conradtchan • Aug 20 '21 03:08

Could this be due to the "extra" derivs call that is performed by initial? That is, when you restart from a dumpfile, the acceleration/force at the beginning of the step is NOT the same as the force that was calculated during the previous step (since the shock terms make the force velocity dependent).
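The effect described here can be sketched with a toy integrator. This is not Phantom's code: the damped oscillator force, the function names, and the predictor-corrector leapfrog below are all assumptions chosen only to show that recomputing a velocity-dependent force from the dumped state differs from carrying the force over from the previous step.

```python
def accel(x, v, c):
    """Toy force; the -c*v term stands in for velocity-dependent shock terms."""
    return -x - c * v


def step(x, v, a, dt, c):
    """One predictor-corrector leapfrog step.

    Returns the new position, the corrected velocity, and the acceleration
    that is carried over to the next step.
    """
    v_half = v + 0.5 * dt * a
    x_new = x + dt * v_half
    v_pred = v_half + 0.5 * dt * a      # predicted velocity, uses the old a
    a_new = accel(x_new, v_pred, c)     # force evaluated with v_pred
    v_new = v_half + 0.5 * dt * a_new   # corrected velocity
    return x_new, v_new, a_new


def run(nsteps, restart_at=None, c=0.1, dt=0.01):
    """Integrate nsteps; optionally mimic a restart at step restart_at."""
    x, v = 1.0, 0.0
    a = accel(x, v, c)
    for n in range(nsteps):
        if n == restart_at:
            # The "dump" holds only x and v, so the restart recomputes a
            # from the corrected velocity instead of reusing the carried a.
            a = accel(x, v, c)
        x, v, a = step(x, v, a, dt, c)
    return x, v


reference = run(20)
restarted = run(20, restart_at=10)
print(abs(reference[0] - restarted[0]))  # nonzero when c != 0

# With no velocity dependence the recomputed force equals the carried one.
print(run(20, c=0.0) == run(20, restart_at=10, c=0.0))
```

In this sketch the mismatch comes entirely from evaluating the force with the corrected rather than the predicted velocity; whether the same mechanism accounts for the full 1.e-6 seen in the shock test would need checking against the actual code.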

dliptai • Aug 20 '21 07:08

I guess this is plausible, rms error is small after all ...

danieljprice • Aug 25 '21 00:08

One contribution to this error is that h is stored in single precision, so the starting point for density iterations is different after a restart. But this contribution is probably small, and even after changing h to double precision, there is still an error of roughly 1.e-6.
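For scale, a short sketch of what a single-precision round trip does to a double-precision value. The value of h is hypothetical, and Python's struct module is used here only to mimic writing a 4-byte real and reading it back.

```python
import struct


def roundtrip_single(value):
    """Store a double as a 4-byte float and read it back as a double."""
    return struct.unpack("f", struct.pack("f", value))[0]


h = 0.1  # hypothetical smoothing length, not taken from the shock test
h_restart = roundtrip_single(h)
rel_err = abs(h_restart - h) / h

# The restored value is no longer bitwise equal to the original, and the
# relative error is bounded by the single-precision unit round-off 2**-24.
print(h_restart == h)
print(f"relative error: {rel_err:.2e} (bound {2**-24:.2e})")
```

The perturbation to the stored value is at most about 6e-8 in relative terms, which is smaller than the 1.e-6 discrepancy and in line with the observation that switching h to double precision does not remove it.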

conradtchan • Aug 25 '21 00:08

I think it is likely the extra derivs call. We have never really quantified what the acceptable tolerance is for results to be identical; 10^-6 is certainly lower than various tolerances (e.g. tolv and the h-rho iteration tolerance of 1e-4).

It would be painful to do without this, as it would require storing accelerations and other things in the dump file. I would be extremely reluctant to do this when the biggest limitation we have at the moment is disk space.

danieljprice • Aug 25 '21 00:08

Would you consider writing "restartable" files infrequently that contain h in double precision and the forces? E.g. once every 24h of wall time? And perhaps it could be optional, so if reproducibility is not important but disk space is limited, it can be disabled?

conradtchan • Aug 25 '21 00:08