GOMC icon indicating copy to clipboard operation
GOMC copied to clipboard

Possible Bad Integer Output in Checkpointing

Open LSchwiebert opened this issue 5 years ago • 2 comments

Describe the bug

The checkpoint code uses a union structure for binary output of the integer values used for a checkpoint. This union structure is pairing a 32-bit unsigned integer with an 8-byte character array, but a 32-bit integer is only 4 bytes. So, four bytes of this union are not initialized properly and whatever values were stored before are written.

To Reproduce

Any test where the config file specifies a checkpoint file will experience this. If the data in these four bytes happens to be zero, things will work properly but there is no guarantee that things will work correctly since these bytes are never initialized.

Expected behavior

We should output and input the correct number of bytes for the union.

Screenshots

N/A

Input files

Any example that includes checkpointing will be fine.

Please complete the following information:

Version 2.70, development branch but the bug probably exists in several releases.

Additional context

I don't think we need to output ints in binary like we do with double precision numbers, because there are no precision issues. However, if we want to do this then we should at least match the sizes. If we want to maintain compatibility then we should change the integer value in the union to a 64-bit integer. The other option is to change the union to 32 bits and pair it with a 4-byte character array. We would need another union for ulong values like step. I can create the patch once we agree on how we should fix this.

LSchwiebert avatar Nov 07 '20 02:11 LSchwiebert

@jpotoff, @LSchwiebert suggested a good way of handling integer outputs to a binary file, however, the function (htonl) only supports 32 bits integers. That means the checkpoint output will support step numbers up to 4,294,967,295 . Would that be okay? Will there be any case where someone would run the simulation longer than that?

GOMC-WSU avatar Nov 12 '20 21:11 GOMC-WSU

In the interests of documenting the ideas that Younes and I developed while discussing how to resolve this, here are the details. We are proposing to update the checkpoint format to output multi-byte integers using the same technique that the TCP/IP protocols output multi-byte integers, using functions such as htonl to output the integers in a standard network byte order. When these integers are read in from a checkpoint file, the opposite function of ntohl is used to convert from network to host byte order. In this way, checkpoint files will be portable across platforms and processors.

In order to load checkpoint files generated with the existing format as well as the new format, we will output an encoded version number at the beginning of the checkpoint file, such as $2.71$ or @2.71@ or $version 2.71 and test this first byte. This not only allows the code to distinguish older checkpoint files from the ones using this new format, but it also offers a forward path for any future changes. So, any future changes to the checkpoint format could be accommodated by changing this version number to the latest version number and loading the data in this new format based on the version number value.

The easiest way to support both formats seems to be to write member functions in the current CheckpointSetup class with an additional parameter, the version number. Then we have both with no overloading and based on reading the first byte, call one set of functions or the other. (The CheckpointOutput class can be changed to the new format without saving the old format.)

LSchwiebert avatar Nov 13 '20 00:11 LSchwiebert