lbann
RNG State should belong to execution context
The RNG state needs to be made model-specific, so that in tests like lbann2 the order in which models are initialized does not affect their current or future state. To solve this, we should probably move the RNG state into the execution context for training, testing, etc. That way, when we save and restore, we get the proper RNG state for each execution phase.
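To make the proposal concrete, here is a minimal sketch of the idea. The names (`execution_context`, `save_to_checkpoint`, `load_from_checkpoint`) and the choice of `std::mt19937` are illustrative assumptions, not LBANN's actual API:

```cpp
#include <cassert>
#include <random>
#include <sstream>

// Hypothetical sketch (not LBANN's actual API): each execution context
// owns its own RNG engine and checkpoints it with the rest of its state.
class execution_context {
public:
  explicit execution_context(unsigned long seed) : m_rng(seed) {}

  // All random draws for this phase come from this engine, so other
  // phases (or other models) cannot perturb its sequence.
  std::mt19937& rng() { return m_rng; }

  // std::mt19937 defines stream operators, so the engine state can be
  // written to and read back from a checkpoint exactly.
  void save_to_checkpoint(std::ostream& os) const { os << m_rng; }
  void load_from_checkpoint(std::istream& is) { is >> m_rng; }

private:
  std::mt19937 m_rng;
};

int main() {
  execution_context training(20190925);
  training.rng()();  // advance the engine, as a training step would

  // Round-trip the RNG state through a "checkpoint".
  std::stringstream ckpt;
  training.save_to_checkpoint(ckpt);

  execution_context restored(0);  // seed is irrelevant; state is loaded
  restored.load_from_checkpoint(ckpt);

  // The restored context continues the exact same random sequence.
  assert(training.rng()() == restored.rng()());
}
```

With per-context engines, constructing a second model can no longer advance the training context's generator, so the order in which models are initialized stops mattering, and restoring a checkpoint resumes the same random sequence for that execution phase.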
I wonder if this is the right way to ensure that the checkpoint tests proceed deterministically. Couldn't we also do it by avoiding use of the RNG state entirely (by not shuffling data and by providing initial weight values)?
We could avoid random state for checkpoint and restart, but then we would not be testing a complete workflow.
My thinking is that a unit test for checkpointing should use as little of the workflow as possible. For the complete workflow, I think bit-perfect reproducibility is an excessively strict standard. Something akin to "matches expected learning behavior" seems more reasonable.