
RNG State should belong to execution context

Open bvanessen opened this issue 5 years ago • 3 comments

The RNG state needs to be made model-specific, so that in tests like lbann2 the order in which the models are initialized does not affect their current or future state. To solve this, we should probably move the RNG state into the execution context for training, testing, etc. That way, when we save and restore, we get the proper RNG state for each execution phase.
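
A minimal sketch of the proposed layout, assuming hypothetical names (`rng_state`, `execution_context`, `checkpoint`/`restart`; none of these are confirmed against the actual LBANN classes):

```cpp
#include <random>
#include <sstream>
#include <string>

// Hypothetical sketch: each execution context (training, testing, ...)
// owns its own RNG state instead of sharing a global one, so the order
// in which models are initialized cannot perturb another phase's stream.
struct rng_state {
  std::mt19937 generator;

  // std::mt19937 defines stream operators, so its full engine state can
  // be captured as text and written into a checkpoint.
  std::string save() const {
    std::ostringstream os;
    os << generator;
    return os.str();
  }
  void restore(const std::string& s) {
    std::istringstream is(s);
    is >> generator;
  }
};

class execution_context {
public:
  rng_state& rng() { return m_rng; }

  // Checkpointing the context automatically includes its RNG state, so a
  // restored run resumes the exact random stream for this phase.
  std::string checkpoint() const { return m_rng.save(); }
  void restart(const std::string& ckpt) { m_rng.restore(ckpt); }

private:
  rng_state m_rng;
};

int main() {
  execution_context training;
  auto ckpt = training.checkpoint();         // save the training stream
  auto before = training.rng().generator();  // advance past the checkpoint
  training.restart(ckpt);                    // roll the stream back
  return training.rng().generator() == before ? 0 : 1;  // exact resume
}
```

With this layout, the training and testing contexts carry independent streams, so saving and restoring one phase cannot perturb the other.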

bvanessen avatar Sep 25 '19 22:09 bvanessen

I wonder if this is the right way to ensure that the checkpoint tests proceed deterministically. Couldn't we also achieve that by avoiding use of the RNG state entirely (by not shuffling data and by providing initial weight values)?
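
For reference, a self-contained toy illustrating the idea (not LBANN code): once weight initialization and data order are both fixed, two runs are bit-identical without ever consulting an RNG.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy sketch: weights initialized to a constant, data order fixed
// (no shuffling), so the whole trajectory is deterministic.
struct toy_model {
  std::vector<double> weights;
  explicit toy_model(std::size_t n) : weights(n, 0.01) {}  // constant init

  double step(const std::vector<double>& batch) {  // fixed data order
    double out = std::inner_product(batch.begin(), batch.end(),
                                    weights.begin(), 0.0);
    for (auto& w : weights) w += 1e-3 * out;  // dummy deterministic update
    return out;
  }
};

int main() {
  const std::vector<double> batch{1.0, 2.0, 3.0};
  toy_model a(3), b(3);
  assert(a.step(batch) == b.step(batch));  // identical trajectories
  assert(a.step(batch) == b.step(batch));
}
```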

timmoon10 avatar Sep 25 '19 22:09 timmoon10

We could avoid random states for checkpoint and restart, but then we would not be testing a complete workflow.


bvanessen avatar Sep 26 '19 03:09 bvanessen

My thinking is that a unit test for checkpointing should exercise as little of the workflow as possible. For the complete workflow, I think bit-perfect reproducibility is an excessively strict standard. Something akin to "matches expected learning behavior" seems more reasonable.
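
As an illustration of that looser standard, a hedged sketch of a tolerance-based check rather than a bit-for-bit comparison (the tolerance value and function name are made up for this example):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: instead of demanding bit-perfect reproducibility after restart,
// check that the restarted run's loss curve stays within a relative
// tolerance of the uninterrupted run's curve.
bool matches_expected_learning(const std::vector<double>& baseline_loss,
                               const std::vector<double>& restarted_loss,
                               double rel_tol = 0.05) {
  if (baseline_loss.size() != restarted_loss.size()) return false;
  for (std::size_t i = 0; i < baseline_loss.size(); ++i) {
    double denom = std::max(std::abs(baseline_loss[i]), 1e-12);
    if (std::abs(baseline_loss[i] - restarted_loss[i]) / denom > rel_tol)
      return false;
  }
  return true;
}

int main() {
  std::vector<double> base{1.0, 0.8, 0.6};
  std::vector<double> restarted{1.0, 0.81, 0.59};
  return matches_expected_learning(base, restarted) ? 0 : 1;
}
```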

timmoon10 avatar Sep 26 '19 17:09 timmoon10