Brian Van Essen
Add a feature so that the checkpoint logic writes a checkpoint only when the monitored metric has improved.
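A minimal sketch of the improved-metric gate (the class name and the "lower is better" convention are assumptions, not LBANN's actual callback API):

```cpp
#include <limits>

// Hypothetical sketch: remember the best value of the monitored metric seen
// so far and report whether the latest value improves on it, so the
// checkpoint callback can skip epochs where nothing improved. Assumes a
// loss-like metric where lower is better.
class checkpoint_if_improved {
public:
  // Returns true exactly when this value beats the best seen so far.
  bool should_checkpoint(double metric_value) {
    if (metric_value < m_best) {
      m_best = metric_value;  // record the new best value
      return true;
    }
    return false;
  }
private:
  double m_best = std::numeric_limits<double>::infinity();
};
```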
The dump-gradients and dump-error-signals callbacks should be updated to use the trainer-safe naming scheme.
https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
Update the C&R test to include permutations of the optimizers.
Callbacks are currently all treated as stateless from a C&R point of view; this needs to be addressed.
Move the old prototext-based application models into a legacy directory. New applications will be based on the Python front end and use a new directory structure that groups together...
Have the CI also rebuild and install a version of LBANN on LC for users to point at.
The RNG state needs to be made model-specific so that, in tests like lbann2, the order in which the models are initialized does not impact their current or future...
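A minimal sketch of one way to decouple the streams (the function name and seeding scheme are illustrative assumptions): derive each model's RNG from the global seed combined with the model's name rather than from a shared global stream, so the order in which models are constructed cannot perturb any model's random sequence.

```cpp
#include <functional>
#include <random>
#include <string>

// Hypothetical per-model RNG factory: the stream depends only on the global
// seed and the model's name, never on how many draws other models have made.
std::mt19937 make_model_rng(unsigned global_seed,
                            const std::string& model_name) {
  std::seed_seq seq{global_seed,
                    static_cast<unsigned>(
                        std::hash<std::string>{}(model_name))};
  return std::mt19937(seq);
}
```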
It could be useful to add a method like:

```c++
virtual std::unique_ptr<execution_context> get_prototype_execution_context() const = 0;
```

Then in derived classes implement, e.g.:

```c++
std::unique_ptr<execution_context>
sgd_training_algorithm::get_prototype_execution_context() const {
  return make_unique<sgd_execution_context>();
}
```
...
The proto layer graph constructor should not require the trainer, but it currently needs it for the number of parallel readers for the input layers. Once the data reader moves, we can remove...