Brian Van Essen
`inline bool read_latest(std::string filename, execution_mode& mode, size_t& epochLast, size_t& trainLast)`
`void fp_setup_outputs(size_t mini_batch_size) override`
> It would be good to have this documentation in the documentation for the member functions. As a user of this interface working with the generated doxygen or sphinx documentation,...
These three callbacks all output the weight matrices and other common data structures. We should unify or align how they select the output directory, etc.
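One way to align directory selection is to route every callback through a single helper. The sketch below is only an illustration under stated assumptions: `output_location`, `make_output_dir`, and the `<base>/<callback>/epoch.<N>` layout are hypothetical names and conventions, not part of the existing LBANN code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical description of where a callback writes its output.
// Neither this struct nor the layout below exists in LBANN; both are
// assumptions made for illustration.
struct output_location {
  std::string base_dir;  // user-supplied root directory; may be empty
  std::string callback;  // name of the callback, e.g. "dump_weights"
};

// Builds "<base_dir>/<callback>/epoch.<epoch>" so that every callback
// derives its output directory in the same way.
inline std::string make_output_dir(const output_location& loc,
                                   std::size_t epoch) {
  assert(!loc.callback.empty() && "callback name must not be empty");
  // Fall back to the current directory when no base directory is given.
  std::string dir = loc.base_dir.empty() ? "." : loc.base_dir;
  if (dir.back() != '/') {
    dir += '/';
  }
  return dir + loc.callback + "/epoch." + std::to_string(epoch);
}
```

Each callback would then call the helper with its own name instead of assembling paths by hand, so a change to the layout happens in one place.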
If there are multiple trainers per node, it may make sense to share the I/O thread pool between trainers.
The checkpoint and restart example crashes when the checkpoint was created with `--data_reader_percent=0.01` and the restart uses the entire data set.
With the trainer PR, it is now clear that callbacks should be owned by the model or training algorithm. These should be separated. This split should also make it easier...
and into the training algorithm. They were added to minimize the impact on the lbann front end files.
Look at merging the persist states of all the individual execution contexts.
AWS OFI RCCL is a plug-in that enables EC2 developers to use [libfabric](https://github.com/ofiwg/libfabric) as a network provider while running [AMD's RCCL](https://github.com/ROCmSoftwarePlatform/rccl)-based applications.