Brian Van Essen

45 issues by Brian Van Essen

`inline bool read_latest(std::string filename, execution_mode& mode, size_t& epochLast, size_t& trainLast)`

refactor

`void fp_setup_outputs(size_t mini_batch_size) override`

refactor

> It would be good to have this documentation in the documentation for the member functions. As a user of this interface working with the generated Doxygen or Sphinx documentation,...

refactor

These three callbacks all output the weight matrices and other common data structures. We should unify or align how they select the output directory, etc.

refactor

If there are multiple trainers per node, it may make sense to share the I/O thread pool between trainers.

enhancement
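Below is a minimal sketch of the idea. It assumes a single node-wide pool that is handed to each trainer through a `std::shared_ptr`. The names `io_thread_pool`, `trainer`, and `fetch_sample` are hypothetical stand-ins, not LBANN's actual classes. The pool is a plain `std::thread` work queue, and the lambda passed to it stands in for real data-reader I/O.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal I/O thread pool (not LBANN's real implementation).
class io_thread_pool {
public:
  explicit io_thread_pool(std::size_t num_threads) {
    assert(num_threads > 0);  // a pool with no workers would never run tasks
    for (std::size_t i = 0; i < num_threads; ++i) {
      m_workers.emplace_back([this] { worker_loop(); });
    }
  }
  ~io_thread_pool() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_stop = true;
    }
    m_cv.notify_all();
    for (auto& t : m_workers) t.join();  // workers drain the queue, then exit
  }
  io_thread_pool(const io_thread_pool&) = delete;
  io_thread_pool& operator=(const io_thread_pool&) = delete;

  // Queue a task and return a future for its result.
  template <typename F>
  auto submit(F&& f) -> std::future<decltype(f())> {
    using R = decltype(f());
    auto task = std::make_shared<std::packaged_task<R()>>(std::forward<F>(f));
    std::future<R> result = task->get_future();
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_tasks.emplace([task] { (*task)(); });
    }
    m_cv.notify_one();
    return result;
  }

private:
  void worker_loop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_stop || !m_tasks.empty(); });
        if (m_stop && m_tasks.empty()) return;
        task = std::move(m_tasks.front());
        m_tasks.pop();
      }
      task();
    }
  }
  std::vector<std::thread> m_workers;
  std::queue<std::function<void()>> m_tasks;
  std::mutex m_mutex;
  std::condition_variable m_cv;
  bool m_stop = false;
};

// Hypothetical trainer that borrows the node-wide pool instead of owning one.
class trainer {
public:
  explicit trainer(std::shared_ptr<io_thread_pool> pool) : m_io_pool(std::move(pool)) {}
  std::future<int> fetch_sample(int index) {
    // Stand-in for real data-reader I/O.
    return m_io_pool->submit([index] { return index * 2; });
  }
private:
  std::shared_ptr<io_thread_pool> m_io_pool;
};
```

With shared ownership, the pool lives until its last owner releases it, so trainers on the same node do not each spawn their own set of I/O threads. Whether sharing actually helps would depend on how contended the node's I/O is.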

Running the checkpoint-and-restart example crashes when the checkpoint was created with `--data_reader_percent=0.01` and the restart uses the entire data set.

bug

With the trainer PR, it is now clear that callbacks should be owned by the model or training algorithm. These should be separated. This split should also make it easier...

refactor
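The split could look something like the sketch below. It assumes each owner simply keeps its own callback list. `callback`, `model`, `training_algorithm`, and `epoch_counter` are hypothetical simplifications, not LBANN's real class hierarchy. In this design, callbacks tied to model state (such as weight dumps) would attach to the model, while loop-level callbacks (such as checkpointing) would attach to the training algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical callback interface (not LBANN's callback base class).
class callback {
public:
  virtual ~callback() = default;
  virtual void on_epoch_end() = 0;
};

using callback_list = std::vector<std::unique_ptr<callback>>;

// Model-owned callbacks: tied to the model's own state.
class model {
public:
  void add_callback(std::unique_ptr<callback> cb) {
    assert(cb != nullptr);
    m_callbacks.push_back(std::move(cb));
  }
  const callback_list& callbacks() const { return m_callbacks; }
private:
  callback_list m_callbacks;
};

// Algorithm-owned callbacks: tied to the training loop itself.
class training_algorithm {
public:
  void add_callback(std::unique_ptr<callback> cb) {
    assert(cb != nullptr);
    m_callbacks.push_back(std::move(cb));
  }
  // Each list stays with its owner; the loop visits both without merging them.
  void train(model& m, std::size_t num_epochs) {
    for (std::size_t epoch = 0; epoch < num_epochs; ++epoch) {
      // ... forward/backward passes would go here ...
      for (const auto& cb : m_callbacks) cb->on_epoch_end();
      for (const auto& cb : m.callbacks()) cb->on_epoch_end();
    }
  }
private:
  callback_list m_callbacks;
};

// Example callback that increments an external counter at each epoch end.
class epoch_counter : public callback {
public:
  explicit epoch_counter(int& count) : m_count(count) {}
  void on_epoch_end() override { ++m_count; }
private:
  int& m_count;
};
```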

and into the training algorithm. They were added to minimize the impact on the LBANN front-end files.

Look at merging the persist states of all the individual execution contexts.

enhancement
refactor

AWS OFI RCCL is a plug-in which enables EC2 developers to use [libfabric](https://github.com/ofiwg/libfabric) as a network provider while running [AMD's RCCL](https://github.com/ROCmSoftwarePlatform/rccl) based applications.

new-version
new-package
dependencies
update-package
conflicts
maintainers
new-variant