Brian Van Essen issues

Results: 45 issues of Brian Van Essen

Make checkpoint helper functions take values by reference

```inline bool read_latest(std::string filename, execution_mode& mode, size_t& epochLast, size_t& trainLast)```
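A minimal sketch of what the requested change could look like for `read_latest`, with the filename passed by `const` reference so the string is not copied on each call. The `execution_mode` enum and the whitespace-separated file layout (`<mode> <epoch> <step>`) are assumptions made for illustration, not the project's actual definitions or on-disk format.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical stand-in for the project's execution_mode enum.
enum class execution_mode { training, validation, testing, invalid };

// Takes the filename by const reference to avoid copying the string.
// Assumed file layout (illustration only): "<mode> <epoch> <step>".
inline bool read_latest(const std::string& filename,
                        execution_mode& mode,
                        std::size_t& epochLast,
                        std::size_t& trainLast)
{
  std::ifstream in(filename);
  if (!in) {
    return false; // file missing or unreadable; outputs are left untouched
  }
  int mode_id = 0;
  std::size_t epoch = 0;
  std::size_t step = 0;
  if (!(in >> mode_id >> epoch >> step)) {
    return false; // malformed contents; outputs are left untouched
  }
  if (mode_id < 0 || mode_id >= static_cast<int>(execution_mode::invalid)) {
    return false; // mode id outside the known range
  }
  // Write the outputs only after every field has been validated.
  mode = static_cast<execution_mode>(mode_id);
  epochLast = epoch;
  trainLast = step;
  return true;
}
```

Because the outputs are only written after parsing succeeds, a caller can treat a `false` return as "no usable checkpoint" without resetting its own state.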

refactor

Change the fp_setup_outputs function in the layer class to use size_t

```void fp_setup_outputs(size_t mini_batch_size) override```

refactor

Move execution functions documentation to member functions

> It would be good to have this documentation in the documentation for the member functions. As a user of this interface working with the generated doxygen or sphinx documentation,...

refactor

Unify or align the callbacks for checkpoint, save model, and dump weights

These three callbacks all output the weight matrices and other common data structures. We should unify or align how they select the output directory, etc.
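One way to align the three callbacks would be to route their directory selection through a single shared helper. The sketch below is purely illustrative: `callback_output_dir`, its parameters, and the `<base>/<trainer>/<callback>/epoch.<N>` layout are hypothetical and do not reflect the project's existing conventions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical shared helper: every callback that writes weights would call
// this instead of assembling its own path. The directory layout is an
// assumption chosen for the example.
inline std::string callback_output_dir(const std::string& base_dir,
                                       const std::string& trainer_name,
                                       const std::string& callback_tag,
                                       std::size_t epoch)
{
  // Fall back to the current directory when no base directory is given.
  std::string dir = base_dir.empty() ? std::string(".") : base_dir;
  // Ensure exactly one separator between the base and the next component.
  if (dir.back() != '/') {
    dir += '/';
  }
  return dir + trainer_name + "/" + callback_tag + "/epoch." +
         std::to_string(epoch);
}
```

With a helper like this, the checkpoint, save-model, and dump-weights callbacks would differ only in the `callback_tag` they pass, so a change to the layout happens in one place.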

refactor

Share I/O thread pools between trainers

If there are multiple trainers per node, it may make sense to share the I/O thread pool between trainers.

enhancement

Restarting from checkpoint with a different percentage of data set crashes

Running the checkpoint-and-restart example crashes when the checkpoint was created with --data_reader_percent=0.01 and the restart uses the entire data set.

bug

Split the callbacks into model and training algorithm sets

With the trainer PR, it is now clear that callbacks should be owned by the model or training algorithm. These should be separated. This split should also make it easier...

refactor

Move the train and evaluate shortcut methods out of the trainer class

and into the training algorithm. They were added to minimize the impact on the LBANN front-end files.

Clean up persist structures for execution contexts

Look at merging the persist states of all the individual execution contexts.

enhancement

refactor

Add aws ofi rccl

AWS OFI RCCL is a plug-in which enables EC2 developers to use [libfabric](https://github.com/ofiwg/libfabric) as a network provider while running [AMD's RCCL](https://github.com/ROCmSoftwarePlatform/rccl) based applications.

new-version

new-package

dependencies

update-package

conflicts

maintainers

new-variant