Tim Moon
I've rebased to incorporate the sequence parallelism support from https://github.com/NVIDIA/NeMo/pull/4380. Pinging @ericharper.
I've made the distributed optimizer dependent on `megatron_amp_O2` instead of being mutually exclusive. I'm not convinced it simplifies the implementation so much as it shifts around the messiness, but it...
My main use-case for now is a unit test with a mini-batch size of 1. So I suppose it's a bit unrepresentative of "real" use-cases, and I can get around...
I have a hypothesis. When we build `lbann.pb.h`, we use the CMake variable `protobuf::protoc` (https://github.com/LLNL/lbann/blob/c2e7f2b624ca7a7cddc8b6482028b5e289893e9c/src/proto/CMakeLists.txt#L13). However, this is not set by the [`FindProtobuf` module](https://cmake.org/cmake/help/v3.9/module/FindProtobuf.html). CMake finds an old version of...
The current infrastructure for metrics/objective functions/evaluation layers is a mess that's hurting performance (see #632), so I wonder if this would be a good time to refactor. My proposed scheme...
I made a test model with 20 dropout layers in a row and didn't observe any memory issues. Can you provide more details about your error?
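For anyone who wants to try something similar outside LBANN, here is a rough stand-in for that kind of test. It is only a sketch in plain Python: `dropout` and `run_dropout_chain` are hypothetical helpers written for illustration, not part of LBANN, and the layer width and `keep_prob` are arbitrary assumptions. It chains 20 dropout layers and reports peak memory as seen by `tracemalloc`, which only tracks Python allocations and says nothing about GPU or LBANN-internal memory.

```python
import random
import tracemalloc


def dropout(values, keep_prob, rng):
    """Inverted dropout: zero each entry with probability 1 - keep_prob
    and rescale the surviving entries by 1 / keep_prob."""
    scale = 1.0 / keep_prob
    return [v * scale if rng.random() < keep_prob else 0.0 for v in values]


def run_dropout_chain(num_layers=20, width=1024, keep_prob=0.9, seed=0):
    """Pass a vector of ones through `num_layers` dropout layers in a row.

    All default values are illustrative assumptions, not LBANN settings.
    """
    rng = random.Random(seed)
    activations = [1.0] * width
    for _ in range(num_layers):
        # Rebinding the name lets the previous layer's output be freed.
        activations = dropout(activations, keep_prob, rng)
    return activations


tracemalloc.start()
out = run_dropout_chain()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced memory: {peak / 1024:.1f} KiB")
```

If peak memory in a real run grows with the number of dropout layers, that would be a useful detail to include in the error report.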
`LBANN_WARNING` is basically a convenience wrapper for printing to `stderr`. I am resistant to adding logic to silence it, since its point is precisely to print on screen. Here's the...
As of 12/1, this is ready to merge.
As of 3/7, this is ready to merge.
Try replacing the `lbann.contrib.launcher.run` call with `lbann.proto.save_prototext`: https://github.com/LLNL/lbann/blob/9c94701e30b83a76c252e1a0b4df97b2b7d11021/python/lbann/proto.py#L7

Something like:

```python
lbann.proto.save_prototext(prototext_file, trainer=trainer, model=model, data_reader=data_reader, optimizer=opt)
```

The Python frontend assumes you are running LBANN on a system that uses SLURM...