Tim Moon comments

Results: 80 comments by Tim Moon

Add support for Apex distributed Adam optimizer with GPT-3

I've rebased to incorporate the sequence parallelism support from https://github.com/NVIDIA/NeMo/pull/4380. Pinging @ericharper.

Add support for Apex distributed Adam optimizer with GPT-3

I've made the distributed optimizer dependent on `megatron_amp_O2` instead of being mutually exclusive. I'm not convinced it simplifies the implementation so much as it shifts around the messiness, but it...
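To picture what "dependent on `megatron_amp_O2`" could mean in practice, here is a minimal sketch of a config check. It assumes a plain dictionary config. The helper `check_optimizer_config`, the `optim`/`name` key layout, and the optimizer name `distributed_fused_adam` are illustrative assumptions, not code taken from the PR.

```python
def check_optimizer_config(cfg: dict) -> None:
    """Raise if the distributed optimizer is requested without megatron_amp_O2.

    Hypothetical helper: the key layout and the optimizer name are assumptions
    made for this sketch.
    """
    optim_name = cfg.get("optim", {}).get("name", "")
    uses_distributed = optim_name == "distributed_fused_adam"  # assumed name
    if uses_distributed and not cfg.get("megatron_amp_O2", False):
        raise ValueError("distributed optimizer requires megatron_amp_O2=True")


# Accepted: the distributed optimizer is paired with megatron_amp_O2.
check_optimizer_config(
    {"optim": {"name": "distributed_fused_adam"}, "megatron_amp_O2": True}
)
```

Under a mutually exclusive scheme the condition would be inverted. The sketch only shows the dependent variant described in the comment.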

Error when mini-batch size is smaller than number of processes

My main use-case for now is a unit test with a mini-batch size of 1. So I suppose it's a bit unrepresentative of "real" use-cases, and I can get around...
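As an illustration of why this edge case is awkward, the sketch below splits a mini-batch as evenly as possible across processes. The function `local_batch_sizes` is hypothetical and is not LBANN's actual partitioning logic. It only shows that with a mini-batch size of 1 most processes end up with no samples, which is one plausible way such an error could arise.

```python
from typing import List


def local_batch_sizes(mini_batch_size: int, num_procs: int) -> List[int]:
    """Split a mini-batch as evenly as possible across processes.

    Hypothetical helper for illustration only.
    """
    base, extra = divmod(mini_batch_size, num_procs)
    # The first `extra` ranks receive one additional sample.
    return [base + (1 if rank < extra else 0) for rank in range(num_procs)]


# A mini-batch of 1 on 4 processes: only rank 0 receives data.
sizes = local_batch_sizes(1, 4)
empty_ranks = [rank for rank, n in enumerate(sizes) if n == 0]
```

Any code path that assumes every process holds at least one sample would need a guard for the ranks listed in `empty_ranks`.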

Could NOT find Protobuf (missing: Protobuf_PROTOC_EXECUTABLE)

I have a hypothesis. When we build `lbann.pb.h`, we use the CMake variable `protobuf::protoc`: https://github.com/LLNL/lbann/blob/c2e7f2b624ca7a7cddc8b6482028b5e289893e9c/src/proto/CMakeLists.txt#L13. However, this is not set by the [`FindProtobuf` module](https://cmake.org/cmake/help/v3.9/module/FindProtobuf.html). CMake finds an old version of...

Enable logging of per-step metric

The current infrastructure for metrics/objective functions/evaluation layers is a mess that's hurting performance (see #632), so I wonder if this would be a good time to refactor. My proposed scheme...

Potential bug in dropout layer

I made a test model with 20 dropout layers in a row and didn't observe any memory issues. Can you provide more details about your error?

(LOW priority) Disable LBANN warning output during testing

`LBANN_WARNING` is basically a convenience wrapper for printing to `stderr`. I am resistant to adding logic to silence it, since its point is precisely to print on screen. Here's the...

Remove deprecated Python scripts for ONNX conversion and plotting

As of 12/1, this is ready to merge.

Remove deprecated Python scripts for ONNX conversion and plotting

As of 3/7, this is ready to merge.

How to generate prototext?

Try replacing the `lbann.contrib.launcher.run` with `lbann.proto.save_prototext`: https://github.com/LLNL/lbann/blob/9c94701e30b83a76c252e1a0b4df97b2b7d11021/python/lbann/proto.py#L7

Something like:

```python
lbann.proto.save_prototext(
    prototext_file,
    trainer=trainer,
    model=model,
    data_reader=data_reader,
    optimizer=opt,
)
```

The Python frontend assumes you are running LBANN on a system that uses SLURM...