
Error when mini-batch size is smaller than number of processes

Open timmoon10 opened this issue 4 years ago • 4 comments

I encounter an error when I change the mini-batch size in one of the Bamboo unit tests, e.g.:

https://github.com/LLNL/lbann/blob/9a9e31cb33fd5460ad6da335ff647aad79088049/bamboo/unit_tests/test_unit_layer_identity.py#L45

If the mini-batch size is less than the number of processes, the following error message shows up before training begins:

LBANN error on rank 3 (/usr/WS1/moon13/src/lbann/include/lbann/layers/io/input/generic_input_layer.hpp:244): I/O buffer does not contain valid samples (0)

The corresponding code is here:

https://github.com/LLNL/lbann/blob/9a9e31cb33fd5460ad6da335ff647aad79088049/include/lbann/layers/io/input/generic_input_layer.hpp#L244

timmoon10 · May 26 '20 22:05

Just to clarify: Are you suggesting that the Python code or the C++ code is problematic here (or both)?

To me, it seems that, if there is an error, it is in the Python. I think it's reasonable for LBANN to decide it can't handle zero-data processes at the C++ level. If that's an acceptable limitation for everyone, I'd argue that the Python front-end/testing code should take whatever steps are needed to handle the UI side of this: Is it a Python error to provide a mini-batch size smaller than the number of processes? Is it just a warning, after which the Python corrects the size to max(requested_batch_size, smallest_possible_batch_size)? Does the Python adjust the process count instead of the batch size?

The big problem I see here is that the error could have been detected at Python time but was exposed only after the C++ code had been launched. If this were interactive, that might be OK, but it would be bad to have a huge job pending for hours, only for it to die 30s in because you forgot to bump the batch size when you upped the allocation size...
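As a rough illustration of the Python-time options listed above (hard error versus warn-and-correct), here is a minimal sketch. The helper `check_mini_batch_size` and its `policy` argument are hypothetical and not part of LBANN's Python front-end, and the sketch assumes the smallest usable mini-batch size equals the number of processes (one sample per process).

```python
import warnings


def check_mini_batch_size(mini_batch_size, num_procs, policy="error"):
    """Validate a mini-batch size against the process count before launch.

    Hypothetical helper for illustration only; not an LBANN API.
    Assumes the smallest usable mini-batch size is num_procs,
    i.e. every process needs at least one sample.

    policy="error": raise if the mini-batch is too small.
    policy="warn":  warn and return num_procs instead.
    Returns the mini-batch size that should be used.
    """
    if mini_batch_size < 1 or num_procs < 1:
        raise ValueError("mini_batch_size and num_procs must be positive")
    if mini_batch_size >= num_procs:
        return mini_batch_size

    msg = (
        f"mini-batch size ({mini_batch_size}) is smaller than "
        f"the number of processes ({num_procs})"
    )
    if policy == "error":
        raise ValueError(msg)
    if policy == "warn":
        warnings.warn(f"{msg}; using mini-batch size {num_procs} instead")
        return num_procs
    raise ValueError(f"unknown policy: {policy!r}")
```

Running a check like this before the job is submitted would surface the problem immediately rather than after the allocation starts. Adjusting the process count instead of the batch size would be a third policy, omitted here.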

benson31 · May 26 '20 22:05

I could also see LBANN just being ok with zero-data processes and having them idle. Wasteful, probably, but the linear algebra should be robust to this.

benson31 · May 26 '20 22:05

My main use-case for now is a unit test with a mini-batch size of 1. So I suppose it's a bit unrepresentative of "real" use-cases, and I can get around this bug by only using 1 proc. If it's easy, I think the nicest solution would be to let some processes idle, maybe with some obnoxious warning message. In some sense, we already do idling behavior for the last mini-batch in an epoch.
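To make the idling idea concrete, here is a small sketch of which ranks would end up with no samples. It assumes a simple even block distribution of samples across ranks, which may not match how LBANN actually partitions a mini-batch; both function names are made up for illustration.

```python
def samples_per_rank(mini_batch_size, num_procs):
    """Split a mini-batch across ranks as evenly as possible.

    Assumed distribution for illustration; LBANN's real partitioning
    may differ. Lower-numbered ranks receive the leftover samples.
    """
    base, extra = divmod(mini_batch_size, num_procs)
    return [base + (1 if rank < extra else 0) for rank in range(num_procs)]


def idle_ranks(mini_batch_size, num_procs):
    """Return the ranks that receive zero samples and would have to idle."""
    counts = samples_per_rank(mini_batch_size, num_procs)
    return [rank for rank, count in enumerate(counts) if count == 0]


# Unit-test case from this thread: mini-batch size 1 on 4 processes.
print(samples_per_rank(1, 4))  # [1, 0, 0, 0]
print(idle_ranks(1, 4))        # [1, 2, 3]

# A short final mini-batch in an epoch looks the same: 10 samples with a
# mini-batch size of 4 leave 2 samples for the last step.
print(idle_ranks(10 % 4, 4))   # [2, 3]
```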

timmoon10 · May 26 '20 23:05

I somewhat disagree that it's not representative. Presumably running out-of-the-box resnet-50 (default mini-batch size of 256) on >64 nodes of Lassen will present this same error, so a naïve strong scaling study might "accidentally" encounter it. Obviously we can't catch every use-case, but some of these would be easy to catch and warn about (or catch and error on, whatever) at Python time.
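The arithmetic behind the 64-node figure can be sketched as follows. It assumes one process per GPU and four GPUs per Lassen node; neither is stated in this thread, and the function names are illustrative.

```python
def total_procs(num_nodes, procs_per_node):
    """Total process count for a job (assumes a fixed count per node)."""
    return num_nodes * procs_per_node


def max_nodes_without_idle_procs(mini_batch_size, procs_per_node):
    """Largest node count at which every process can get at least one sample."""
    return mini_batch_size // procs_per_node


# Assumption: 4 processes per node (one per GPU).
PROCS_PER_NODE = 4

print(max_nodes_without_idle_procs(256, PROCS_PER_NODE))  # 64
print(total_procs(65, PROCS_PER_NODE))                    # 260, more than 256
```

Under these assumptions, any strong-scaling run past 64 nodes with the default mini-batch size has more processes than samples per mini-batch.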

benson31 · May 26 '20 23:05