Error when mini-batch size is smaller than number of processes
I encounter an error when I change the mini-batch size in one of the Bamboo unit tests, e.g.:
https://github.com/LLNL/lbann/blob/9a9e31cb33fd5460ad6da335ff647aad79088049/bamboo/unit_tests/test_unit_layer_identity.py#L45
If the mini-batch size is less than the number of processes, the following error message shows up before training begins:
```
LBANN error on rank 3 (/usr/WS1/moon13/src/lbann/include/lbann/layers/io/input/generic_input_layer.hpp:244): I/O buffer does not contain valid samples (0)
```
The corresponding code is here:
https://github.com/LLNL/lbann/blob/9a9e31cb33fd5460ad6da335ff647aad79088049/include/lbann/layers/io/input/generic_input_layer.hpp#L244
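For intuition about why a rank ends up with zero samples: the mini-batch is split across ranks, so any mini-batch smaller than the process count leaves some ranks empty. A toy calculation, assuming an even split with the remainder going to the lowest ranks (the actual distribution in LBANN may differ):

```python
# Toy split of a mini-batch of 1 across 4 ranks (assumed distribution scheme).
mb_size, num_ranks = 1, 4
per_rank = [mb_size // num_ranks + (1 if r < mb_size % num_ranks else 0)
            for r in range(num_ranks)]
print(per_rank)  # [1, 0, 0, 0] -- ranks 1-3 have no valid samples
```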
Just to clarify: Are you suggesting that the Python code or the C++ code is problematic here (or both)?
To me, it seems that, if there is an error, it's in the Python. I think it's reasonable for LBANN to decide it can't handle zero-data processes at the C++ level. If that's an acceptable limitation for everyone, I'd argue that the Python front-end/testing code should take whatever steps are needed to handle the UI side of this:

- Is it a Python error to provide a mini-batch size smaller than the number of processes?
- Is it just a warning, after which the Python corrects it to max(requested_batch_size, smallest_possible_batch_size)?
- Does the Python adjust the process count instead of the batch size?
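For concreteness, here's a minimal sketch of the warn-and-correct option; `validate_mb_size` and its arguments are hypothetical, not existing front-end API, and the process count would have to come from whatever node/proc arguments the launcher already knows:

```python
import warnings

def validate_mb_size(requested_mb_size: int, num_procs: int) -> int:
    """Hypothetical front-end check: warn and raise the mini-batch size
    if it is smaller than the number of processes being launched."""
    if requested_mb_size < num_procs:
        warnings.warn(
            f'mini-batch size {requested_mb_size} is smaller than the '
            f'process count {num_procs}; raising it to {num_procs} so '
            f'every rank receives at least one sample')
        return num_procs
    return requested_mb_size

# e.g., at script time, before the job is ever submitted:
mb_size = validate_mb_size(1, num_procs=4)  # warns, returns 4
```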
The big problem that I see here is that the error was determinable at Python time but exposed only after the C++ code had been launched. If this were interactive, that might be OK, but it would be bad to have a huge job pending for hours only to realize, when it dies 30 seconds in, that you forgot to bump the batch size when you upped the allocation size...
I could also see LBANN just being ok with zero-data processes and having them idle. Wasteful, probably, but the linear algebra should be robust to this.
My main use-case for now is a unit test with a mini-batch size of 1. So I suppose it's a bit unrepresentative of "real" use-cases, and I can get around this bug by only using 1 proc. If it's easy, I think the nicest solution would be to let some processes idle, maybe with some obnoxious warning message. In some sense, we already do idling behavior for the last mini-batch in an epoch.
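To make the analogy concrete, a toy version of that existing last-mini-batch case, under the same assumed distribution as above:

```python
# With 10 samples and a mini-batch of 4, the last step of an epoch
# has only 2 samples, so 2 of 4 ranks already idle through it.
dataset_size, mb_size, num_ranks = 10, 4, 4
last_mb = dataset_size % mb_size or mb_size
per_rank = [last_mb // num_ranks + (1 if r < last_mb % num_ranks else 0)
            for r in range(num_ranks)]
print(per_rank)  # [1, 1, 0, 0]
```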
I somewhat disagree that it's not representative. Presumably running out-of-the-box ResNet-50 (default mini-batch size of 256) on >64 nodes of Lassen will hit this same error, so a naïve strong-scaling study might "accidentally" encounter it. Obviously we can't catch every use-case, but some of these would be easy to catch-and-warn (or catch-and-error, whatever) at Python time.
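For reference, the arithmetic behind that node count, assuming one rank per GPU and 4 GPUs per Lassen node (my assumption about the run configuration):

```python
# Largest node count at which every rank still gets a sample.
mb_size, gpus_per_node = 256, 4
print(mb_size // gpus_per_node)  # 64 -- beyond this, some ranks get 0 samples
```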