Tim Moon
My thinking is that a unit test for checkpointing should use as little of the workflow as possible. For the complete workflow, I think bit-perfect reproducibility is an excessively strict...
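To illustrate the distinction, here is a minimal sketch, not taken from the project. It assumes a checkpoint is just a list of floats written to bytes and read back, and that a full workflow would be compared with a tolerance; the comment is truncated, so the tolerance-based alternative is an assumption. The helpers `roundtrip`, `bits_equal`, and `close_enough` are hypothetical names.

```python
import math
import struct


def roundtrip(values):
    """Hypothetical stand-in for save + load: pack floats to bytes, unpack again."""
    blob = struct.pack(f"<{len(values)}d", *values)
    return list(struct.unpack(f"<{len(values)}d", blob))


def bits_equal(a, b):
    """Bit-exact comparison: the float64 byte patterns must match."""
    if len(a) != len(b):
        return False
    return struct.pack(f"<{len(a)}d", *a) == struct.pack(f"<{len(b)}d", *b)


def close_enough(a, b, rel_tol=1e-6, abs_tol=1e-8):
    """Tolerance-based comparison (tolerances here are arbitrary assumptions)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# Isolated checkpoint test: a save/load round trip can be checked bit for bit.
weights = [0.1, 1e-30, 3.141592653589793]
print(bits_equal(roundtrip(weights), weights))

# Full workflow: the same sum in a different order differs in the low bits,
# which is one reason bit-exact checks can fail on an otherwise correct run.
run_a = [(0.1 + 0.2) + 0.3]
run_b = [0.1 + (0.2 + 0.3)]
print(bits_equal(run_a, run_b), close_enough(run_a, run_b))
```

The round trip exercises only serialization, so an exact match is a reasonable requirement there. The second comparison stands in for anything that reorders floating-point reductions, such as a different process count.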
I suspect this is just poor optimization. We haven't put much effort into CNNs on CPUs.
That said, our LeNet integration test takes 5 sec/epoch on 2 Catalyst nodes. I wonder what's causing the difference.
What is the flavor of the error messages, and are they all unique issues?
I would prefer if we made a copy of `execution_algorithms/ltfb/checkpoint_file.{cpp,hpp}` so we can have a dedicated class for `checkpoint_file_dyad`. That might help avoid some complicated control flow.
#91 is related.
Related: #1123.
This is a good idea. There's a bit of hairiness since the Bamboo tests are intended to switch between frontends (although this is broken, see #1289). This makes it hard...
NCHW vs. NHWC is somewhat orthogonal to my concern, since both are using C-style tensor notation. I'm thinking about our internal representation for tensors and our API.
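As a small illustration of why the two layouts share a notation: both list dimensions in C-style (row-major) order, where the last listed dimension varies fastest in memory, and they differ only in which logical axis sits in which position. This sketch is not the project's tensor representation; `row_major_strides` and `offset` are hypothetical helpers.

```python
def row_major_strides(dims):
    """Strides (in elements) for C-style ordering: the last dim has stride 1."""
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides


def offset(index, dims):
    """Flat memory offset of a multi-index under row-major ordering."""
    return sum(i * s for i, s in zip(index, row_major_strides(dims)))


N, C, H, W = 2, 3, 4, 5

# The same logical element (n=0, c=1, h=0, w=0) under each layout.
nchw = offset((0, 1, 0, 0), (N, C, H, W))  # index order: n, c, h, w
nhwc = offset((0, 0, 0, 1), (N, H, W, C))  # index order: n, h, w, c
print(nchw, nhwc)
```

The stride rule is identical in both cases. Only the permutation of the dimension list changes, which is why the choice between the two layouts is a separate question from how dimensions are written down.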
Running T5 41B on 32 Selene nodes, I see a 1.2x speedup over the pure data-parallel impl, 66% of expected memory savings, and nearly identical loss values after 20 steps....