Tim Moon

227 comments by Tim Moon

My thinking is that a unit test for checkpointing should use as little of the workflow as possible. For the complete workflow, I think bit-perfect reproducibility is an excessively strict...

I suspect this is just poor optimization. We haven't put much effort into CNNs on CPUs.

That said, our LeNet integration test takes 5 sec/epoch on 2 Catalyst nodes. I wonder what's causing the difference.

What is the flavor of the error messages, and are they all unique issues?

I would prefer if we made a copy of `execution_algorithms/ltfb/checkpoint_file.{cpp,hpp}` so we can have a dedicated class for `checkpoint_file_dyad`. That might help avoid some complicated control flow.

This is a good idea. There's a bit of hairiness since the Bamboo tests are intended to switch between frontends (although this is broken, see #1289). This makes it hard...

NCHW vs. NHWC is somewhat orthogonal to my concern, since both use C-style tensor notation. I'm thinking about our internal representation for tensors and our API.
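
To illustrate the distinction, here is a minimal sketch that assumes "C-style" means row-major ordering, where the last listed dimension is contiguous in memory. The helper names (`row_major_strides`, `column_major_strides`, `flat_offset`) are hypothetical and are not part of any library API mentioned here. NCHW and NHWC both follow the same row-major rule and differ only in the order the dimensions are listed, whereas a column-major (Fortran-style) convention changes the rule itself.

```python
def row_major_strides(dims):
    """Element strides for a C-style layout: the last dimension is contiguous."""
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides


def column_major_strides(dims):
    """Element strides for a Fortran-style layout: the first dimension is contiguous."""
    strides = [1] * len(dims)
    for i in range(1, len(dims)):
        strides[i] = strides[i - 1] * dims[i - 1]
    return strides


def flat_offset(index, strides):
    """Offset of a multi-dimensional index in the flattened buffer."""
    return sum(i * s for i, s in zip(index, strides))


# Hypothetical tensor sizes chosen only for illustration.
n, c, h, w = 2, 3, 4, 5

nchw_strides = row_major_strides([n, c, h, w])   # width varies fastest
nhwc_strides = row_major_strides([n, h, w, c])   # channel varies fastest
fortran_strides = column_major_strides([n, c, h, w])  # sample varies fastest

print("NCHW (row-major):    ", nchw_strides)
print("NHWC (row-major):    ", nhwc_strides)
print("NCHW (column-major): ", fortran_strides)
```

In both row-major cases the rightmost listed dimension has stride 1, so switching between NCHW and NHWC is a permutation of dimensions. Switching to column-major notation would instead change which end of the dimension list is contiguous.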

Running T5 41B on 32 Selene nodes, I see a 1.2x speedup over the pure data-parallel impl, 66% of expected memory savings, and nearly identical loss values after 20 steps....