Tim Moon comments

Results: 227 comments by Tim Moon

RNG State should belong to execution context

My thinking is that a unit test for checkpointing should use as little of the workflow as possible. For the complete workflow, I think bit-perfect reproducibility is an excessively strict...
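To illustrate a checkpointing test that touches as little of the workflow as possible, here is a minimal sketch using only Python's standard `random` module. The `ExecutionContext` class and its `save`/`load` methods are hypothetical names chosen for this example, not the project's actual API. The test checks only that the RNG state survives a save/restore round trip.

```python
import random


class ExecutionContext:
    """Hypothetical execution context that owns its own RNG (illustration only)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.step = 0

    def advance(self):
        """Consume one random number and bump the step counter."""
        self.step += 1
        return self.rng.random()

    def save(self):
        """Return a checkpoint: the step count plus the RNG state."""
        return {"step": self.step, "rng_state": self.rng.getstate()}

    def load(self, checkpoint):
        """Restore the step count and RNG state from a checkpoint."""
        self.step = checkpoint["step"]
        self.rng.setstate(checkpoint["rng_state"])


def roundtrip_matches(seed=1234, warmup=5, draws=10):
    """Check that a restored context replays the same random sequence."""
    ctx = ExecutionContext(seed)
    for _ in range(warmup):
        ctx.advance()
    checkpoint = ctx.save()
    expected = [ctx.advance() for _ in range(draws)]

    # Start from a different seed so a match can only come from the checkpoint.
    restored = ExecutionContext(seed=0)
    restored.load(checkpoint)
    actual = [restored.advance() for _ in range(draws)]
    return expected == actual and restored.step == ctx.step
```

Because the RNG state lives in the context object, the test needs no model, data reader, or trainer to exercise the round trip.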

Single-node CPU-only performance is not good.

I suspect this is just poor optimization. We haven't put much effort into CNNs on CPUs.

Single-node CPU-only performance is not good.

That said, our LeNet integration test takes 5 sec/epoch on 2 Catalyst nodes. I wonder what's causing the difference.

(Slightly less low priority) MPI Catch test output

What is the flavor of the error messages, and are they all unique issues?

Add DYAD for checkpointed-file-based LTFB

I would prefer if we made a copy of `execution_algorithms/ltfb/checkpoint_file.{cpp,hpp}` so we can have a dedicated class for `checkpoint_file_dyad`. That might help avoid some complicated control flow.

Change the fp_setup_ouput function in layer class to use size_t

#91 is related.

Clean up error reporting throughout

Related: #1123.

Test Python frontend

This is a good idea. There's a bit of hairiness since the Bamboo tests are intended to switch between frontends (although this is broken, see #1289). This makes it hard...

Notation for tensor and matrix dimensions is inconsistent

NCHW vs. NHWC is somewhat orthogonal to my concern, since both are using C-style tensor notation. I'm thinking about our internal representation for tensors and our API.
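As a small illustration of why NCHW and NHWC are both C-style, the sketch below computes flat memory offsets under a C-style (row-major) convention, where the last listed dimension is contiguous. The helper names `c_style_strides` and `flat_offset` are made up for this example. The two layouts differ only in the order the dimensions are listed; the indexing rule is the same.

```python
def c_style_strides(shape):
    """Strides in elements for a C-style layout: the last dimension is contiguous."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def flat_offset(index, shape):
    """Flat offset of a multi-dimensional index under C-style strides."""
    return sum(i * s for i, s in zip(index, c_style_strides(shape)))


# Same logical tensor, two dimension orderings (N=2, C=3, H=4, W=5).
nchw_shape = (2, 3, 4, 5)
nhwc_shape = (2, 4, 5, 3)

# Same logical element (n=0, c=1, h=2, w=3), listed in each layout's order.
nchw_offset = flat_offset((0, 1, 2, 3), nchw_shape)
nhwc_offset = flat_offset((0, 2, 3, 1), nhwc_shape)
```

In both cases the rightmost dimension has stride 1 (W for NCHW, C for NHWC), which is what makes the notation C-style regardless of which ordering is chosen.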

Support distributed Adam with T5 and support overlapped grad reductions with pipeline parallelism

Running T5 41B on 32 Selene nodes, I see a 1.2x speedup over the pure data-parallel impl, 66% of expected memory savings, and nearly identical loss values after 20 steps....