Will Constable
I would rather not introduce an enum value for 'NOT_DEFINED' since it's essentially just a short-term hack. If you prefer to land this PR first and then implement the part...
Can you give a pointer to their mention of this? I'm not too surprised about initialize CUDA, fork, initialize CUDA being unsupported. Hoping that it's at least OK to initialize...
to me, the relevant lines of the log are
```
[rank0]:2024-03-26 19:44:12,171 - root - INFO - Saving a checkpoint at step 1000
[rank0]:[rank0]:[E326 19:44:37.537349911 ProcessGroupNCCL.cpp:1332] [PG 0 Rank 0]...
```
I still think we need a design review for DCP with regard to timeouts. Directionally, we want to have shorter timeouts when possible to get faster error signals. We should...
just curious, is this gonna land soon or does it have some risk or unfinished business? also looks like this could use a rebase. i got a little confused applying...
> It would be triggered in the rotary embedding computation if this PR is landed

oh, is this related to dispatching for complex numbers by any chance?
We discussed offline that when training 'for real' on a cluster, the auto-restart behavior would be messed up if the load path points to another folder, so we need to...
is it safe to skip the if and just call .contiguous() all the time? maybe that is a no-op in the case that x is already contiguous?
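A quick way to check the question above, as a minimal sketch: `torch.Tensor.contiguous()` is documented to return the tensor itself when it is already contiguous, so calling it unconditionally should only copy in the non-contiguous case. The helper name `ensure_contiguous` and the example tensors are made up for illustration and are not from the code under review.

```python
import torch


def ensure_contiguous(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: no `if x.is_contiguous()` guard, just call it always.
    return x.contiguous()


# Already contiguous: a freshly created, row-major tensor.
a = torch.arange(6).reshape(2, 3)
same = ensure_contiguous(a)

# Non-contiguous: a transposed view that shares storage with `a`.
b = a.t()
copied = ensure_contiguous(b)

# No copy for the contiguous input: same underlying storage pointer.
print(same.data_ptr() == a.data_ptr())
# A real copy for the non-contiguous input: new storage, same values.
print(copied.data_ptr() != b.data_ptr(), torch.equal(copied, b))
```

If that holds for the tensors in question, the conditional only saves a cheap Python-level call, so dropping it looks safe from a correctness standpoint.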
some attempts to fix this:
(1) gets rid of conditionals on dynamic shapes, which gets me past the first tracing errors https://github.com/pytorch/torchtitan/pull/300
(2) does a hack for computing sm_count from...
https://github.com/pytorch/pytorch/pull/123732 was intended to help this case but isn't quite enough. 1) #123732 does not appear to help for calls to `.float()` - it only seems to work for explicit...