Carlos Mocholí
If we support cross-reduction, then using the same key for multiple dataloaders is not an error but a feature, as it would be the mechanism to do it.
If the 10 dataloaders use the same key and we support (3), the process would be:
1. Wait for training end
2. Take the average over the 10 values (by...
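For illustration, a minimal sketch of that cross-reduction, assuming each of the 10 dataloaders has already produced one reduced value under the shared key (the key name and values below are made up):

```python
import torch

# Hypothetical per-dataloader results collected at the end of training,
# all logged by the 10 dataloaders under the same key.
per_dataloader = {f"val_metric/dataloader_idx_{i}": torch.rand(()) for i in range(10)}

# Cross-reduction: average the 10 values and report a single value for the shared key.
cross_reduced = torch.stack(list(per_dataloader.values())).mean()
print({"val_metric": cross_reduced.item()})
```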
Not for now: https://github.com/NVIDIA/TransformerEngine/issues/401
TransformerEngine, and then we would need to integrate whatever is changed into Lightning. cc @sbhavani in case you know about the progress for this
@sbhavani I see https://github.com/NVIDIA/TransformerEngine/tree/main/examples/pytorch/fsdp exists now. Is your last comment still valid?
So do you want me to add a job config argument for `with_stack` only, or for both?
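For reference, `with_stack` here refers to the PyTorch profiler flag that records Python stack traces per op; the job config argument would presumably just forward that boolean. A minimal sketch with a placeholder model:

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

# Placeholder model and inputs, just to have something to profile.
model = nn.Linear(16, 16)
inputs = torch.randn(4, 16)

# `with_stack=True` records source information / Python stack traces for each op.
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    model(inputs)

# Grouping by stack makes the extra information visible in the report.
print(prof.key_averages(group_by_stack_n=5).table(sort_by="cpu_time_total"))
```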
I can work around this by setting

```python
tmodel._lc_cd.process_group_for_ddp = tmodel._lc_cd.fn.process_group_for_ddp
```

since `thunder` gets this information at `jit()` time: https://github.com/Lightning-AI/lightning-thunder/blob/94c94948b79875ba5247b5c986afa088d970a49d/thunder/common.py#L224-L226

So my question is: could we delay accessing this...
> Currently the ddp transformation is applied during the JITing (i.e. while the interpreter runs). This is fundamentally incompatible with what you're trying to do.

I would appreciate some pointers...
We still need to support `jit(ddp(model))`, as this is basically what happens whenever you jit a function and not the model. What I'm advocating for is something like `jit(ddp(undo_jit(jit(model))))`. Where...
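As a rough sketch of that flow (assuming a process group is already initialized; `undo_jit` is hypothetical and does not exist in thunder today, while `thunder.jit` and `thunder.distributed.ddp` are the real entry points):

```python
from torch import nn
import thunder
from thunder.distributed import ddp

# Assumes torch.distributed is initialized (e.g. launched via torchrun).
model = nn.Linear(16, 16)  # placeholder module

# Supported today: the ddp transform is applied before jitting.
tmodel = thunder.jit(ddp(model))

def undo_jit(jitted):
    # Hypothetical helper, sketch only: recover the module that was handed to
    # thunder.jit(). `_lc_cd.fn` is where the wrapped callable lives per the
    # workaround above; thunder does not expose an official API for this.
    return jitted._lc_cd.fn

# Proposed flow: jit happens first (e.g. inside a framework), then ddp is applied.
jitted = thunder.jit(model)
tmodel = thunder.jit(ddp(undo_jit(jitted)))
```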
> I would like to train a lit-gpt model with a context length of 4096. I want to confirm that the only thing I need to do is to modify...