Carlos Mocholí


If we support cross-reduction, then using the same key for multiple dataloaders is not an error but a feature, as it would be the mechanism to do it.

If the 10 dataloaders use the same key and we support (3), the process would be:

1. Wait for training end
2. Take the average over the 10 values (by...
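For context, doing this by hand today looks roughly like the sketch below: keep one value per dataloader, then take a plain mean over those values at the end of the epoch. This is only a sketch, assuming the standard `LightningModule` validation hooks; `compute_loss` is a hypothetical helper, and built-in cross-dataloader reduction in `self.log` would make it unnecessary.

```python
import collections

import torch
import lightning as L


class CrossDataloaderAverage(L.LightningModule):
    """Sketch: reduce each dataloader to one value, then average those values."""

    def on_validation_epoch_start(self) -> None:
        self._per_dataloader = collections.defaultdict(list)

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        loss = self.compute_loss(batch)  # hypothetical helper
        self._per_dataloader[dataloader_idx].append(loss.detach())

    def on_validation_epoch_end(self) -> None:
        # one number per dataloader, then the average over the (e.g. 10) dataloaders
        per_dl = [torch.stack(values).mean() for values in self._per_dataloader.values()]
        self.log("my_metric", torch.stack(per_dl).mean())
```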

Not for now: https://github.com/NVIDIA/TransformerEngine/issues/401

TransformerEngine, and then we would need to integrate whatever is changed into Lightning. cc @sbhavani in case you know about the progress for this

@sbhavani I see https://github.com/NVIDIA/TransformerEngine/tree/main/examples/pytorch/fsdp exists now. Is your last comment still valid?

So do you want me to add a job config argument for `with_stack` only? Or for both?
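For reference, `with_stack` is a flag on `torch.profiler.profile` that records the Python source location of each op. Plumbing it through a config value could look roughly like the sketch below; the function name and how the value reaches it from the job config are assumptions, not the existing implementation.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def run_profiled(step_fn, with_stack: bool = False) -> None:
    # with_stack=True records file/line for every op, at some extra overhead,
    # which is why it would default to off and be toggled via the job config
    with profile(activities=[ProfilerActivity.CPU], with_stack=with_stack) as prof:
        step_fn()
    # grouping by stack only makes sense when stacks were actually recorded
    print(prof.key_averages(group_by_stack_n=5 if with_stack else 0).table(row_limit=10))


run_profiled(lambda: torch.randn(128, 128) @ torch.randn(128, 128), with_stack=True)
```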

I can work around this by setting

```python
tmodel._lc_cd.process_group_for_ddp = tmodel._lc_cd.fn.process_group_for_ddp
```

since `thunder` gets this information at `jit()` time: https://github.com/Lightning-AI/lightning-thunder/blob/94c94948b79875ba5247b5c986afa088d970a49d/thunder/common.py#L224-L226

So my question is: could we delay accessing this...

> Currently the ddp transformation is applied during the JITing (i.e. while the interpreter runs). This is fundamentally incompatible with what you're trying to do.

I would appreciate some pointers...

We still need to support `jit(ddp(model))`, as this is basically what happens whenever you jit a function and not the model. What I'm advocating for is something like `jit(ddp(undo_jit(jit(model))))`, where...
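For reference, the `jit(ddp(model))` composition being discussed looks roughly like this with lightning-thunder. This is only a sketch: it assumes a multi-GPU NCCL setup launched via `torchrun`, and the proposed `undo_jit` does not exist.

```python
import os

import torch
import torch.distributed as dist
import thunder
from thunder.distributed import ddp

# assumes a launch like `torchrun --nproc-per-node=N train.py`
dist.init_process_group(backend="nccl")
device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

model = torch.nn.Linear(8, 8, device=device)
# ddp() is applied before jit(): the DDP transform runs while thunder traces the
# module, which is why the process group must already be known at jit() time
tmodel = thunder.jit(ddp(model))
out = tmodel(torch.randn(2, 8, device=device))
```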

> I would like to train a lit-gpt model with a context length of 4096. I want to confirm that the only thing I need to do is to modify...