Alp Dener
@ksivaman Did we implement/merge lazy init for TE/PyTorch yet? If so, I can rebase, test and merge this to re-enable the fusion with checkpointing.
Yes, I was able to get everything working for PR #760 yesterday. I'm doing some final code cleanup at the moment and hope to merge it next week after final...
@tohinz I will take a look at how we can automatically handle this in the TE checkpoint tomorrow. In the meantime, you should be able to make this work via...
Hi @tohinz -- could you confirm if PR #791 resolves your issue? Thanks!
@tohinz -- we've merged PR #791 today, so I'm closing the issue. Thank you for reporting it!
@yongyanrao #596 recently added deferred initialization via `device='meta'` to improve FSDP support for large models. This feature delays memory allocation on the device until the FSDP wrap internally calls...
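A minimal sketch of the intended usage, assuming a `te.LayerNormMLP` module and FSDP's default materialization of meta-device parameters (the module choice, sizes, and wrapping arguments here are illustrative, not the exact code from the PR):

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Parameters are created on the 'meta' device, so no GPU memory is allocated yet.
model = te.LayerNormMLP(4096, 16384, device="meta")

# FSDP materializes and shards the parameters when it wraps the module.
model = FSDP(model, device_id=torch.cuda.current_device())
```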
@denizokt [`fp8_model_init()`](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html#transformer_engine.pytorch.fp8_model_init) is not supported with FSDP at the moment. NCCL itself does not support 8-bit floats (see [this discussion](https://github.com/NVIDIA/nccl/issues/1026) for more detail), and FSDP needs to upcast TE Fp8...
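For reference, outside of FSDP it is used as a context manager; a minimal sketch (layer sizes are illustrative):

```python
import transformer_engine.pytorch as te

# Modules created inside fp8_model_init() keep only FP8 copies of their weights,
# which is what makes them incompatible with FSDP's NCCL-based sharded comms today.
with te.fp8_model_init(enabled=True):
    model = te.Linear(4096, 4096)
```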
@kuangdao TE in general supports `TP size < world size`, but the comm+GEMM overlap has some unique restrictions. The underlying device-to-device comms code currently assumes `TP size == world size`....
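To illustrate the current constraint, a hedged sketch of the process-group setup the overlap code expects, i.e. a tensor-parallel group spanning the entire job (the group-creation details are illustrative):

```python
import torch.distributed as dist

dist.init_process_group(backend="nccl")

# The comm+GEMM overlap bootstrapping currently assumes that the tensor-parallel
# group covers every rank in the job, i.e. TP size == world size.
world_size = dist.get_world_size()
tp_group = dist.new_group(ranks=list(range(world_size)))
```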
@kuangdao -- we merged some changes to comm+GEMM overlap in the last month specifically to address multi-node mixed DP/TP use-cases. This feature is still restricted to `tp_size`...
Hi @umiswing -- to clarify, the comm+GEMM overlap is currently only possible with single-node **tensor**-parallelism, but it does support multi-node **data**-parallelism. For reference, [`examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py`](https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py) initializes `te.LayerNormMLP` with a single-node `tp_group`,...
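As a rough sketch of that layout (the group-splitting logic and module arguments below are illustrative assumptions, not a copy of the example script):

```python
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_size = int(os.environ.get("LOCAL_WORLD_SIZE", torch.cuda.device_count()))
num_nodes = dist.get_world_size() // local_size
torch.cuda.set_device(rank % local_size)

# Tensor parallelism stays within a single node: every rank participates in the
# collective new_group() calls, but keeps only the group for its own node.
tp_group = None
for node in range(num_nodes):
    ranks = list(range(node * local_size, (node + 1) * local_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        tp_group = group

# Data parallelism then spans the matching ranks across nodes (handled separately,
# e.g. by DDP/FSDP), while the TE module only sees the intra-node TP group.
model = te.LayerNormMLP(4096, 16384, set_parallel_mode=True, tp_group=tp_group)
```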