Alp Dener
@ksivaman Did we implement/merge lazy init for TE/PyTorch yet? If so, I can rebase, test and merge this to re-enable the fusion with checkpointing.
Yes, I was able to get everything working for PR #760 yesterday. I'm doing some final code cleanup at the moment and hope to merge it next week after final...
@tohinz I will take a look at how we can automatically handle this in the TE checkpoint tomorrow. In the meantime, you should be able to make this work via...
Hi @tohinz -- could you confirm if PR #791 resolves your issue? Thanks!
@tohinz -- we've merged PR #791 today, so I'm closing the issue. Thank you for reporting it!
@yongyanrao #596 recently added deferred initialization via `device='meta'` to improve FSDP support for large models. This feature delays memory allocation on the device until the FSDP wrap internally calls...
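A minimal sketch of the intended usage, assuming a `te.LayerNormMLP` module and FSDP's default materialization of meta-device parameters (the module choice, sizes, and wrapping arguments here are illustrative, not the exact code from the PR):

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Parameters are created on the 'meta' device, so no GPU memory is allocated yet.
model = te.LayerNormMLP(4096, 16384, device="meta")

# FSDP materializes and shards the parameters when it wraps the module.
model = FSDP(model, device_id=torch.cuda.current_device())
```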
@denizokt [`fp8_model_init()`](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html#transformer_engine.pytorch.fp8_model_init) is not supported with FSDP at the moment. NCCL itself does not support 8-bit floats (see [this discussion](https://github.com/NVIDIA/nccl/issues/1026) for more detail), and FSDP needs to upcast TE Fp8...
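For reference, outside of FSDP it is used as a context manager; a minimal sketch (layer sizes are illustrative):

```python
import transformer_engine.pytorch as te

# Modules created inside fp8_model_init() keep only FP8 copies of their weights,
# which is what makes them incompatible with FSDP's NCCL-based sharded comms today.
with te.fp8_model_init(enabled=True):
    model = te.Linear(4096, 4096)
```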
@kuangdao TE in general supports `TP size < world size`, but the comm+GEMM overlap has some unique restrictions. The underlying device-to-device comms code currently assumes `TP size == world size`....
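To illustrate the current constraint, a hedged sketch of the process-group setup the overlap code expects, i.e. a tensor-parallel group spanning the entire job (the group-creation details are illustrative):

```python
import torch.distributed as dist

dist.init_process_group(backend="nccl")

# The comm+GEMM overlap bootstrapping currently assumes that the tensor-parallel
# group covers every rank in the job, i.e. TP size == world size.
world_size = dist.get_world_size()
tp_group = dist.new_group(ranks=list(range(world_size)))
```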
@kuangdao -- we merged some changes to comm+GEMM overlap in the last month specifically to address multi-node mixed DP/TP use-cases. This feature is still restricted to `tp_size`...
Hi @umiswing -- to clarify, the comm+GEMM overlap is currently only possible with single-node **tensor**-parallelism, but it does support multi-node **data**-parallelism. For reference, [`examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py`](https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py) initializes `te.LayerNormMLP` with a single-node `tp_group`,...
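As a rough sketch of that layout (the group-splitting logic and module arguments below are illustrative assumptions, not a copy of the example script):

```python
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
local_size = int(os.environ.get("LOCAL_WORLD_SIZE", torch.cuda.device_count()))
num_nodes = dist.get_world_size() // local_size
torch.cuda.set_device(rank % local_size)

# Tensor parallelism stays within a single node: every rank participates in the
# collective new_group() calls, but keeps only the group for its own node.
tp_group = None
for node in range(num_nodes):
    ranks = list(range(node * local_size, (node + 1) * local_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        tp_group = group

# Data parallelism then spans the matching ranks across nodes (handled separately,
# e.g. by DDP/FSDP), while the TE module only sees the intra-node TP group.
model = te.LayerNormMLP(4096, 16384, set_parallel_mode=True, tp_group=tp_group)
```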