Alp Dener
Hi @liuhatry -- I filed a PR with a fix for this issue. Could you confirm if it works for you? Thanks!
@liuhatry -- thanks for confirming. I merged the PR, so TE/main should now have all the fixes we've discussed here. Please feel free to close the issue if everything...
/te-ci pytorch L0 L1
/te-ci pytorch L0 L1
Closing in favor of #1846
Hi @MoFHeka -- the JAX/XLA custom op for `te_scaled_upper_triang_masked_softmax_forward` is implemented [here](https://github.com/NVIDIA/TransformerEngine/blob/5ee98175788d2c3c3945980e0c12fb8dfc6ea94d/transformer_engine/jax/csrc/extensions/softmax.cpp#L73), exposed via PyBind11 [here](https://github.com/NVIDIA/TransformerEngine/blob/5ee98175788d2c3c3945980e0c12fb8dfc6ea94d/transformer_engine/jax/csrc/extensions/pybind.cpp#L41) and registered with XLA for the CUDA platform [here](https://github.com/NVIDIA/TransformerEngine/blob/5ee98175788d2c3c3945980e0c12fb8dfc6ea94d/transformer_engine/jax/cpp_extensions/custom_call.py#L20-L21). TE/Flax modules invoke this custom...
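For reference, the registration step roughly follows the pattern below. This is a simplified sketch, assuming the compiled PyBind11 extension exposes a `registrations()` dict mapping op names to PyCapsules; the exact module and function names in TE may differ from what is shown here.

```python
# Sketch: registering PyBind11-exposed custom ops with XLA for the CUDA platform.
# Assumes the compiled extension provides registrations() -> {op_name: PyCapsule};
# module/function names here are illustrative, not the exact TE layout.
from jax.lib import xla_client
import transformer_engine_jax  # compiled PyBind11 extension (name is an assumption)

for name, capsule in transformer_engine_jax.registrations().items():
    # e.g. name == "te_scaled_upper_triang_masked_softmax_forward"
    xla_client.register_custom_call_target(name, capsule, platform="CUDA")
```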
@MaciejBalaNV Transformer Engine modules that are initialized under `te.pytorch.fp8_model_init()` still need to be executed under `te.pytorch.fp8_autocast()` with an FP8 recipe for operations that we have to perform in higher precision....
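A minimal usage sketch of that pattern (layer sizes and recipe settings here are purely illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

# fp8_model_init() creates the module's parameters directly in FP8 ...
with te.fp8_model_init(enabled=True):
    layer = te.Linear(1024, 1024)

inp = torch.randn(16, 1024, device="cuda")

# ... but the forward pass still has to run under fp8_autocast() with an FP8
# recipe, so the operations that must stay in higher precision are handled correctly.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
```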
TP overlap currently requires sequence parallelism and does not impose any attention layout/format restrictions, except that the sequence length must be constant and evenly divisible by the TP size. Since...
Unfortunately the current implementation does not support variable sequence lengths, so you would have to pad your sequences up to a static maximum. Theoretically there is no reason why it...
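In the meantime, a simple workaround is to pad each batch up to a fixed maximum length that is a multiple of the TP size. Something along these lines would do it (an illustrative helper, not part of the TE API):

```python
import torch

def pad_to_static_max(seq: torch.Tensor, max_seqlen: int, tp_size: int,
                      pad_value: float = 0.0) -> torch.Tensor:
    """Pad a [s, b, h] tensor up to a static max_seqlen rounded to a multiple of tp_size.

    Illustrative helper only, not a Transformer Engine API.
    """
    # Round the static maximum up to the nearest multiple of the TP size.
    target = ((max_seqlen + tp_size - 1) // tp_size) * tp_size
    pad_len = target - seq.shape[0]
    if pad_len < 0:
        raise ValueError(f"sequence length {seq.shape[0]} exceeds static max {target}")
    if pad_len == 0:
        return seq
    pad = seq.new_full((pad_len, *seq.shape[1:]), pad_value)
    return torch.cat([seq, pad], dim=0)

# Example: a length-1000 batch padded to 1024, which is divisible by TP size 8.
x = torch.randn(1000, 2, 4096)
x_padded = pad_to_static_max(x, max_seqlen=1024, tp_size=8)
assert x_padded.shape[0] % 8 == 0
```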
I hope to integrate cuBlasMp into TE by mid-December at the latest. There's a chance this might support variable sequence lengths out of the box, but otherwise it would have...