Alp Dener
Hi @liuhatry -- I filed a PR with a fix for this issue. Could you confirm if it works for you? Thanks!
@liuhatry -- thanks for confirming. I merged the PR, so TE/main should now have all the fixes we've discussed here. Please feel free to close the issue if everything...
/te-ci pytorch L0 L1
/te-ci pytorch L0 L1
Closing in favor of #1846
Hi @MoFHeka -- the JAX/XLA custom op for `te_scaled_upper_triang_masked_softmax_forward` is implemented [here](https://github.com/NVIDIA/TransformerEngine/blob/5ee98175788d2c3c3945980e0c12fb8dfc6ea94d/transformer_engine/jax/csrc/extensions/softmax.cpp#L73), exposed via PyBind11 [here](https://github.com/NVIDIA/TransformerEngine/blob/5ee98175788d2c3c3945980e0c12fb8dfc6ea94d/transformer_engine/jax/csrc/extensions/pybind.cpp#L41) and registered with XLA for the CUDA platform [here](https://github.com/NVIDIA/TransformerEngine/blob/5ee98175788d2c3c3945980e0c12fb8dfc6ea94d/transformer_engine/jax/cpp_extensions/custom_call.py#L20-L21). TE/Flax modules invoke this custom...
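For reference, the registration step roughly follows the pattern below. This is a simplified sketch, assuming the compiled PyBind11 extension exposes a `registrations()` dict mapping op names to PyCapsules; the exact module and function names in TE may differ from what is shown here.

```python
# Sketch: registering PyBind11-exposed custom ops with XLA for the CUDA platform.
# Assumes the compiled extension provides registrations() -> {op_name: PyCapsule};
# module/function names here are illustrative, not the exact TE layout.
from jax.lib import xla_client
import transformer_engine_jax  # compiled PyBind11 extension (name is an assumption)

for name, capsule in transformer_engine_jax.registrations().items():
    # e.g. name == "te_scaled_upper_triang_masked_softmax_forward"
    xla_client.register_custom_call_target(name, capsule, platform="CUDA")
```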
@MaciejBalaNV Transformer Engine modules that are initialized under `te.pytorch.fp8_model_init()` still need to be executed under `te.pytorch.fp8_autocast()` with an FP8 recipe for operations that we have to perform in higher precision....
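A minimal usage sketch of that pattern (layer sizes and recipe settings here are purely illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

# fp8_model_init() creates the module's parameters directly in FP8 ...
with te.fp8_model_init(enabled=True):
    layer = te.Linear(1024, 1024)

inp = torch.randn(16, 1024, device="cuda")

# ... but the forward pass still has to run under fp8_autocast() with an FP8
# recipe, so the operations that must stay in higher precision are handled correctly.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
```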
TP overlap currently requires sequence parallelism and does not impose any attention layout/format restrictions, except that the sequence length must be constant and evenly divisible by the TP size. Since...
Unfortunately the current implementation does not support variable sequence lengths, so you would have to pad your sequences up to a static maximum. Theoretically there is no reason why it...
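In the meantime, a simple workaround is to pad each batch up to a fixed maximum length that is a multiple of the TP size. Something along these lines would do it (an illustrative helper, not part of the TE API):

```python
import torch

def pad_to_static_max(seq: torch.Tensor, max_seqlen: int, tp_size: int,
                      pad_value: float = 0.0) -> torch.Tensor:
    """Pad a [s, b, h] tensor up to a static max_seqlen rounded to a multiple of tp_size.

    Illustrative helper only, not a Transformer Engine API.
    """
    # Round the static maximum up to the nearest multiple of the TP size.
    target = ((max_seqlen + tp_size - 1) // tp_size) * tp_size
    pad_len = target - seq.shape[0]
    if pad_len < 0:
        raise ValueError(f"sequence length {seq.shape[0]} exceeds static max {target}")
    if pad_len == 0:
        return seq
    pad = seq.new_full((pad_len, *seq.shape[1:]), pad_value)
    return torch.cat([seq, pad], dim=0)

# Example: a length-1000 batch padded to 1024, which is divisible by TP size 8.
x = torch.randn(1000, 2, 4096)
x_padded = pad_to_static_max(x, max_seqlen=1024, tp_size=8)
assert x_padded.shape[0] % 8 == 0
```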
I hope to integrate cuBlasMp into TE by mid-December at the latest. There's a chance this might support variable sequence lengths out of the box, but otherwise it would have...