Marks101
Dear Triton team, I am currently debugging an issue with a kernel that is supposed to replace torch.roll followed by zeroing out the first row of a 2D matrix. This...
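The excerpt is cut off here, but a plain-PyTorch reference for the operation it describes (roll, then zero the first row) might look like the sketch below; the shift direction and axis are assumptions, since they are not spelled out above.

```python
import torch

def roll_and_zero_first_row(x: torch.Tensor) -> torch.Tensor:
    """Reference behaviour the Triton kernel is meant to replace (assumed)."""
    out = torch.roll(x, shifts=1, dims=0)  # row i receives row i-1
    out[0, :] = 0                          # discard the wrapped-around row
    return out
```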
Hello team, We are training our GPT-style models to produce additional outputs besides the logit tensor. One use case for us is token-wise classification. Another use case is attention...
### System Info
- H100 DGX
- CUDA 12.1
- TensorRT-LLM 0.10.0.dev2024041600

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [ ] My own...
Hi team, I am wondering about the definition of the attention mask in transformer-engine. I did not find an explanation in the docs. Does True mean that the position takes...
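For reference, the two possible readings of a boolean attention mask that this question is about can be illustrated as follows; the sketch does not assert which convention transformer-engine actually uses.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 4, 4)                    # (batch, query, key) attention scores
mask = torch.tensor([True, True, False, False])  # boolean mask over the key positions

# Reading A: True means "this position participates" -> suppress the False entries.
probs_a = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

# Reading B: True means "this position is masked out" -> suppress the True entries.
probs_b = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
```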
[PyTorch] FlashAttention: causal masking enforced in cross attention due to sliding window attention
Hi transformer-engine team, we noticed that in a decoder layer with `self_attn_mask_type="causal"`, the following line https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/transformer.py#L594 sets `window_size = (-1, 0)`. This `window_size` is passed to both self-attention as...
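To illustrate why `(-1, 0)` implies causal masking: under the `(left, right)` window convention used by flash-attention-style kernels, `-1` means unbounded on that side. A minimal sketch, assuming queries and keys share the same positional indexing as in self-attention:

```python
import torch

def sliding_window_mask(seq_len: int, window_size=(-1, 0)) -> torch.Tensor:
    """Boolean mask (True = may attend) for a (left, right) sliding window.

    A value of -1 means "unbounded" on that side, so window_size=(-1, 0)
    allows every position to the left and none to the right, i.e. a causal mask.
    """
    left, right = window_size
    q_idx = torch.arange(seq_len).unsqueeze(1)
    k_idx = torch.arange(seq_len).unsqueeze(0)
    allowed = torch.ones(seq_len, seq_len, dtype=torch.bool)
    if left >= 0:
        allowed &= k_idx >= q_idx - left
    if right >= 0:
        allowed &= k_idx <= q_idx + right
    return allowed
```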
Hello, we noticed training instabilities in some of our FP8 trainings that use `transformer_engine.pytorch.Linear` to build a custom transformer stack. Looking into the checkpoints, we noticed that `scaling_bwd` deviates between...
Hello team, We typically use `gather_all_token_logits` to collect the logit tensors for post-processing. Especially for large vocabulary sizes (128 000), this can require a lot of GPU memory. For example,...
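To give a rough sense of the scale involved, a back-of-the-envelope estimate; only the vocabulary size comes from the text above, while the batch size, sequence length, and dtype are illustrative assumptions.

```python
# Rough memory footprint of gathering all token logits on the GPU.
vocab_size = 128_000
batch_size = 8        # assumed
seq_len = 2_048       # assumed
bytes_per_elem = 2    # fp16, assumed

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_elem
print(f"{logits_bytes / 2**30:.1f} GiB")  # ~3.9 GiB for a single batch of logits
```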
# Description
Hello, we noticed a few issues with encoder-decoder models that use cross attention. I would like to suggest the following changes to fix them.

## Type of change...
Hello team, we have been debugging large scale training instabilities with FP8 and noticed that these started when updating from transformer-engine v1.2.1 to v1.7. Taking a closer look at the...
Hello team, we noticed training instabilities when combining FP8 and activation checkpointing with `transformer_engine.pytorch.checkpoint`. When taking a closer look at this, we got the feeling that the FP8 scales in...