Marks101

Results: 10 issues by Marks101

Dear Triton team, I am currently debugging an issue with a kernel that is supposed to replace torch.roll followed by zeroing out the first row of a 2D matrix. This...
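
The reference operation this kernel is meant to replace can be written directly in PyTorch. A minimal sketch of that reference, where the shift amount and dimension are assumptions since the preview is truncated:

```python
import torch

def roll_and_zero_first_row(x: torch.Tensor) -> torch.Tensor:
    # Reference behaviour the custom kernel should match:
    # roll the rows of a 2D matrix, then clear the first row.
    # shifts=1, dims=0 are assumed values for illustration only.
    out = torch.roll(x, shifts=1, dims=0)
    out[0, :] = 0
    return out

x = torch.arange(12, dtype=torch.float32).reshape(4, 3)
print(roll_and_zero_first_row(x))
```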

Hello team, we are training our GPT-style models to produce additional outputs besides the logit tensor. One use case for us is token-wise classification. Another use case is attention...

feature request
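
As a rough illustration of the request above (extra outputs next to the logit tensor), a head for token-wise classification could sit on the shared hidden states. Everything in this sketch is hypothetical naming, not an existing API:

```python
import torch
import torch.nn as nn

class GPTWithTokenClassifier(nn.Module):
    """Illustrative only: a GPT-style backbone that returns token-wise
    classification logits in addition to the usual LM logits."""

    def __init__(self, backbone: nn.Module, hidden_size: int,
                 vocab_size: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # yields hidden states [batch, seq, hidden]
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.token_classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(input_ids)
        lm_logits = self.lm_head(hidden)              # standard next-token logits
        class_logits = self.token_classifier(hidden)  # additional token-wise output
        return lm_logits, class_logits
```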

### System Info
- H100 DGX
- CUDA 12.1
- TensorRT-LLM 0.10.0.dev2024041600

### Who can help?
@byshiue

### Information
- [X] The official example scripts
- [ ] My own...

bug
triaged

Hi team, I am wondering about the definition of the attention mask in transformer-engine. I did not find an explanation in the docs. Does True mean that the position takes...
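
A mask convention can be pinned down empirically by blocking a single known position and checking which output rows change. The sketch below does this with PyTorch's `scaled_dot_product_attention`, where a boolean `attn_mask` of `True` means "may attend"; whether transformer-engine follows the same or the inverted convention is exactly what the question above asks, so this is only a probe, not an answer:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 4, 8)   # [batch, heads, seq, head_dim]
k = torch.randn(1, 1, 4, 8)
v = torch.randn(1, 1, 4, 8)

mask = torch.ones(4, 4, dtype=torch.bool)
mask[0, 3] = False            # query 0 must not attend to key 3 (True = "may attend" here)

full = F.scaled_dot_product_attention(q, k, v)
masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Only row 0 of the output changes, which reveals the convention in use.
print((full - masked).abs().amax(dim=-1))
```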

Hi transformer-engine team, we noticed that in a decoder layer with `self_attn_mask_type="causal"`, the following line https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/transformer.py#L594 sets window_size = (-1, 0). This window_size is passed to both self attention as...

Hello, we noticed training instabilities in some of our FP8 trainings that use `transformer_engine.pytorch.Linear` to build a custom transformer stack. Looking into the checkpoints, we noticed that `scaling_bwd` deviates between...
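
To narrow down where the deviation comes from, one could diff the FP8 state between two saves. A minimal sketch, assuming the checkpoints load as flat dicts of tensors and that the relevant entries carry `scaling_bwd` in their key; the file names are placeholders:

```python
import torch

ckpt_a = torch.load("save_a.pt", map_location="cpu")  # placeholder paths
ckpt_b = torch.load("save_b.pt", map_location="cpu")

# Compare every entry whose key mentions scaling_bwd; the flat-dict layout is an assumption.
for key, value in ckpt_a.items():
    if "scaling_bwd" in key and key in ckpt_b:
        if not torch.allclose(value, ckpt_b[key]):
            diff = (value - ckpt_b[key]).abs().max().item()
            print(f"{key}: max abs deviation {diff:.3e}")
```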

Hello team, we typically use `gather_all_token_logits` to collect the logit tensors for post-processing. Especially for large vocabulary sizes (128,000) this can require a lot of GPU memory. For example,...

feature request
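
A quick back-of-the-envelope estimate shows why gathering full logits gets expensive at this vocabulary size; batch size, sequence length, and FP16 storage are assumed for illustration:

```python
# Size of a full logits tensor [batch, seq_len, vocab] in float16.
batch, seq_len, vocab = 8, 4096, 128_000   # batch and seq_len are illustrative assumptions
bytes_per_elem = 2                         # float16
gib = batch * seq_len * vocab * bytes_per_elem / 1024**3
print(f"{gib:.1f} GiB")                    # ~7.8 GiB for the logits of one batch alone
```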

# Description
Hello, we noticed a few issues for encoder-decoder models with cross-attention. I would like to suggest the following changes to fix them.

## Type of change...

Hello team, we have been debugging large-scale training instabilities with FP8 and noticed that these started when updating from transformer-engine v1.2.1 to v1.7. Taking a closer look at the...

Hello team, we noticed training instabilities when combining FP8 and activation checkpointing with `transformer_engine.pytorch.checkpoint`. When taking a closer look at this, we got the feeling that the FP8 scales in...
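
For context, the combination described above looks roughly like the sketch below. It uses `torch.utils.checkpoint.checkpoint` as a stand-in for `transformer_engine.pytorch.checkpoint` (whose exact signature is not reproduced here), so it only illustrates the shape of the setup; the suspicion in the issue is that the FP8 scales seen by the original forward and the recomputed forward diverge:

```python
import torch
import transformer_engine.pytorch as te
from torch.utils.checkpoint import checkpoint  # stand-in for transformer_engine.pytorch.checkpoint

layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(32, 1024, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True):
    # The checkpointed forward is re-run during backward; that recompute is
    # where the FP8 scaling factors could drift from a non-checkpointed run.
    out = checkpoint(layer, inp, use_reentrant=False)

out.sum().backward()
```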