mengzhangyuan

Results: 9 issues by mengzhangyuan

Hello, I think that if you want the additive attention to be able to handle batched inputs, with inputs like these: Inputs: query, value - **query** (batch_size, q_len, hidden_dim): tensor containing...
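For reference, a minimal sketch of batched additive (Bahdanau-style) attention under the shapes named above; the module and parameter names here are illustrative, not the repository's own code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Batched additive (Bahdanau-style) attention sketch."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_v = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, query, value):
        # query: (batch_size, q_len, hidden_dim)
        # value: (batch_size, v_len, hidden_dim)
        # Broadcast every query position against every value position.
        q = self.w_q(query).unsqueeze(2)   # (batch, q_len, 1, hidden)
        v = self.w_v(value).unsqueeze(1)   # (batch, 1, v_len, hidden)
        scores = self.score(torch.tanh(q + v)).squeeze(-1)  # (batch, q_len, v_len)
        attn = F.softmax(scores, dim=-1)
        context = torch.bmm(attn, value)   # (batch, q_len, hidden_dim)
        return context, attn
```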

Hello, I am reading the code that generates the alibi_mask at https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py. For the code on lines 760 and 761: self.alibi = self.slopes.unsqueeze(1).unsqueeze(1) * torch.arange(maxpos).unsqueeze(0).unsqueeze(0).expand(attn_heads, -1, -1) # line 760 self.alibi...
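To see what the quoted line builds, here is a runnable sketch: a per-head slope multiplied by a 0..maxpos-1 distance ramp. The `get_slopes` helper below is simplified to the power-of-two head count case from the ALiBi paper; it is illustrative, not the repository's exact helper.

```python
import torch

def get_slopes(n_heads: int):
    """Geometric slope sequence 2^(-8/n), 2^(-16/n), ... for power-of-two head counts."""
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

attn_heads, maxpos = 8, 16
slopes = get_slopes(attn_heads)  # (heads,)
# The quoted line 760 in spirit: slope per head times position distance.
alibi = slopes.unsqueeze(1).unsqueeze(1) * \
    torch.arange(maxpos).unsqueeze(0).unsqueeze(0).expand(attn_heads, -1, -1)
print(alibi.shape)  # torch.Size([8, 1, 16]): one linear bias row per head
```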

question

Hi, I am testing how the micro-batch size influences per-GPU throughput with a constant global batch size. The results show that as the micro-batch size increases, the throughput per GPU (TFLOP/s/GPU) also increases...
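The arithmetic behind that experiment, as a sketch (the function name and example numbers are illustrative): with a fixed global batch size, a larger micro-batch means fewer gradient-accumulation steps, so per-step launch and communication overhead is amortized over larger kernels.

```python
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel_size: int) -> int:
    # Megatron-style relation: global = micro * grad_acc_steps * DP size.
    assert global_batch % (micro_batch * data_parallel_size) == 0
    return global_batch // (micro_batch * data_parallel_size)

# With global_batch=512 and DP=8: micro-batch 1 -> 64 accumulation steps,
# micro-batch 4 -> 16 steps; fewer, larger kernels generally raise TFLOP/s/GPU.
print(accumulation_steps(512, 1, 8), accumulation_steps(512, 4, 8))
```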

Hello, I noticed that Megatron-LM now supports TikTokenizer via the **--tokenizer-type** argument, but I do not know what I should set for **--tokenizer-model**. I have checked the source code...

Hello, I noticed that the llama3 tokenizer loaded with hf transformers.AutoTokenizer only adds a token when calling the encode function. May I ask: during llama3 pretraining, which behavior is used?...
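A quick way to observe the behavior in question (a sketch; the `meta-llama/Meta-Llama-3-8B` repo id is an assumption, and the model is gated so it requires access). The extra token should be the BOS token, since HF encode defaults to `add_special_tokens=True`.

```python
from transformers import AutoTokenizer

# Assumed repo id for the llama3 tokenizer.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
with_special = tok.encode("hello world")  # default: add_special_tokens=True
without_special = tok.encode("hello world", add_special_tokens=False)
# Expectation: one extra leading id, the BOS token, in the first encoding.
print(len(with_special) - len(without_special), with_special[0] == tok.bos_token_id)
```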

Hello all, as far as I know the llama3 tokenizer is based on byte-level BPE, but I cannot find the relationship between the token_id and the (0-255) byte map. For example, with...
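For context, here is a sketch of the GPT-2-style `bytes_to_unicode` map that byte-level BPE vocab files are commonly written in, so that every raw byte 0-255 has a printable form; whether llama3's tiktoken-based vocabulary uses this indirection or raw bytes directly is exactly the kind of detail being asked about here.

```python
def bytes_to_unicode():
    """GPT-2's reversible byte -> printable-unicode map: printable bytes map
    to themselves, the rest are shifted past 255 so every byte is visible."""
    bs = list(range(ord("!"), ord("~") + 1)) \
       + list(range(ord("\xa1"), ord("\xac") + 1)) \
       + list(range(ord("\xae"), ord("\xff") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

m = bytes_to_unicode()
print(m[32])  # the space byte 0x20 appears as 'Ġ' in such vocab files
```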

Hello, the code at https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/common/embeddings/rope_utils.py#L116 shows that when calling the _apply_rotary_pos_emb_bshd function, the behavior for MLA differs from normal GQA or MHA. The code shows that for MLA, there are...
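Without asserting what the truncated part of this issue says, the usual way rotary variants differ is in how dimensions are paired before rotation. A sketch of the two common `rotate_half` conventions, with illustrative names (not Megatron-LM's exact code):

```python
import torch

def rotate_half_blocked(x):
    # Pairs dimension i with i + d/2 (the common GPT-NeoX-style layout).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rotate_half_interleaved(x):
    # Pairs adjacent even/odd dimensions (2i, 2i+1), the original RoPE layout.
    x1 = x[..., 0::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rope(t, cos, sin, interleaved=False):
    # cos/sin must be laid out to match the chosen pairing convention.
    rot = rotate_half_interleaved if interleaved else rotate_half_blocked
    return t * cos + rot(t) * sin
```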

**Describe the bug** When recompute is performed in the MoE layer, the code at https://github.com/NVIDIA/Megatron-LM/blob/f715dd857be63ca6811577baf2192f13211e5216/megatron/core/transformer/moe/router.py#L251 causes "save_to_aux_losses_tracker" to be called twice, which results in doubled load_balancing_loss values being recorded in the logs. It should skip...
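A hedged sketch of the kind of guard this report suggests: skip the tracker call during the recompute forward pass so each step is logged once. The `in_recompute_pass` flag and `save_fn` callable below are hypothetical stand-ins; an actual fix would hook into whatever state Megatron-LM's activation checkpointing exposes.

```python
def track_once(save_fn, name, loss, layer_number, num_layers, in_recompute_pass: bool):
    """Record the load-balancing loss exactly once per training step.

    During activation recomputation the router forward runs a second time;
    logging in that pass would double every load_balancing_loss entry.
    `in_recompute_pass` is a hypothetical flag meaning "we are inside the
    recompute re-forward"; `save_fn` stands in for save_to_aux_losses_tracker.
    """
    if in_recompute_pass:
        return
    save_fn(name, loss, layer_number, num_layers)
```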

stale

Hello, I am training a MoE model (16B total parameters, 2.5B activated), and below are some TensorBoard logs. **grad norm** ![Image](https://github.com/user-attachments/assets/d9be9c3b-ab20-4d65-9d65-7ef052e3f657) ![Image](https://github.com/user-attachments/assets/55568b17-5592-4e04-8b70-95f4dade3f01) **LM loss** ![Image](https://github.com/user-attachments/assets/e9e8ac5f-5acf-42b5-941b-668b274e7084) ![Image](https://github.com/user-attachments/assets/31584462-dd44-44b9-884c-9da3a4233147) **load_balance_loss** ![Image](https://github.com/user-attachments/assets/614bf749-9ba4-4d78-a8b4-c284bd41b251) ![Image](https://github.com/user-attachments/assets/7ddc1b2a-3f3f-44f9-9059-d14e0c58f34d) As...

stale