mengzhangyuan
Hello, I think if you want the additive attention to be able to handle batched inputs, where the inputs look like these: Inputs: query, value - **query** (batch_size, q_len, hidden_dim): tensor containing...
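A minimal sketch of how a batched additive (Bahdanau-style) attention with those shapes could look; the class name and exact projections are my own assumptions, not the original code:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Batched additive (Bahdanau-style) attention sketch."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.w_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_k = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, query, value):
        # query: (batch_size, q_len, hidden_dim), value: (batch_size, v_len, hidden_dim)
        # Broadcast so every query position is scored against every value position.
        score = self.v(torch.tanh(
            self.w_q(query).unsqueeze(2) + self.w_k(value).unsqueeze(1)
        )).squeeze(-1)                        # (batch_size, q_len, v_len)
        attn = torch.softmax(score, dim=-1)
        context = torch.bmm(attn, value)      # (batch_size, q_len, hidden_dim)
        return context, attn

q, v = torch.randn(4, 5, 32), torch.randn(4, 7, 32)
context, attn = AdditiveAttention(32)(q, v)
print(context.shape, attn.shape)  # (4, 5, 32) and (4, 5, 7)
```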
Hello, I am reading the code for generating the alibi_mask at https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py, specifically lines 760 and 761: `self.alibi = self.slopes.unsqueeze(1).unsqueeze(1) * torch.arange(maxpos).unsqueeze(0).unsqueeze(0).expand(attn_heads, -1, -1)` # line 760 self.alibi...
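For context, a self-contained sketch of what those tensor operations build; the slope formula below is the power-of-two-head-count case from the ALiBi paper, so treat this as my reconstruction rather than the repository's exact code:

```python
import torch

def build_alibi(attn_heads, maxpos):
    # Geometric head slopes: 2^(-8/h), 2^(-16/h), ... (power-of-two head counts).
    start = 2 ** (-8 / attn_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(attn_heads)])
    # (attn_heads, 1, 1) * (attn_heads, 1, maxpos) -> (attn_heads, 1, maxpos):
    # one linearly increasing bias row per head, later broadcast over query positions.
    alibi = slopes.unsqueeze(1).unsqueeze(1) * (
        torch.arange(maxpos).unsqueeze(0).unsqueeze(0).expand(attn_heads, -1, -1)
    )
    return alibi

print(build_alibi(8, 5).shape)  # torch.Size([8, 1, 5])
```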
Hi, I am testing how the micro-batch size influences per-GPU throughput at a constant global batch size. The results show that as the micro-batch size increases, the per-GPU throughput (TFLOP/s/GPU) also increases....
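One common explanation is that with a fixed global batch size, a larger micro-batch means fewer but larger kernels per step, which usually improves GPU utilization. A toy timing sketch (not Megatron code; the shapes and sizes are arbitrary assumptions) that illustrates the effect:

```python
import time
import torch

def time_gemms(micro_bsz, global_bsz=64, seq=1024, hidden=4096):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    w = torch.randn(hidden, hidden, device=device)
    x = torch.randn(micro_bsz * seq, hidden, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    # Same total work either way: global_bsz samples per "step",
    # split into global_bsz // micro_bsz accumulation sub-steps.
    for _ in range(global_bsz // micro_bsz):
        _ = x @ w
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - t0

for mbs in (1, 4, 16):
    print(mbs, round(time_gemms(mbs), 4))  # larger micro-batches usually finish sooner
```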
Hello, I noticed that ML now supports TikTokenizer via the **--tokenizer-type** argument, but I do not know what I should set for **--tokenizer-model**. I have checked the source code...
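In case it helps: llama3 distributes its tokenizer as a tiktoken BPE ranks file, so my guess (an assumption, not verified against the Megatron-LM source) is that **--tokenizer-model** should point at such a file. Loading one directly with the tiktoken library looks roughly like this; the file path is hypothetical, and the split regex is the one I believe Meta's llama3 reference tokenizer uses:

```python
import tiktoken
from tiktoken.load import load_tiktoken_bpe

ranks = load_tiktoken_bpe("tokenizer.model")  # hypothetical local path
enc = tiktoken.Encoding(
    name="llama3",
    # Pre-tokenization split pattern, as published in Meta's llama3 reference code.
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=ranks,
    special_tokens={},  # a real setup would register <|begin_of_text|> etc.
)
print(enc.encode("hello world"))
```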
Hello, I noticed that the llama3 tokenizer loaded with the HF transformers.AutoTokenizer only adds a token when the encode function is called. May I ask which behavior is used during llama3 pretraining?...
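A quick way to see the two encode behaviors side by side (the model id below is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(tok.encode("hello world"))                            # BOS token prepended by default
print(tok.encode("hello world", add_special_tokens=False))  # raw content tokens only
```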
Hello all, as far as I know the llama3 tokenizer is based on byte-level BPE, but I cannot find the relationship between the token_id and the (0-255) byte map. For example, with...
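A hedged sketch of how the byte map can be recovered: in a tiktoken-style byte-level BPE, each of the 256 single bytes must itself be a base token, so the byte-to-id map can be read straight out of the ranks table (the file path is an assumption):

```python
from tiktoken.load import load_tiktoken_bpe

ranks = load_tiktoken_bpe("tokenizer.model")  # hypothetical local path
# Every single byte is a mergeable token; its rank is its token id.
byte_to_id = {b: ranks[bytes([b])] for b in range(256)}
print(byte_to_id[0x68], byte_to_id[0x20])  # token ids of the bytes 'h' and ' '
```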
Hello, the code at https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/common/embeddings/rope_utils.py#L116 shows that when calling the `_apply_rotary_pos_emb_bshd` function, the behavior for MLA is different from normal GQA or MHA. The code shows that for MLA, there are...
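From my reading, the MLA branch first de-interleaves the last dimension (even indices, then odd indices) before the usual rotate_half-style rotation, while GQA/MHA rotate the tensor as-is. A small sketch of just that permutation (my reconstruction, not the Megatron source):

```python
import torch

def deinterleave_last_dim(t):
    # [d0, d1, d2, d3, ...] -> [d0, d2, ..., d1, d3, ...]
    return torch.cat((t[..., 0::2], t[..., 1::2]), dim=-1)

t = torch.arange(8).view(1, 1, 1, 8)  # toy (seq, batch, head, dim) tensor
print(deinterleave_last_dim(t))       # tensor([[[[0, 2, 4, 6, 1, 3, 5, 7]]]])
```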
**Describe the bug** When doing recompute in the MoE layer, the code at https://github.com/NVIDIA/Megatron-LM/blob/f715dd857be63ca6811577baf2192f13211e5216/megatron/core/transformer/moe/router.py#L251 causes `save_to_aux_losses_tracker` to be called twice, which results in doubled load_balancing_loss values being recorded in the logs. Should skip...
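A minimal repro of the mechanism (generic PyTorch, not the Megatron router itself): activation checkpointing reruns the forward during backward, so any side effect inside it, such as pushing a value into an aux-loss tracker, fires twice.

```python
import torch
from torch.utils.checkpoint import checkpoint

calls = {"n": 0}

def layer(x):
    calls["n"] += 1  # stand-in for save_to_aux_losses_tracker(...)
    return x * 2

x = torch.ones(2, requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()  # backward triggers the recompute
print(calls["n"])   # 2: the aux loss would be recorded twice
```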
Hello, I am training a MoE model (16B total and 2.5B activated) and below are some TensorBoard logs: [plots: **grad norm**, **lm loss**, **load_balance_loss**] as...