Daize Dong
Update the ICML 2024 paper "A Graph is Worth K Words: Euclideanizing Graph using Pure Transformer".
## Description
- Updated the computation of `router_logits` in the MoE gate to use `FP32` instead of the default `BF16` to enhance numerical stability.
- Ensured that activation functions (`softmax`...
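To illustrate why the precision of `router_logits` matters, here is a minimal sketch that simulates `BF16` and `FP32` rounding with the Python standard library only. It is not the code from this PR: the helpers `to_bf16`, `to_fp32`, `softmax`, and `top_k`, as well as the example logit values, are assumptions made up for the demonstration. It shows two nearby logits collapsing to the same value in `BF16` while staying distinct in `FP32`.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (round-to-nearest-even).

    BF16 keeps the top 16 bits of the FP32 bit pattern, so we round away
    the low 16 bits. NaN/inf handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Indices of the k largest probabilities (ties keep original order)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


# Hypothetical router logits for four experts. Experts 0 and 1 differ by
# 0.002, which is below the BF16 spacing near 2.0 (2 / 128 = 0.015625).
raw_logits = [2.001, 2.003, 1.5, -0.25]

logits_fp32 = [to_fp32(v) for v in raw_logits]
logits_bf16 = [to_bf16(v) for v in raw_logits]

probs_fp32 = softmax(logits_fp32)
probs_bf16 = softmax(logits_bf16)

print("FP32 logits:", logits_fp32)
print("BF16 logits:", logits_bf16)  # experts 0 and 1 both round to 2.0
print("FP32 top-1 expert:", top_k(probs_fp32, 1))
print("BF16 top-1 expert:", top_k(probs_bf16, 1))
```

With `BF16` logits, experts 0 and 1 become indistinguishable, so the top-1 choice is decided by tie-breaking rather than by the gate's actual preference. Keeping the logits in `FP32` preserves the ordering.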
Updated the deprecated artifact so that the actions can run normally. I also updated other actions to their latest versions to avoid similar exceptions in the future.
### ⚠️ Please check that this feature request hasn't been suggested before.
- [x] I searched previous [Ideas in Discussions](https://github.com/axolotl-ai-cloud/axolotl/discussions/categories/ideas) and didn't find any similar feature requests.
- [x] I searched...
### System Info
Today, when I tried to run `examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh`, a strange error occurred: ``` Could not override 'actor_rollout_ref.rollout.n'. To append to your config use +actor_rollout_ref.rollout.n=16 Key 'n' is not...
### What does this PR do?
This PR removes the deprecated arguments passed when building the Megatron optimizer, for compatibility with the latest Megatron; see [https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/optimizer/__init__.py#L442](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/optimizer/__init__.py#L442). These arguments are never used by...