gpt-oss implementation
Description
This issue tracks the current status of the gpt-oss features that need to be implemented in Megatron Core, leveraging Transformer Engine.
✅ UPDATE: All core GPT-OSS functionality is now available in Megatron Core (training) and Megatron Bridge (checkpoint conversion).
MoE Layer
Enabled Bias
- Status: ✅ Supported
- Implementation: Available in main branch: https://github.com/NVIDIA/Megatron-LM/pull/2038
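For reference, a minimal configuration sketch of what enabling biases looks like on the Megatron Core side. The field names come from the public `TransformerConfig`, but the gpt-oss-like sizes and the MoE settings below are illustrative assumptions, not the exact recipe from the PR:

```python
# Hypothetical sketch: biases in the dense/expert linear layers are controlled
# via TransformerConfig. Sizes are illustrative, not the official gpt-oss config.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=24,
    hidden_size=2880,         # gpt-oss-20b-like hidden size, for illustration
    num_attention_heads=64,
    num_moe_experts=32,       # expert count, illustration only
    moe_router_topk=4,
    add_bias_linear=True,     # keep biases in the linear / expert projections
)
```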
Attention Mechanisms
Alternating Sliding-Window Attention Pattern
- Status: ✅ Supported - Infrastructure exists for per-layer patterns and sliding window attention using TE
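As a rough illustration of the alternating pattern, here is a small pure-Python helper (not the Megatron/TE interface) that builds per-layer `(left, right)` window tuples in the TE convention, where `(-1, 0)` denotes full causal attention; the even/odd layer assignment and the 128-token window are assumptions for the sketch:

```python
# Conceptual sketch only: build a per-layer window pattern for an alternating
# sliding-window / full-attention stack.
def build_window_pattern(num_layers: int, window: int = 128):
    pattern = []
    for layer_idx in range(num_layers):
        if layer_idx % 2 == 0:
            pattern.append((window, 0))  # sliding window over previous `window` tokens
        else:
            pattern.append((-1, 0))      # full causal attention
    return pattern

print(build_window_pattern(4))
# [(128, 0), (-1, 0), (128, 0), (-1, 0)]
```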
Attention Sinks
- Status: ✅ Implemented - in Transformer Engine and cuDNN
- Reference: Streaming LLM
- Related Transformer Engine PR: https://github.com/NVIDIA/TransformerEngine/pull/2148
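A conceptual PyTorch sketch of the attention-sink mechanism (not the fused TE/cuDNN kernel API): each head learns a sink logit that joins the softmax and absorbs probability mass without contributing a value vector:

```python
import torch

def attention_with_sinks(q, k, v, sink_logits):
    # q, k, v: [batch, heads, seq, dim]; sink_logits: [heads]
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    causal = torch.ones(q.shape[2], k.shape[2], dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    # Append the sink logit as an extra "key" every query can attend to.
    sink = sink_logits.view(1, -1, 1, 1).expand(q.shape[0], -1, q.shape[2], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # Drop the sink column: its mass only rescales the remaining weights.
    return torch.einsum("bhqk,bhkd->bhqd", probs[..., :-1], v)

q = k = v = torch.randn(1, 2, 8, 16)
out = attention_with_sinks(q, k, v, sink_logits=torch.zeros(2))
```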
Activation Functions
Custom SwiGLU with Clamping
- Status: ✅ Supported
- Implementation: Megatron Core added a partially fused version as "custom quick GeGLU"; an FP8-aware fused kernel has been merged into Transformer Engine
- Related Transformer Engine PR: https://github.com/NVIDIA/TransformerEngine/pull/2161
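For reference, an unfused PyTorch sketch of the clamped SwiGLU variant that the fused kernels implement; the alpha/limit constants follow the public gpt-oss reference code and should be treated as illustrative rather than normative:

```python
import torch

def clamped_swiglu(x_glu, x_linear, alpha: float = 1.702, limit: float = 7.0):
    x_glu = x_glu.clamp(max=limit)                    # cap the gate branch
    x_linear = x_linear.clamp(min=-limit, max=limit)  # bound the linear branch
    return x_glu * torch.sigmoid(alpha * x_glu) * (x_linear + 1)

gate, lin = torch.randn(4, 2880), torch.randn(4, 2880)
y = clamped_swiglu(gate, lin)
```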
Positional Encodings
YaRN RoPE Scaling
- Status: ✅ Fully Supported
- Implementation:
- [x] YaRN scaling to 128k+ context
- [x] Integration with existing RoPE
- [x] YaRN for general RoPE/GPT models
- [x] Convergence validation
- Usage: `--position-embedding-type yarn` with YaRN configuration parameters
- Reference: arXiv:2309.00071
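A conceptual sketch of the YaRN frequency correction (following arXiv:2309.00071) that the `yarn` position-embedding type applies; the parameter names and defaults below (`scale`, `beta_fast`, `beta_slow`) are assumptions for illustration, not Megatron Core's exact interface:

```python
import math
import torch

def yarn_inv_freq(dim, base=10000.0, scale=32.0, orig_len=4096,
                  beta_fast=32.0, beta_slow=1.0):
    pos_freqs = base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    extrapolation = 1.0 / pos_freqs            # unscaled RoPE frequencies
    interpolation = 1.0 / (scale * pos_freqs)  # position-interpolated frequencies

    # Dimension index whose wavelength completes `n` rotations within the original context.
    def correction_dim(n):
        return dim * math.log(orig_len / (n * 2 * math.pi)) / (2 * math.log(base))

    low = max(correction_dim(beta_fast), 0)
    high = min(correction_dim(beta_slow), dim - 1)
    ramp = ((torch.arange(dim // 2, dtype=torch.float32) - low)
            / max(high - low, 1e-3)).clamp(0, 1)
    # Blend: high-frequency dims extrapolate, low-frequency dims interpolate.
    inv_freq = interpolation * ramp + extrapolation * (1 - ramp)
    mscale = 0.1 * math.log(scale) + 1.0       # attention temperature correction
    return inv_freq, mscale
```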
Megatron Bridge Support
Megatron Bridge provides full GPT-OSS integration:
- ✅ Checkpoint Conversion: Hugging Face ↔ Megatron format
- ✅ Pre-configured Providers: `GPTOSSProvider20B` and `GPTOSSProvider120B`
- ✅ Quantization Support: handles MXFP4 weight dequantization
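A rough usage sketch of the conversion path; `AutoBridge` and the method names below are assumptions about the Megatron Bridge API, so refer to the Megatron Bridge documentation or the PR #2383 scripts for the exact interface:

```python
# Illustrative sketch only: method names are assumptions, not a verified API.
from megatron.bridge import AutoBridge

# Load the Hugging Face gpt-oss checkpoint (MXFP4 weights dequantized on import).
bridge = AutoBridge.from_hf_pretrained("openai/gpt-oss-20b")

# Obtain a Megatron Core model provider for training/fine-tuning.
provider = bridge.to_megatron_provider()
```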
Megatron Bridge + Megatron-LM Example
PR https://github.com/NVIDIA/Megatron-LM/pull/2383 provides end-to-end example scripts covering checkpoint conversion (convert_mcore_bf16_checkpoint_from_hf.py) and training/fine-tuning (training_gptoss_20b_h100_bf16_fp8.sh).
Credits: @cuichenx for core implementation, @yiakwy-xpu-ml-framework-team for example scripts
So how can I train gpt-oss with this branch?
We have a guide in NeMo Framework (using Megatron Core): https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/gpt_oss.html.
Megatron-LM training scripts will be added soon.
How is the progress on gpt-oss support?
Hi, is there any plan to implement intra-doc masking (attention computed only within each document rather than over the full sequence length)?
Best regards
@cuichenx Any update on long seq optimisations?
Or for this branch to be merged into main?
Thanks for the contribution. Has this implementation already been finished, or is it close? I haven't seen any related argument for sliding window attention in megatron/training/argument.py.
@JimmyAwoe The GPT-OSS model is now provided in Megatron Bridge.
Model variants are now provided through Megatron Bridge.
I have added GPT-OSS training support here:
https://github.com/NVIDIA/Megatron-LM/pull/2383
@JimmyAwoe Simply try it out.