gpt-oss implementation
Description
This issue tracks the current status of the gpt-oss features that need to be implemented in Megatron Core, leveraging Transformer Engine.
✅ UPDATE: All core GPT-OSS functionality is now available in Megatron Core (training) and Megatron Bridge (checkpoint conversion).
MoE Layer
Enabled Bias
- Status: ✅ Supported
- Implementation: Available in main branch: https://github.com/NVIDIA/Megatron-LM/pull/2038
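For reference, a minimal configuration sketch of what enabling biases looks like on the Megatron Core side. The field names come from the public `TransformerConfig`, but the gpt-oss-like sizes and the MoE settings below are illustrative assumptions, not the exact recipe from the PR:

```python
# Hypothetical sketch: biases in the dense/expert linear layers are controlled
# via TransformerConfig. Sizes are illustrative, not the official gpt-oss config.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=24,
    hidden_size=2880,         # gpt-oss-20b-like hidden size, for illustration
    num_attention_heads=64,
    num_moe_experts=32,       # expert count, illustration only
    moe_router_topk=4,
    add_bias_linear=True,     # keep biases in the linear / expert projections
)
```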
Attention Mechanisms
Alternating Sliding-Window Attention Pattern
- Status: ✅ Supported - Infrastructure exists for per-layer patterns and sliding window attention using TE
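As a rough illustration of the alternating pattern, here is a small pure-Python helper (not the Megatron/TE interface) that builds per-layer `(left, right)` window tuples in the TE convention, where `(-1, 0)` denotes full causal attention; the even/odd layer assignment and the 128-token window are assumptions for the sketch:

```python
# Conceptual sketch only: build a per-layer window pattern for an alternating
# sliding-window / full-attention stack.
def build_window_pattern(num_layers: int, window: int = 128):
    pattern = []
    for layer_idx in range(num_layers):
        if layer_idx % 2 == 0:
            pattern.append((window, 0))  # sliding window over previous `window` tokens
        else:
            pattern.append((-1, 0))      # full causal attention
    return pattern

print(build_window_pattern(4))
# [(128, 0), (-1, 0), (128, 0), (-1, 0)]
```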
Attention Sinks
- Status: ✅ Implemented - in Transformer Engine and cuDNN
- Reference: Streaming LLM
- Related Transformer Engine PR: https://github.com/NVIDIA/TransformerEngine/pull/2148
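A conceptual PyTorch sketch of the attention-sink mechanism (not the fused TE/cuDNN kernel API): each head learns a sink logit that joins the softmax and absorbs probability mass without contributing a value vector:

```python
import torch

def attention_with_sinks(q, k, v, sink_logits):
    # q, k, v: [batch, heads, seq, dim]; sink_logits: [heads]
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    causal = torch.ones(q.shape[2], k.shape[2], dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    # Append the sink logit as an extra "key" every query can attend to.
    sink = sink_logits.view(1, -1, 1, 1).expand(q.shape[0], -1, q.shape[2], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # Drop the sink column: its mass only rescales the remaining weights.
    return torch.einsum("bhqk,bhkd->bhqd", probs[..., :-1], v)

q = k = v = torch.randn(1, 2, 8, 16)
out = attention_with_sinks(q, k, v, sink_logits=torch.zeros(2))
```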
Activation Functions
Custom SwiGLU with Clamping
- Status: ✅ Supported
- Implementation: Megatron Core added a partially fused version as "custom quick GeGLU"; an FP8-aware fused kernel has been merged into Transformer Engine
- Related Transformer Engine PR: https://github.com/NVIDIA/TransformerEngine/pull/2161
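For reference, an unfused PyTorch sketch of the clamped SwiGLU variant that the fused kernels implement; the alpha/limit constants follow the public gpt-oss reference code and should be treated as illustrative rather than normative:

```python
import torch

def clamped_swiglu(x_glu, x_linear, alpha: float = 1.702, limit: float = 7.0):
    x_glu = x_glu.clamp(max=limit)                    # cap the gate branch
    x_linear = x_linear.clamp(min=-limit, max=limit)  # bound the linear branch
    return x_glu * torch.sigmoid(alpha * x_glu) * (x_linear + 1)

gate, lin = torch.randn(4, 2880), torch.randn(4, 2880)
y = clamped_swiglu(gate, lin)
```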
Positional Encodings
YaRN RoPE Scaling
- Status: ✅ Fully Supported
- Implementation:
- [x] YaRN scaling to 128k+ context
- [x] Integration with existing RoPE
- [x] YaRN for general RoPE/GPT models
- [x] Convergence validation
- Usage: `--position-embedding-type yarn` with YaRN configuration parameters
- Reference: arXiv:2309.00071
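A conceptual sketch of the YaRN frequency correction (following arXiv:2309.00071) that the `yarn` position-embedding type applies; the parameter names and defaults below (`scale`, `beta_fast`, `beta_slow`) are assumptions for illustration, not Megatron Core's exact interface:

```python
import math
import torch

def yarn_inv_freq(dim, base=10000.0, scale=32.0, orig_len=4096,
                  beta_fast=32.0, beta_slow=1.0):
    pos_freqs = base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    extrapolation = 1.0 / pos_freqs            # unscaled RoPE frequencies
    interpolation = 1.0 / (scale * pos_freqs)  # position-interpolated frequencies

    # Dimension index whose wavelength completes `n` rotations within the original context.
    def correction_dim(n):
        return dim * math.log(orig_len / (n * 2 * math.pi)) / (2 * math.log(base))

    low = max(correction_dim(beta_fast), 0)
    high = min(correction_dim(beta_slow), dim - 1)
    ramp = ((torch.arange(dim // 2, dtype=torch.float32) - low)
            / max(high - low, 1e-3)).clamp(0, 1)
    # Blend: high-frequency dims extrapolate, low-frequency dims interpolate.
    inv_freq = interpolation * ramp + extrapolation * (1 - ramp)
    mscale = 0.1 * math.log(scale) + 1.0       # attention temperature correction
    return inv_freq, mscale
```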
Megatron Bridge Support
Megatron Bridge provides full GPT-OSS integration:
- ✅ Checkpoint Conversion: Hugging Face ↔ Megatron format
- ✅ Pre-configured Providers: `GPTOSSProvider20B` and `GPTOSSProvider120B`
- ✅ Quantization Support: handles MXFP4 weight dequantization
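A rough usage sketch of the conversion path; `AutoBridge` and the method names below are assumptions about the Megatron Bridge API, so refer to the Megatron Bridge documentation or the PR #2383 scripts for the exact interface:

```python
# Illustrative sketch only: method names are assumptions, not a verified API.
from megatron.bridge import AutoBridge

# Load the Hugging Face gpt-oss checkpoint (MXFP4 weights dequantized on import).
bridge = AutoBridge.from_hf_pretrained("openai/gpt-oss-20b")

# Obtain a Megatron Core model provider for training/fine-tuning.
provider = bridge.to_megatron_provider()
```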
Megatron Bridge + Megatron-LM Example
PR https://github.com/NVIDIA/Megatron-LM/pull/2383 provides end-to-end example scripts covering checkpoint conversion (convert_mcore_bf16_checkpoint_from_hf.py) and training/fine-tuning (training_gptoss_20b_h100_bf16_fp8.sh).
Credits: @cuichenx for core implementation, @yiakwy-xpu-ml-framework-team for example scripts
So how can I train gpt-oss with this branch?
We have a guide in NeMo Framework (using Megatron Core): https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/gpt_oss.html.
Megatron-LM training scripts will be added soon.
How is the progress on gpt-oss support?
Hi, is there any plan to implement intra-doc masking (attention computed only within each document rather than over the full sequence length)?
Best regards
@cuichenx Any update on long seq optimisations?
Or for this branch to be merged into main?
Thanks for the contribution. Has this implementation already been finished, or is it close? I haven't seen any related argument for sliding window attention in megatron/training/argument.py.
@JimmyAwoe The GPT-OSS model is now provided in Megatron Bridge.
Model variants are now provided through Megatron Bridge.
I have added GPT-OSS training support here:
https://github.com/NVIDIA/Megatron-LM/pull/2383
@JimmyAwoe Simply try it out.