Megatron-LM

gpt-oss implementation

Open sbhavani opened this issue 4 months ago • 8 comments

Description

This issue outlines the current status of gpt-oss features that need to be implemented in Megatron Core, leveraging Transformer Engine.

✅ UPDATE: All core GPT-OSS functionality is now available in Megatron Core (training) and Megatron Bridge (checkpoint conversion).

MoE Layer

Enabled Bias

  • Status: Supported
  • Implementation: Available in the main branch: https://github.com/NVIDIA/Megatron-LM/pull/2038 (a configuration sketch follows below)
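
As a rough illustration, this is how an MoE block with biased linear layers could be set up through Megatron Core's TransformerConfig. This is a minimal sketch, assuming the public config fields num_moe_experts, moe_router_topk, and add_bias_linear; the sizes loosely follow gpt-oss-20b and are not taken from the PR, so verify against your Megatron Core version.

```python
from megatron.core.transformer.transformer_config import TransformerConfig

# Sketch only: sizes loosely follow gpt-oss-20b (assumption, not from PR #2038).
config = TransformerConfig(
    num_layers=24,
    hidden_size=2880,
    num_attention_heads=64,
    kv_channels=64,          # head dim decoupled from hidden_size / num_heads
    num_moe_experts=32,      # MoE layer with 32 experts
    moe_router_topk=4,       # top-4 routing
    add_bias_linear=True,    # bias terms enabled, as in PR #2038
)
```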

Attention Mechanisms

Alternating Sliding-Window Attention Pattern

  • Status: Supported - Infrastructure exists for per-layer patterns and sliding window attention using TE (a pattern sketch follows below)
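
For intuition, here is a plain-Python sketch of the alternating pattern (not a Megatron Core or TE API). The 128-token window and the even/odd layer assignment are assumptions based on the public gpt-oss description; check the model config for the actual values.

```python
# Illustrative only: alternate sliding-window and full attention across layers.
NUM_LAYERS = 24
SLIDING_WINDOW = 128  # tokens of left context for the windowed layers (assumed)

def attention_window(layer_idx: int):
    """Return the sliding-window length for a layer, or None for full attention."""
    return SLIDING_WINDOW if layer_idx % 2 == 0 else None

pattern = [attention_window(i) for i in range(NUM_LAYERS)]
print(pattern[:4])  # [128, None, 128, None]
```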

Attention Sinks

  • Status: Implemented in Transformer Engine and cuDNN (a reference formulation is sketched after this list)
  • Reference: Streaming LLM
  • Related Transformer Engine PR: https://github.com/NVIDIA/TransformerEngine/pull/2148
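
As a reference formulation (not the fused TE/cuDNN kernel), an attention sink can be modeled as one learned per-head logit that joins the softmax normalization but contributes no value vector. A minimal PyTorch sketch, with shapes and names chosen for illustration:

```python
import torch

def attention_with_sinks(q, k, v, sink_logits):
    # q, k, v: [batch, heads, seq, head_dim]; sink_logits: [heads] (learned)
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    causal = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    # Append the sink logit as an extra "key" column before the softmax ...
    sink = sink_logits.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # ... then drop its column so it only soaks up probability mass.
    return torch.einsum("bhqk,bhkd->bhqd", probs[..., :-1], v)

b, h, s, d = 1, 2, 8, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
out = attention_with_sinks(q, k, v, sink_logits=torch.zeros(h))
```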

Activation Functions

Custom SwiGLU with Clamping

  • Status: Supported
  • Implementation: Megatron Core added a partially fused version as a "custom quick GeGLU" (an unfused reference sketch follows this list)

  • FP8-aware fused kernel: Merged into Transformer Engine
  • Related Transformer Engine PR: https://github.com/NVIDIA/TransformerEngine/pull/2161
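
For reference, here is a minimal unfused sketch of the clamped activation. The alpha = 1.702 quick-GELU constant, the limit of 7.0, and the gate/linear split convention are assumptions drawn from the public gpt-oss description; the fused kernels in the PRs above are the authoritative implementations.

```python
import torch

def clamped_swiglu(x: torch.Tensor, alpha: float = 1.702, limit: float = 7.0):
    gate, linear = x.chunk(2, dim=-1)             # split the fused FFN projection
    gate = gate.clamp(max=limit)                  # clamp the gated path from above
    linear = linear.clamp(min=-limit, max=limit)  # clamp the linear path both ways
    # Quick-GELU gate (x * sigmoid(alpha * x)) times the shifted linear path.
    return (gate * torch.sigmoid(alpha * gate)) * (linear + 1)

y = clamped_swiglu(torch.randn(4, 2 * 2880))      # -> shape [4, 2880]
```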

Positional Encodings

YaRN RoPE Scaling

  • Status: Fully Supported
  • Implementation:
    • [x] YaRN scaling to 128k+ context
    • [x] Integration with existing RoPE
    • [x] YaRN for general RoPE/GPT models
    • [x] Convergence validation
  • Usage: --position-embedding-type yarn with YaRN configuration parameters (a frequency-interpolation sketch follows this list)
  • Reference: arXiv:2309.00071
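
To make the mechanism concrete, below is a hedged sketch of YaRN's "NTK-by-parts" frequency interpolation following arXiv:2309.00071. Parameter names (scaling_factor, original_max_position, beta_fast, beta_slow) mirror common open-source implementations rather than Megatron Core's exact config names, and the defaults (4096 × 32 = 131072 ≈ 128k context) are assumptions based on the published gpt-oss config.

```python
import math
import torch

def yarn_inv_freq(dim, base=10000.0, scaling_factor=32.0,
                  original_max_position=4096, beta_fast=32.0, beta_slow=1.0):
    # Standard RoPE inverse frequencies for the rotary dimension.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

    # Dimension index at which a frequency completes `num_rotations` turns
    # over the original context length.
    def corr_dim(num_rotations):
        return (dim * math.log(original_max_position / (num_rotations * 2 * math.pi))
                ) / (2 * math.log(base))

    low = max(math.floor(corr_dim(beta_fast)), 0)
    high = min(math.ceil(corr_dim(beta_slow)), dim // 2 - 1)

    # Ramp: 0 for high-frequency dims (keep original frequency), 1 for
    # low-frequency dims (fully interpolate by 1/scaling_factor), linear between.
    ramp = ((torch.arange(dim // 2).float() - low) / max(high - low, 1e-3)).clamp(0, 1)
    return inv_freq * (1 - ramp) + (inv_freq / scaling_factor) * ramp

inv_freq = yarn_inv_freq(dim=64)
mscale = 0.1 * math.log(32.0) + 1.0  # attention-logit scaling suggested by the paper
```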

Megatron Bridge Support

Megatron Bridge provides full GPT-OSS integration:

  • Checkpoint Conversion: Hugging Face ↔ Megatron format
  • Pre-configured Providers: GPTOSSProvider20B and GPTOSSProvider120B
  • Quantization Support: Handles MXFP4 weight dequantization (illustrated below)
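
For context on the quantization bullet, here is a hedged sketch of what MXFP4 dequantization involves under the OCP Microscaling format: blocks of 32 FP4 (E2M1) values sharing one E8M0 power-of-two scale. Megatron Bridge's actual conversion code may pack nibbles and organize blocks differently; this only illustrates the numerics.

```python
import torch

# The 16 FP4 (E2M1) code points: sign bit, 2-bit exponent, 1-bit mantissa.
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mxfp4(codes, scales, block=32):
    # codes: uint8 4-bit code points, shape [n_blocks * block]
    # scales: uint8 E8M0 exponents (bias 127), one per block
    values = E2M1_VALUES[codes.long()].view(-1, block)
    return (values * torch.exp2(scales.float() - 127.0).unsqueeze(-1)).view(-1)

codes = torch.randint(0, 16, (64,), dtype=torch.uint8)  # two blocks of 32
scales = torch.tensor([127, 128], dtype=torch.uint8)    # scales of 1.0 and 2.0
weights = dequantize_mxfp4(codes, scales)
```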

Megatron Bridge + Megatron-LM Example

PR: https://github.com/NVIDIA/Megatron-LM/pull/2383 provides end-to-end example scripts covering checkpoint conversion (convert_mcore_bf16_checkpoint_from_hf.py) and training/fine-tuning (training_gptoss_20b_h100_bf16_fp8.sh).

Credits: @cuichenx for core implementation, @yiakwy-xpu-ml-framework-team for example scripts

sbhavani avatar Aug 11 '25 21:08 sbhavani

So how can I train gpt-oss with this branch?

Pikachu1412 avatar Aug 13 '25 02:08 Pikachu1412

We have a guide in NeMo Framework (using Megatron Core): https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/gpt_oss.html.

Megatron-LM training scripts will be added soon.

sbhavani avatar Aug 13 '25 20:08 sbhavani

How is gpt-oss support progressing?

liuqianchao avatar Oct 02 '25 00:10 liuqianchao

Hi, is there any plan to implement intra-document masking (attention computed only within each document rather than over the full sequence length)?

Best regards

jgcb00 avatar Oct 03 '25 13:10 jgcb00

@cuichenx Any update on long-sequence optimizations?

Or on merging this branch into main?

quantLm14 avatar Oct 09 '25 11:10 quantLm14

First of all, thanks for the contribution. Has this implementation been finished, or is it being closed? I haven't seen any related argument for sliding window attention in megatron/training/arguments.py.

JimmyAwoe avatar Oct 23 '25 09:10 JimmyAwoe

@JimmyAwoe The GPT-OSS model is now provided in Megatron Bridge.

Model variants are provided through Megatron Bridge these days.

I have added GPT-OSS training support here:

https://github.com/NVIDIA/Megatron-LM/pull/2383


@JimmyAwoe simply try it out