ms-swift [bugfix] Fix GKD with TRL >= 0.24 & GKD Liger

Fixes a bug where the GKD loss and gradient norm become unexpectedly large when using TRL >= 0.24

Before fix: (image not preserved)

After fix: (image not preserved)

Fix GKD Liger loss under student/teacher ZeRO-3

The Liger loss requires access to `student_head` and `teacher_head`. Under ZeRO-3 these parameters are sharded across devices, so they must be gathered before the loss is computed.
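As a rough illustration of the gathering step, the sketch below wraps DeepSpeed's `deepspeed.zero.GatheredParameters` context manager. The helper name `gathered_params` and the `zero3_enabled` flag are hypothetical and are not taken from the ms-swift change itself; the actual fix may be structured differently.

```python
from contextlib import nullcontext


def gathered_params(params, zero3_enabled):
    """Return a context manager under which `params` are fully materialized.

    Hypothetical helper: with ZeRO-3 each rank holds only a shard of every
    parameter, so head weights must be gathered before they are read.
    """
    if not zero3_enabled:
        # Parameters are already whole on every rank; nothing to gather.
        return nullcontext()
    # Imported lazily so the helper still works where DeepSpeed is absent.
    import deepspeed

    # modifier_rank=None: the gathered weights are only read, not modified.
    return deepspeed.zero.GatheredParameters(list(params), modifier_rank=None)


# Assumed usage (names are placeholders):
#
# with gathered_params(student_head.parameters(), student_zero3), \
#      gathered_params(teacher_head.parameters(), teacher_zero3):
#     loss = compute_liger_gkd_loss(...)
```

Gathering the student and teacher heads separately matters because either model may or may not be running under ZeRO-3.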

Fix unexpectedly large GKD Liger loss and gradient norm

The GKD Liger loss was observed to grow abnormally large, leading to unstable training. This is now resolved by normalizing the loss by the sequence length, ensuring consistent scale across different batch and sequence configurations.
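To see why an unnormalized loss grows with sequence length, consider the toy sketch below. It is not the ms-swift or Liger implementation: `token_jsd`, `summed_loss`, and `normalized_loss` are illustrative names, and the beta-weighting convention may differ from the one TRL uses.

```python
import math


def token_jsd(p, q, beta=0.5):
    """Beta-weighted Jensen-Shannon divergence for one token position."""
    mix = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL(a || b), skipping zero-probability entries of a.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return beta * kl(p, mix) + (1 - beta) * kl(q, mix)


def summed_loss(student, teacher):
    """Sum of per-token divergences: grows linearly with token count."""
    return sum(token_jsd(t, s) for s, t in zip(student, teacher))


def normalized_loss(student, teacher):
    """Same loss divided by the number of tokens (sequence length)."""
    return summed_loss(student, teacher) / len(student)


def make_sequence(dist, length):
    """Repeat one token distribution `length` times (toy data)."""
    return [list(dist) for _ in range(length)]


short_s, short_t = make_sequence([0.7, 0.3], 8), make_sequence([0.5, 0.5], 8)
long_s, long_t = make_sequence([0.7, 0.3], 512), make_sequence([0.5, 0.5], 512)

print(summed_loss(long_s, long_t) / summed_loss(short_s, short_t))
print(normalized_loss(short_s, short_t), normalized_loss(long_s, long_t))
```

With identical per-token distributions, the summed loss scales with the number of tokens while the normalized loss stays the same, which is the behaviour the fix aims for.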

See the logger output for details.

Update (confirmed): the Liger GKD loss is not normalized; see https://github.com/linkedin/Liger-Kernel/blob/v0.6.3/src/liger_kernel/chunked_loss/jsd_loss.py#L9-L39

Bump `math_verify` dependency version

Bump `math_verify` to the latest release. The current implementation is already aligned with that release.

Nov 19 '25 09:11 hjh0119

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a significant bug impacting the GKD trainer's stability when integrated with TRL library versions 0.24 and newer. The issue manifested as an unexpectedly large GKD loss and gradient norm, leading to potential training instability. The solution involves conditionally normalizing the GKD loss by dividing it by the sequence length, a fix applied only when the TRL version meets or exceeds 0.24, thereby restoring expected training dynamics and preventing numerical overflow.

Highlights

Bug Fix: Addresses a critical bug where the GKD (Generalized Knowledge Distillation) loss and gradient norm would become unexpectedly large when using TRL library versions 0.24 or higher.
Conditional Loss Scaling: Introduces a version-dependent scaling factor to the GKD loss calculation, specifically dividing the loss by the sequence length for TRL versions 0.24 and above.
TRL Version Compatibility: Ensures proper numerical stability and training behavior of the GKD trainer across different TRL library versions, particularly with newer releases.
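The version-dependent scaling described in the highlights could look roughly like the sketch below. The function names and the `(0, 24)` threshold tuple are assumptions made for illustration; the real change may detect the TRL version differently, for example from the installed package metadata.

```python
import re


def needs_seq_len_normalization(trl_version, threshold=(0, 24)):
    """Return True if `trl_version` is at or above the threshold release.

    Versions are compared as integer (major, minor) tuples, so "0.9.6"
    correctly sorts below "0.24.0", unlike a plain string comparison.
    """
    match = re.match(r"(\d+)\.(\d+)", trl_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {trl_version!r}")
    return (int(match.group(1)), int(match.group(2))) >= threshold


def scale_gkd_loss(loss, seq_len, trl_version):
    """Divide the loss by the sequence length only for newer TRL versions."""
    if needs_seq_len_normalization(trl_version):
        return loss / seq_len
    return loss


print(scale_gkd_loss(512.0, 256, "0.24.0"))
print(scale_gkd_loss(512.0, 256, "0.23.1"))
```

This sketch treats pre-release strings such as `0.24.0.dev0` as meeting the threshold, which differs from strict PEP 440 ordering.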

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Nov 19 '25 09:11 gemini-code-assist[bot]

/gemini review

Nov 19 '25 12:11 hjh0119
