
[WIP] Feature request: Implement GroupNormalization-21

Copilot opened this issue 8 months ago • 7 comments

Plan to implement ONNX GroupNormalization-21 support

After analyzing the codebase, I found that:

  1. GroupNormalization is currently only implemented as a contrib op (com.microsoft.GroupNorm)
  2. The standard ONNX GroupNormalization operator is NOT implemented in the CPU provider
  3. It's supported in the CoreML provider, confirming it's a valid ONNX operation
  4. The existing test uses opset 18 but only exercises the CoreML provider

Implementation Checklist:

  • [ ] Research ONNX GroupNormalization specification details
  • [ ] Create CPU GroupNormalization kernel implementation
  • [ ] Register GroupNormalization operation in CPU provider for opsets 18-21
  • [ ] Create or adapt GroupNormalization implementation logic
  • [ ] Update tests to cover CPU provider and opset 21
  • [ ] Test and validate the implementation

Approach:

  • Follow existing LayerNorm/InstanceNorm patterns for CPU provider implementation
  • Leverage existing normalization helper code where possible
  • Implement minimal changes to add standard ONNX support alongside existing contrib op

Fixes #24538.



Copilot avatar Jun 14 '25 15:06 Copilot

This is the op spec for your reference:

GroupNormalization - 21
Version
name: [GroupNormalization (GitHub)](https://github.com/onnx/onnx/blob/main/docs/Operators.md#GroupNormalization)

domain: main

since_version: 21

function: True

support_level: SupportType.COMMON

shape inference: False

This version of the operator has been available since version 21.

Summary
A GroupNormalization function. Carries out group normalization as described in the paper https://arxiv.org/abs/1803.08494

This operator transforms input according to

y = scale * (x - mean) / sqrt(variance + epsilon) + bias,
where the mean and variance are computed per instance per group of channels, and scale and bias should be specified for each channel. The number of groups num_groups should divide the number of channels C evenly, so that there are an equal number of channels per group.

The overall computation has two stages: the first stage normalizes the elements to have zero mean and unit variance for each instance in each group, and the second stage scales and shifts the results of the first stage. The floating-point precision used in the first stage is determined by the stash_type attribute. For example, if stash_type is 1, the operator casts all input variables to 32-bit float, performs the computation, and finally casts the normalized results back to the original type of X. The second stage does not depend on stash_type.

When the number of groups is the same as the number of channels, this operator is equivalent to InstanceNormalization. When there is only one group, this operator is equivalent to LayerNormalization.
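As a sketch of the formula above, here is a pure-Python reference computation (not the ORT kernel) over nested lists shaped (N, C, spatial), assuming num_groups divides C:

```python
import math

def group_normalization(x, scale, bias, num_groups, epsilon=1e-5):
    """Reference sketch of GroupNormalization: x is a nested list shaped
    (N, C, spatial); scale and bias have length C."""
    y = []
    for instance in x:                       # loop over batch dimension N
        c = len(instance)
        group_size = c // num_groups         # num_groups must divide C
        out = [None] * c
        for g in range(num_groups):
            chans = range(g * group_size, (g + 1) * group_size)
            # statistics are computed per instance, per group of channels
            vals = [v for ch in chans for v in instance[ch]]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            inv_std = 1.0 / math.sqrt(var + epsilon)
            for ch in chans:
                # y = scale * (x - mean) / sqrt(variance + epsilon) + bias
                out[ch] = [scale[ch] * (v - mean) * inv_std + bias[ch]
                           for v in instance[ch]]
        y.append(out)
    return y
```

With num_groups equal to C this reduces to per-channel (instance) normalization, and with num_groups equal to 1 it normalizes over all channels at once, matching the equivalences stated above.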

Attributes
epsilon - FLOAT (default is '1e-05'):

The epsilon value to use to avoid division by zero.

num_groups - INT (required) :

The number of groups of channels. It should be a divisor of the number of channels C.

stash_type - INT (default is '1'):

The floating-point precision used in stage one of the computation.

Inputs
X (heterogeneous) - T:

Input data tensor. Dimensions for image cases are (N x C x H x W), where N is the batch size, C is the number of channels, and H and W are the height and width of the data. Statistics are computed for every group of channels over C, H, and W. For non-image cases, the dimensions are in the form of (N x C x D1 x D2 ... Dn).

scale (heterogeneous) - T:

Scale tensor of shape (C).

bias (heterogeneous) - T:

Bias tensor of shape (C).

Outputs
Y (heterogeneous) - T:

The output tensor of the same shape as X.

Type Constraints
T in ( tensor(bfloat16), tensor(double), tensor(float), tensor(float16) ):

Constrain input and output types to float tensors.

[GroupNormalization - 18 vs 21](https://onnx.ai/onnx/operators/text_diff_GroupNormalization_18_21.html)

justinchuby avatar Jun 14 '25 16:06 justinchuby

@justinchuby Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 5e4b9df7-12c0-4e39-8582-7eec84f30bb3

Sorry for the inconvenience!

Copilot avatar Jun 14 '25 16:06 Copilot

@copilot try again

justinchuby avatar Jun 14 '25 22:06 justinchuby

@copilot try again

I've implemented the GroupNormalization-21 specification based on the details you provided. Key changes include:

  1. Proper stash_type implementation: When stash_type=1 (default), the first stage (mean/variance calculation) now uses float32 precision as specified, then casts back to original type for the second stage (scale/bias application).

  2. Added BFloat16 support: Extended type support to include tensor(bfloat16) as required by the specification.

  3. Fixed computation logic: Updated the implementation to exactly match the ONNX spec formula: y = scale * (x - mean) / sqrt(variance + epsilon) + bias

The implementation now properly handles the two-stage computation as described in the specification, where stage one precision is controlled by stash_type and stage two uses original type precision.
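The stash_type behaviour described above can be illustrated with a small pure-Python sketch. The helper names (to_fp32, to_fp16, group_stats_stashed) are hypothetical, and struct is used only to simulate float32/float16 rounding; this is not the ORT code:

```python
import struct

def to_fp32(v):
    # round a Python float (f64) to float32 precision
    return struct.unpack('f', struct.pack('f', v))[0]

def to_fp16(v):
    # round a Python float to float16 precision (the "original type" here)
    return struct.unpack('e', struct.pack('e', v))[0]

def group_stats_stashed(values):
    """Stage one with stash_type=1: cast the (simulated fp16) inputs to
    float32 and compute mean/variance in float32. The caller then applies
    scale/bias (stage two) back in the original type."""
    vals32 = [to_fp32(to_fp16(v)) for v in values]
    mean = to_fp32(sum(vals32) / len(vals32))
    var = to_fp32(sum(to_fp32((v - mean) ** 2) for v in vals32) / len(vals32))
    return mean, var
```

Stage two then computes scale * normalized + bias with fp16 rounding via to_fp16, independent of stash_type, as the spec requires.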

Copilot avatar Jun 14 '25 22:06 Copilot

@copilot do not try to build the project when you are iterating, because the build is going to take too long

justinchuby avatar Jun 14 '25 22:06 justinchuby

@copilot Also try to register the cuda kernels if you can

justinchuby avatar Jun 14 '25 22:06 justinchuby

@copilot do not try to build the project when you are iterating, because the build is going to take too long

Understood, I'll avoid building during iteration to save time. I'll focus on the code implementation and skip build validation during development.

@copilot Also try to register the cuda kernels if you can

I've added CUDA kernel support for GroupNormalization. Created:

  • CUDA GroupNorm implementation (group_norm.h/cc)
  • CUDA kernel implementation (group_norm_impl.h/cu)
  • Registered kernels in CUDA execution provider for opsets 18-21
  • Added support for float, double, MLFloat16, and BFloat16 types

The implementation follows the same two-stage computation pattern as the CPU version with proper stash_type support.

Copilot avatar Jun 14 '25 22:06 Copilot