VideoFrame support (WebCodecs)
Feature request
I noticed that the video examples use canvas.context.getImageData to get raw pixel values and construct a transformers.js RawImage. It might be possible to use VideoFrame instead.
A VideoFrame can be constructed from a canvas or raw pixel data, or decoded from a video file via WebCodecs. It can then be copied into a GPUTexture via copyExternalImageToTexture, which WebGPU can read directly, avoiding the CPU entirely. Alternatively, when using WASM, the raw pixel data can be copied from the GPU to the CPU.
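For reference, here's a minimal sketch of that upload path, assuming a GPUDevice called device and a decoded frame; the texture format and usage flags are just illustrative:
// Assumes `device: GPUDevice` and `frame: VideoFrame` already exist.
const texture = device.createTexture({
  size: [frame.displayWidth, frame.displayHeight],
  format: "rgba8unorm",
  // copyExternalImageToTexture requires COPY_DST and RENDER_ATTACHMENT usage on the destination.
  usage: GPUTextureUsage.TEXTURE_BINDING | GPUTextureUsage.COPY_DST | GPUTextureUsage.RENDER_ATTACHMENT,
});
// Copy the frame straight into GPU memory (no getImageData / CPU round-trip).
device.queue.copyExternalImageToTexture(
  { source: frame },
  { texture },
  [frame.displayWidth, frame.displayHeight],
);
frame.close(); // release the underlying frame once it has been copied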
Motivation
Accessing pixel data for video frames involves a copy from the GPU to the CPU, and when using the WebGPU backend, another copy from the CPU back to the GPU. This is a classic video performance issue and will hurt the frame rate, although I don't know by how much.
More info: https://developer.chrome.com/blog/from-webgl-to-webgpu#video_frame_processing
Your contribution
I don't know enough about the underlying ONNX runtime to know if this would work. If it sounds reasonable I could try hacking something together.
Going to take a break because there are other things I should be working on, but I did manage to get "working" output. I think there's a bug somewhere in the shader since it doesn't produce exactly the same output as the ImageProcessor, but it's close enough that object detection and depth estimation work, with a slightly higher frame rate (~7 -> ~9 fps on my M3). The depth estimation demo could be improved further, since the resulting buffer could be kept on the GPU too.
@xenova Great library. Do you think a GPU analog of RawImage would make sense? Basically convert ImageProcessor to a shader.
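To make the idea concrete, here's a rough sketch of what such a GPU-side analog could look like. Every name here is hypothetical; nothing like this exists in transformers.js today:
import { Tensor } from "@huggingface/transformers";

// Hypothetical API sketch only; none of these types exist in the library.
// The idea: a RawImage-like wrapper whose backing store is a GPUTexture, plus an
// ImageProcessor variant that runs resize/rescale/normalize as a compute pass.
interface GpuImage {
  texture: GPUTexture | GPUExternalTexture; // e.g. imported from a VideoFrame
  width: number;
  height: number;
}

interface GpuImageProcessor {
  // Returns an NCHW float32 Tensor whose data stays in GPU memory ("gpu-buffer" location).
  preprocess(image: GpuImage): Promise<{ pixel_values: Tensor }>;
}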
Vibe-coded example. I purposely didn't want to modify transformers.js and managed to figure out enough workarounds:
import {
AutoModel,
AutoProcessor,
env,
ImageProcessor,
Tensor,
} from "@huggingface/transformers";
const modelId = "Xenova/gelan-c_all";
const model = await AutoModel.from_pretrained(modelId, {
device: "webgpu",
dtype: "fp32",
});
const processor = await AutoProcessor.from_pretrained(modelId);
const featureExtractor = processor.feature_extractor;
if (!(featureExtractor instanceof ImageProcessor)) throw new Error("Feature extractor is not an ImageProcessor");
// TODO Remove the sleep; this is a hack to wait for the device to be initialized.
// There's probably a better way to do this or just modify transformers.js to expose it.
await new Promise((resolve) => setTimeout(resolve, 1000));
// We need to use the same device as onnx otherwise we get a device mismatch error for the gpu-buffer.
const device = await env.backends.onnx.webgpu?.device;
if (!device) throw new Error("Device not found");
// Initialize shader module.
// Claude wrote this; I haven't really inspected it for correctness lul. 100% something is wrong.
const shaderCode = `
@group(0) @binding(0) var inputTexture: texture_external;
@group(0) @binding(1) var<storage, read_write> outputBuffer: array<f32>;
struct Uniforms {
inputDims: vec2<f32>, // width, height
outputDims: vec2<f32>, // width, height
scale: vec2<f32>, // scaling factors
rescaleFactor: f32,
doNormalize: f32, // 0.0 or 1.0 for boolean
mean: vec3<f32>,
_pad0: f32, // padding for alignment
stdDev: vec3<f32>,
_pad1: f32, // padding for alignment
};
@group(0) @binding(2) var<uniform> uniforms: Uniforms;
fn sampleBilinear(tex: texture_external, uv: vec2<f32>) -> vec4<f32> {
let texSize = uniforms.inputDims;
let texCoord = uv * texSize - 0.5;
let tl = floor(texCoord);
let br = tl + 1.0;
let f = fract(texCoord);
// Clamp coordinates
let tlClamped = clamp(tl, vec2<f32>(0.0), texSize - 1.0);
let brClamped = clamp(br, vec2<f32>(0.0), texSize - 1.0);
// Convert to integer coordinates for textureLoad
let tlInt = vec2<i32>(i32(tlClamped.x), i32(tlClamped.y));
let brInt = vec2<i32>(i32(brClamped.x), i32(brClamped.y));
// Sample four pixels
let p00 = textureLoad(tex, tlInt);
let p10 = textureLoad(tex, vec2<i32>(brInt.x, tlInt.y));
let p01 = textureLoad(tex, vec2<i32>(tlInt.x, brInt.y));
let p11 = textureLoad(tex, brInt);
// Bilinear interpolation
let p0 = mix(p00, p10, f.x);
let p1 = mix(p01, p11, f.x);
return mix(p0, p1, f.y);
}
@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
let x = id.x;
let y = id.y;
let outputDims = vec2<u32>(uniforms.outputDims);
if (x >= outputDims.x || y >= outputDims.y) {
return;
}
// Calculate normalized UV coordinates
let uv = (vec2<f32>(f32(x), f32(y)) + 0.5) / uniforms.outputDims;
// Sample with bilinear interpolation
let pixel = sampleBilinear(inputTexture, uv);
// Apply preprocessing: rescale first (usually from [0,255] to [0,1])
var processed = pixel.rgb * uniforms.rescaleFactor;
// Apply normalization if enabled
if (uniforms.doNormalize > 0.5) {
processed = (processed - uniforms.mean) / uniforms.stdDev;
}
// Write to buffer in NCHW format (batch=1, channels=3, height, width)
let pixelIndex = y * outputDims.x + x;
let channelStride = outputDims.x * outputDims.y;
outputBuffer[0 * channelStride + pixelIndex] = processed.r;
outputBuffer[1 * channelStride + pixelIndex] = processed.g;
outputBuffer[2 * channelStride + pixelIndex] = processed.b;
}
`;
const shaderModule = device.createShaderModule({
code: shaderCode,
});
const pipeline = device.createComputePipeline({
layout: "auto",
compute: {
module: shaderModule,
entryPoint: "main",
},
});
// Left to the reader to figure out how to get a VideoFrame.
// You can construct it from a variety of sources: https://developer.mozilla.org/en-US/docs/Web/API/VideoFrame
// Placeholder: e.g. from a canvas already on the page (a timestamp is required for canvas sources).
const videoFrame = new VideoFrame(document.querySelector("canvas")!, { timestamp: 0 });
const { codedWidth, codedHeight } = videoFrame;
// Get the target size from processor config
// The processor resizes based on shortest_edge while maintaining aspect ratio
const targetSize = featureExtractor.size.shortest_edge || featureExtractor.size.height || featureExtractor.size;
// Calculate dimensions maintaining aspect ratio
let targetWidth: number;
let targetHeight: number;
if (codedWidth < codedHeight) {
// Width is shorter
targetWidth = Math.round(targetSize);
targetHeight = Math.round((targetSize * codedHeight) / codedWidth);
} else {
// Height is shorter
targetHeight = Math.round(targetSize);
targetWidth = Math.round((targetSize * codedWidth) / codedHeight);
}
// Ensure dimensions are divisible by size_divisibility (if set)
const divisibility = featureExtractor.size_divisibility || 1;
targetWidth = Math.floor(targetWidth / divisibility) * divisibility;
targetHeight = Math.floor(targetHeight / divisibility) * divisibility;
// Import VideoFrame as external texture
const externalTexture = device.importExternalTexture({
source: videoFrame,
});
// Create output buffer for tensor data
const outputBufferSize = 3 * targetWidth * targetHeight * 4; // 3 channels * width * height * sizeof(float32)
// Align buffer size to 16 bytes as required by WebGPU
const alignedBufferSize = Math.ceil(outputBufferSize / 16) * 16;
const outputBuffer = device.createBuffer({
size: alignedBufferSize,
// Include all necessary usage flags for ONNX Runtime WebGPU
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
// Create uniform buffer with preprocessing parameters
// Ensure proper alignment (vec3 needs 16-byte alignment in uniforms)
// Check if do_normalize is explicitly true (config shows it's false by default)
const doNormalize = featureExtractor.do_normalize === true ? 1.0 : 0.0;
const mean = featureExtractor.image_mean || [0.485, 0.456, 0.406];
const std = featureExtractor.image_std || [0.229, 0.224, 0.225];
const rescaleFactor = 1.0; // Don't rescale since input is already [0,1]
const uniformData = new Float32Array([
codedWidth,
codedHeight, // inputDims (vec2)
targetWidth,
targetHeight, // outputDims (vec2)
codedWidth / targetWidth, // scale.x
codedHeight / targetHeight, // scale.y
rescaleFactor, // rescaleFactor
doNormalize, // doNormalize (as float)
mean[0],
mean[1],
mean[2],
0, // mean (vec3 + padding)
std[0],
std[1],
std[2],
0, // std (vec3 + padding)
]);
const uniformBuffer = device.createBuffer({
size: uniformData.byteLength,
usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(uniformBuffer, 0, uniformData);
// Create bind group
const bindGroup = device.createBindGroup({
layout: pipeline.getBindGroupLayout(0),
entries: [
{ binding: 0, resource: externalTexture },
{ binding: 1, resource: { buffer: outputBuffer } },
{ binding: 2, resource: { buffer: uniformBuffer } },
],
});
// Execute compute shader
const commandEncoder = device.createCommandEncoder();
const computePass = commandEncoder.beginComputePass();
computePass.setPipeline(pipeline);
computePass.setBindGroup(0, bindGroup);
computePass.dispatchWorkgroups(Math.ceil(targetWidth / 8), Math.ceil(targetHeight / 8));
computePass.end();
// Submit GPU commands
device.queue.submit([commandEncoder.finish()]);
// Wait for GPU operations to complete
await device.queue.onSubmittedWorkDone();
// Create ONNX tensor from GPU buffer
// Note: This requires ONNX Runtime Web with WebGPU support
const dims = [1, 3, targetHeight, targetWidth];
// Create an ONNX tensor that references the GPU buffer
// This avoids copying data to the CPU; it requires using the same GPUDevice as ONNX Runtime (see above), otherwise you get a device mismatch error
// @ts-expect-error Abusing transformers.js internals to construct an ort Tensor (not exported).
const tensor = new Tensor({ location: "gpu-buffer", type: "float32", gpuBuffer: outputBuffer, dims });
// Cleanup - only destroy uniform buffer
// Don't destroy outputBuffer since it's being used by the GPU tensor
// ONNX Runtime will manage its lifecycle
uniformBuffer.destroy();
// Add required metadata for transformers.js
const inputs = {
pixel_values: tensor,
original_sizes: [[codedHeight, codedWidth]],
reshaped_input_sizes: [[targetHeight, targetWidth]],
};
const { outputs } = await model(inputs);
// Reshaped input size, useful for scaling boxes back to the original frame size.
const [height, width] = inputs.reshaped_input_sizes[0];
for (const [xmin, ymin, xmax, ymax, score, id] of outputs.tolist()) {
if (score < 0.5) continue;
// @ts-expect-error
const label = model.config.id2label[id];
console.log(label, score);
}
And just to clarify further: virtually all of the texture -> tensor code is AI-written and likely subtly wrong. I'm just excited that it seems to work; I haven't reviewed it for correctness or written a proper shader yet.
This actually intersects with something I did recently for the WASM part: https://github.com/JSmith01/ai-image-converter. The library is to be published to NPM, and it's not 100% complete (but it works).
I'm also interested in passing data directly through WebGPU. Indeed, onnxruntime-web supports loading input directly from WebGPU buffers: https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#create-input-tensor-from-a-gpu-buffer
And even for the output (if we're talking about realtime data processing), it isn't necessary to copy results to CPU memory: the final tensor could be transformed into an ordinary image texture, to be used later either in the app's WebGPU pipeline or rendered to an OffscreenCanvas / HTMLCanvasElement / VideoFrame.
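To make that concrete, here's a minimal sketch of the onnxruntime-web side, assuming an InferenceSession created with the webgpu execution provider and a float32 NCHW GPUBuffer like the one produced by the compute shader above (the import path and variable names may differ depending on your onnxruntime-web version):
import * as ort from "onnxruntime-web/webgpu";

// Assumes `session: ort.InferenceSession` (webgpu EP) and `outputBuffer: GPUBuffer`
// holding float32 data in NCHW layout with the dims below.
const dims = [1, 3, targetHeight, targetWidth];
const pixelValues = ort.Tensor.fromGpuBuffer(outputBuffer, { dataType: "float32", dims });
// The input stays on the GPU; no readback before inference.
const results = await session.run({ pixel_values: pixelValues });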
Yeah, I spent a few hours doing it properly by using onnxruntime-web directly. I was getting mixed performance results, but anecdotally it shaves about 6ms off each frame. Unfortunately I'm using the depth-anything-v2 example and the model itself takes around 100ms, so the benefits are marginal by comparison.
I'm going to be too busy in the next week to continue but I can share my code if you're interested.
Oh, and to clarify: yeah, I've also got the output gpu-buffer working, but I think it could be simplified (Claude had trouble). It would also be faster with buffer reuse instead of just relying on preferredOutputLocation: 'gpu-buffer'.
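For anyone following along, keeping the output on the GPU is a session option in onnxruntime-web. A minimal sketch, using the same ort import as above; modelUrl, feeds, and the output name "predicted_depth" are placeholders:
const session = await ort.InferenceSession.create(modelUrl, {
  executionProviders: ["webgpu"],
  // Keep this output in GPU memory instead of downloading it after every run.
  preferredOutputLocation: { predicted_depth: "gpu-buffer" },
});
const results = await session.run(feeds);
const depth = results.predicted_depth;
// depth.location should be "gpu-buffer"; depth.gpuBuffer is the backing GPUBuffer,
// which can feed a follow-up compute or render pass without a CPU download.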
Hi there 👋 thanks for your initial exploration! Using WebCodecs is definitely something I would like to add in the future. I actually looked into it in the past, but since it requires additional libraries like mp4box.js for demuxing, and since we haven't added many video tasks, I turned my attention to other things.
Having the pipeline API directly consume a video/image/canvas element would also be a great UX improvement, as we could hide a lot of this complexity from the user.
Keep me up to date with your findings when you get some time to look into it again :) thanks for your contributions!
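In case it helps with scoping: only the demuxing step needs an external library; decoding itself is built into WebCodecs. A rough sketch, with the demuxer (mp4box.js, mediabunny, etc.) left out and the codec string just illustrative:
// The demuxer's job is to produce the codec config and EncodedVideoChunks.
const decoder = new VideoDecoder({
  output: (frame: VideoFrame) => {
    // Feed the frame to the GPU preprocessing path discussed above, then release it.
    frame.close();
  },
  error: (e) => console.error(e),
});
decoder.configure({ codec: "avc1.640028" }); // codec string (and any `description`) comes from the demuxer
// For every demuxed sample:
// decoder.decode(new EncodedVideoChunk({ type: "key", timestamp: 0, data }));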
There's an amazing library that was recently released called mediabunny: https://mediabunny.dev/ (https://github.com/Vanilagy/mediabunny), which could form the basis of our implementation. Definitely something to look into.
bundle sizes are also pretty small 👍