WebGPU crash on Android Chrome running SmolVLM-256M-Instruct
System Info
transformers.js 3.3.3 (via https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3)
Platform: Android 13, Chrome for Android 133.0.6943.50
webgpureport attached
webgpureport-2025-02-22T06-59-48-900Z.txt
Environment/Platform
- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)
Description
I successfully ran SmolVLM-256M-Instruct on my development machine (WebGPU enabled) and saw the classic ~10x improvement over WASM. However, I get an error when I run the same code on the target device (Android).
I tried embed_tokens: "fp16" without success (the target device doesn't support it), then switched to embed_tokens: "fp32".
Chrome console output:
WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
A valid external Instance reference no longer exists.
Uncaught (in promise) AbortError: Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.
I assumed this problem had been fixed, based on https://github.com/huggingface/transformers.js/issues/943
Any ideas?
Reproduction
Code to Reproduce
import {
AutoProcessor,
AutoModelForVision2Seq,
load_image,
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';
console.log("vlm.js");
const DEBUG_MODE = true;
globalThis.whatsInTheImage = async function (imagePath) {
console.log(imagePath);
// Track execution times
const timings = {};
function logTime(label) {
if (DEBUG_MODE){
const now = performance.now();
if (!timings[label]) {
timings[label] = now;
} else {
console.log(`${label} took ${(now - timings[label]).toFixed(2)}ms`);
delete timings[label];
}
}
}
// Load image
logTime("Image Loading");
const image1 = await load_image(imagePath);
logTime("Image Loading");
// Load processor and model
const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
logTime("Processor Loading");
const processor = await AutoProcessor.from_pretrained(model_id);
logTime("Processor Loading");
logTime("Model Loading");
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
dtype: {
embed_tokens: "fp32",
vision_encoder: "q4",
decoder_model_merged: "q4",
},
device: "webgpu",
});
logTime("Model Loading");
// Prepare input messages
const messages = [
{
role: "user",
content: [
{ type: "image" },
{ type: "text", text: "Can you describe this artistic image?" },
],
},
];
// Process text
logTime("Text Processing");
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
logTime("Text Processing");
logTime("Processor Apply");
const inputs = await processor(text, [image1], {
do_image_splitting: false,
});
logTime("Processor Apply");
// Generate output
logTime("Model Generation");
const generated_ids = await model.generate({
...inputs,
max_new_tokens: 500,
});
logTime("Model Generation");
logTime("Batch Decoding");
const generated_texts = processor.batch_decode(
generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
{ skip_special_tokens: true },
);
logTime("Batch Decoding");
return generated_texts[0];
};
Just a note on this topic: webgpureport on the Android platform is missing the shader-f16 feature, while the same feature is listed on the development machine. I suppose this is the source of the runtime error.
It could make sense to check the availability of this feature before the code breaks. See https://developer.chrome.com/blog/new-in-webgpu-120:
const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has("shader-f16")) {
throw new Error("16-bit floating-point value support is not available");
}
// Explicitly request 16-bit floating-point value support.
const device = await adapter.requestDevice({
requiredFeatures: ["shader-f16"],
});
const code = `
enable f16;
@compute @workgroup_size(1)
fn main() {
const c : vec3h = vec3<f16>(1.0h, 2.0h, 3.0h);
}
`;
const shaderModule = device.createShaderModule({ code });
// Create a compute pipeline with this shader module
// and run the shader on the GPU...
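Applied to the transformers.js call in the reproduction above, a minimal sketch of that check could look like this (the WASM fallback is my own assumption, not something the library does automatically):
const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = !!adapter?.features.has("shader-f16");
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: hasF16 ? "fp16" : "fp32", // fall back to fp32 when shader-f16 is missing
    vision_encoder: "q4",
    decoder_model_merged: "q4",
  },
  device: adapter ? "webgpu" : "wasm", // assumption: fall back to WASM when WebGPU is unavailable
});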
Hitting the same issue (Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.) with the Florence-2 example, also in the browser with [email protected]
Update: I got it working by skipping the layer provided by transformers.js (apart from preprocessing) and using onnxruntime-web directly, even though the generated tokens make no sense (help understanding that bug is appreciated).
I definitely think this is proof of a bug in transformers.js; let's see if there is time to find it and patch it with a PR.
This is the working code:
import {
AutoProcessor,
load_image,
AutoConfig
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';
class SmolVLMInference {
constructor(config) {
// Model configuration
this.modelId = "HuggingFaceTB/SmolVLM-256M-Instruct";
this.config = {
text_config: {
num_key_value_heads: config.text_config.num_key_value_heads,
head_dim: config.text_config.head_dim,
num_hidden_layers: config.text_config.num_hidden_layers,
eos_token_id: config.text_config.eos_token_id
},
image_token_id: config.image_token_id
};
// Initialize sessions and processor
this.visionSession = null;
this.embedSession = null;
this.decoderSession = null;
this.processor = null;
// Model parameters from config
this.numKeyValueHeads = this.config.text_config.num_key_value_heads;
this.headDim = this.config.text_config.head_dim;
this.numHiddenLayers = this.config.text_config.num_hidden_layers;
this.eosTokenId = this.config.text_config.eos_token_id;
this.imageTokenId = this.config.image_token_id;
}
// Initialize ONNX sessions
async loadModels() {
try {
console.log("Loading ONNX models...");
// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'] }),
ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'] }),
ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'] })
]);
console.log("Models loaded successfully!");
return true;
} catch (error) {
console.error("Error loading models:", error);
return false;
}
}
// Simplified token decoder
decodeTokens(tokens) {
// This is a very simplified decoder
return tokens.map(t => String.fromCharCode(97 + (Number(t) % 26))).join("");
}
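// NOTE (editor sketch, hedged): this placeholder decoder is itself one reason the
// output reads as gibberish. Assuming the processor created in officialPreproc()
// is kept around (e.g. stored on this.processor), its real tokenizer could be
// reused instead:
//
//   decodeWithProcessor(tokens) {
//     // batch_decode accepts nested arrays of token ids (or a Tensor)
//     return this.processor.batch_decode([tokens.map(Number)], { skip_special_tokens: true })[0];
//   }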
async officialPreproc(imageUrl, question){
const image1 = await load_image(imageUrl);
// Load processor and model
const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const messages = [
{
role: "user",
content: [
{ type: "image" },
{ type: "text", text: question },
],
},
];
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1], {
do_image_splitting: false,
});
return inputs;
}
// Main inference function
async generateText(imageUrl, question, maxNewTokens = 1024) {
try {
const officialInputProcessing = await this.officialPreproc(imageUrl, question);
// Prepare decoder inputs
const batchSize = 1;
let pastKeyValues = {};
for (let layer = 0; layer < this.numHiddenLayers; layer++) {
for (let kv of ['key', 'value']) {
pastKeyValues[`past_key_values.${layer}.${kv}`] = new ort.Tensor(
'float32',
new Float32Array(0),
[batchSize, this.numKeyValueHeads, 0, this.headDim]
);
}
}
let imageFeatures = null;
let inputIds = officialInputProcessing.input_ids;
let attentionMask = officialInputProcessing.attention_mask;
// Calculate position IDs
let positionIds = this.calculatePositionIds(attentionMask);
// Generation loop
let generatedTokens = [];
let outputText = "";
console.log("Starting generation...");
for (let i = 0; i < maxNewTokens; i++) {
// Get token embeddings
const inputIdsArray = this.getTensorData(inputIds);
const embedFeed = { 'input_ids': inputIds };
const embedResult = await this.embedSession.run(embedFeed);
const inputsEmbeds = embedResult.inputs_embeds; // Assumes the output tensor is named 'inputs_embeds'
// Process image if needed
if (imageFeatures === null) {
const visionFeed = {
'pixel_values': officialInputProcessing.pixel_values,
'pixel_attention_mask': officialInputProcessing.pixel_attention_mask
};
const visionResult = await this.visionSession.run(visionFeed);
imageFeatures = visionResult.image_features;
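// NOTE (editor sketch, hedged): transformers.js merges these vision features into
// inputs_embeds before the first decoder pass: every position where
// input_ids === image_token_id is overwritten with the next image-feature row.
// Skipping that merge means the decoder only ever sees placeholder embeddings for
// the image, which alone can explain the nonsensical output. Roughly (assuming
// image_features flattens to [numImageTokens, hiddenSize]):
//
//   const ids = this.getTensorData(officialInputProcessing.input_ids);
//   const hidden = inputsEmbeds.dims[2];
//   let f = 0;
//   for (let p = 0; p < ids.length; p++) {
//     if (Number(ids[p]) === this.imageTokenId) {
//       inputsEmbeds.data.set(
//         imageFeatures.data.subarray(f * hidden, (f + 1) * hidden),
//         p * hidden
//       );
//       f++;
//     }
//   }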
}
// Run decoder model
const decoderFeeds = {
'inputs_embeds': inputsEmbeds,
'attention_mask': attentionMask,
'position_ids': positionIds,
...pastKeyValues
};
const decoderResults = await this.decoderSession.run(decoderFeeds);
const logits = decoderResults.logits;
const presentKeyValues = decoderResults.present_key_values || [];
// Get next token (argmax of last logits)
const nextToken = this.getNextToken(logits);
// Update for next iteration
inputIds = new ort.Tensor('int64', new BigInt64Array([BigInt(nextToken)]), [1, 1]);
attentionMask = new ort.Tensor('int64', new BigInt64Array([1n]), [1, 1]);
positionIds = new ort.Tensor('int64', new BigInt64Array([this.getTensorData(positionIds).at(-1) + 1n]), [1, 1]); // next position = last previous position + 1
// Update past key values
// This would need proper handling of the present key values structure
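// NOTE (editor sketch, hedged): with the Optimum/transformers.js ONNX export the
// present tensors are usually individual outputs named `present.{layer}.key` and
// `present.{layer}.value`, not a single `present_key_values` output. Without
// feeding them back, and without growing the attention mask, every step after the
// first runs with an empty context. Roughly:
//
//   for (let layer = 0; layer < this.numHiddenLayers; layer++) {
//     for (const kv of ['key', 'value']) {
//       pastKeyValues[`past_key_values.${layer}.${kv}`] = decoderResults[`present.${layer}.${kv}`];
//     }
//   }
//   // and the attention mask should cover past + current tokens (its length grows by 1 each step)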
// Add token to generated sequence
generatedTokens.push(nextToken);
// Decode token and add to output text
const tokenText = this.decodeTokens([nextToken]);
outputText += tokenText;
// Optional streaming output
if (i % 5 === 0) {
console.log("Generation progress:", outputText);
}
// Check for EOS token
if (nextToken === this.eosTokenId) {
break;
}
}
console.log("Generation complete!");
return outputText;
} catch (error) {
console.error("Error in generation:", error);
return "An error occurred during text generation.";
}
}
// Helper to calculate position IDs from attention mask
calculatePositionIds(attentionMask) {
const attentionArray = this.getTensorData(attentionMask);
const positionArray = new BigInt64Array(attentionArray.length);
let position = 0n;
for (let i = 0; i < attentionArray.length; i++) {
if (attentionArray[i] === 1n) {
positionArray[i] = BigInt(position);
position++;
} else {
positionArray[i] = 0n;
}
}
return new ort.Tensor('int64', positionArray, attentionMask.dims);
}
// Helper to get next token from logits
getNextToken(logits) {
// Get the last token's logits
const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));
// Find the index of the maximum value (argmax)
let maxIndex = 0;
let maxValue = lastLogits[0];
for (let i = 1; i < lastLogits.length; i++) {
if (lastLogits[i] > maxValue) {
maxValue = lastLogits[i];
maxIndex = i;
}
}
return maxIndex;
}
// Helper to get tensor data as array
getTensorData(tensor) {
return tensor.data;
}
}
// Usage example
async function runSmolVLM() {
let model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
const config = await AutoConfig.from_pretrained(model_id);
const inferenceEngine = new SmolVLMInference(config);
// Step 1: Load models
const modelsLoaded = await inferenceEngine.loadModels();
if (!modelsLoaded) {
console.error("Failed to load models");
return;
}
// Step 2: Run inference
const imageUrl = "./Statue-of-Liberty-Island-New-York-Bay.jpg";
const question = "Can you describe this image?";
console.log("Running inference on image:", imageUrl);
console.log("Question:", question);
const result = await inferenceEngine.generateText(imageUrl, question);
// Step 3: Show results
console.log("Generated text:");
console.log(result);
// Display in UI if needed
if (document.getElementById('result')) {
document.getElementById('result').textContent = result;
}
}
// Add this at the bottom of your smolvlm.js file
export { SmolVLMInference, runSmolVLM };
<!DOCTYPE html>
<html>
<head>
<title>SmolVLM Demo</title>
<script src="https://cdnjs.cloudflare.com/ajax/libs/onnxruntime-web/1.20.1/ort.webgpu.min.js"></script>
<script type="module" src="smolvlm.js"></script>
</head>
<body>
<h1>SmolVLM Image Captioning</h1>
<button id="runButton">Run Model</button>
<div id="result"></div>
<script type="module">
// Import the function from your module
import { runSmolVLM } from './smolvlm.js';
// Add event listener to button
document.getElementById('runButton').addEventListener('click', async () => {
try {
await runSmolVLM();
} catch (error) {
console.error("Error running SmolVLM:", error);
document.getElementById('result').textContent = "Error: " + error.message;
}
});
</script>
</body>
</html>
One "optimization" which transformers.js adds is to use preferredOutputLocation to keep the kv cache on GPU between forward passes: https://onnxruntime.ai/docs/api/js/interfaces/InferenceSession.SessionOptions.html#preferredOutputLocation
Maybe try adding that to your sample code to see whether this is the cause of the issue?
@xenova I suppose you're talking about this: https://github.com/huggingface/transformers.js/blob/c2ab81af062d32ad46e892e7ea5c554ca14117de/src/models.js#L297
I tried to set gpu-buffer globally like this:
// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
]);
An error is raised because I try to access the logits data while it is on the GPU instead of the CPU; it happens here:
const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));
Is the same issue hitting transformers.js?
Error in generation: Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
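For reference, transformers.js doesn't map every output to the GPU: it builds a per-output preferredOutputLocation object (see the models.js link above) so that only the KV-cache outputs stay as GPU buffers while logits remain readable from the CPU. A rough sketch of that idea, assuming the decoder's present outputs follow the usual present.{layer}.key / present.{layer}.value naming:
const preferredOutputLocation = {};
for (let layer = 0; layer < this.numHiddenLayers; layer++) {
  preferredOutputLocation[`present.${layer}.key`] = 'gpu-buffer';
  preferredOutputLocation[`present.${layer}.value`] = 'gpu-buffer';
}
this.decoderSession = await ort.InferenceSession.create('./decoder_model_merged_q4.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation, // only the KV cache stays on the GPU; logits stay on the CPU
});
// And if a tensor does live on the GPU, download it before reading its data:
// const logitsData = await logits.getData();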
FYI, a working example of SmolVLM, transformers.js and ONNX can be found here: https://www.gradients.zone/blog/multimodal-llms-at-the-edge/