
WebGPU crash on Android Chrome running SmolVLM-256M-Instruct

Open sbrzz opened this issue 10 months ago • 6 comments

System Info

transformers.js 3.3.3 (via https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3)
Platform: Android 13
Browser: Chrome for Android 133.0.6943.50

webgpureport attached

webgpureport-2025-02-22T06-59-48-900Z.txt

Environment/Platform

  • [x] Website/web-app
  • [ ] Browser extension
  • [ ] Server-side (e.g., Node.js, Deno, Bun)
  • [ ] Desktop app (e.g., Electron)
  • [ ] Other (e.g., VSCode extension)

Description

I successfully ran SmolVLM-256M-Instruct on my development machine (WebGPU enabled) and saw the classic ~10x improvement over WASM. However, I get an error when I try to run the same code on the target device (Android).

I tried embed_tokens: "fp16" without success (the target device does not support it), then switched to embed_tokens: "fp32".

Chrome console output:

WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
A valid external Instance reference no longer exists.
Uncaught (in promise) AbortError: Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.

I assumed this problem had already been fixed, based on https://github.com/huggingface/transformers.js/issues/943.

Any ideas?

Reproduction

Code to Reproduce

import { 
    AutoProcessor,
    AutoModelForVision2Seq,
    load_image,
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

console.log("vlm.js");

const DEBUG_MODE = true;

globalThis.whatsInTheImage = async function (imagePath) {
    console.log(imagePath);

    // Track execution times
    const timings = {};
    function logTime(label) {
        if (DEBUG_MODE){
            const now = performance.now();
            if (!timings[label]) {
                timings[label] = now;
            } else {
                console.log(`${label} took ${(now - timings[label]).toFixed(2)}ms`);
                delete timings[label];
            }
        }
    }

    // Load image
    logTime("Image Loading");
    const image1 = await load_image(imagePath);
    logTime("Image Loading");

    // Load processor and model
    const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";

    logTime("Processor Loading");
    const processor = await AutoProcessor.from_pretrained(model_id);
    logTime("Processor Loading");

    logTime("Model Loading");
    const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
        dtype: {
            embed_tokens: "fp32", 
            vision_encoder: "q4", 
            decoder_model_merged: "q4", 
        },
        device: "webgpu",
    });
    logTime("Model Loading");

    // Prepare input messages
    const messages = [
        {
            role: "user",
            content: [
                { type: "image" },
                { type: "text", text: "Can you describe this artistic image?" },
            ],
        },
    ];

    // Process text
    logTime("Text Processing");
    const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
    logTime("Text Processing");

    logTime("Processor Apply");
    const inputs = await processor(text, [image1], {
        do_image_splitting: false,
    });
    logTime("Processor Apply");

    // Generate output
    logTime("Model Generation");
    const generated_ids = await model.generate({
        ...inputs,
        max_new_tokens: 500,
    });
    logTime("Model Generation");

    logTime("Batch Decoding");
    const generated_texts = processor.batch_decode(
        generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]), 
        { skip_special_tokens: true },
    );
    logTime("Batch Decoding");

    return generated_texts[0];
};

sbrzz avatar Feb 22 '25 12:02 sbrzz

I tried embed_tokens: "fp16" without success (no support from the target device), then switched to embed_tokens: "fp32".

Just a note on this topic: the webgpureport from the Android device is missing the shader-f16 feature, while the same feature is listed on the development machine. I suppose this is the source of the runtime error.

It could make sense to check the availability of this feature before the code breaks at runtime. See https://developer.chrome.com/blog/new-in-webgpu-120:

const adapter = await navigator.gpu.requestAdapter();
if (!adapter.features.has("shader-f16")) {
  throw new Error("16-bit floating-point value support is not available");
}
// Explicitly request 16-bit floating-point value support.
const device = await adapter.requestDevice({
  requiredFeatures: ["shader-f16"],
});

const code = `
  enable f16;

  @compute @workgroup_size(1)
  fn main() {
    const c : vec3h = vec3<f16>(1.0h, 2.0h, 3.0h);
  }
`;

const shaderModule = device.createShaderModule({ code });
// Create a compute pipeline with this shader module
// and run the shader on the GPU...
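
Applying that check to the reproduction code above could look like the sketch below (reusing the same model_id and dtype options from my snippet; only the embed_tokens dtype is chosen based on the adapter's features):

const adapter = await navigator.gpu?.requestAdapter();
const hasF16 = adapter?.features?.has("shader-f16") ?? false;

// Sketch only: fall back to fp32 embeddings when shader-f16 is unavailable
// (as on this Android device), otherwise keep fp16.
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
    dtype: {
        embed_tokens: hasF16 ? "fp16" : "fp32",
        vision_encoder: "q4",
        decoder_model_merged: "q4",
    },
    device: "webgpu",
});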

sbrzz avatar Feb 23 '25 08:02 sbrzz

Hitting the same issue (Failed to execute 'mapAsync' on 'GPUBuffer': A valid external Instance reference no longer exists.) with the Florence 2 example, also in the browser, with [email protected]

hacktronics avatar Feb 23 '25 18:02 hacktronics

Update: I made it work by skipping the layer provided by transformers.js (apart from preprocessing) and using onnxruntime-web directly, even though the generated tokens make no sense (help understanding that bug is appreciated).

I definitely think this points to a bug in transformers.js; let's see if there is time to find and patch it with a PR.

This is the working code:


import { 
  AutoProcessor,
  load_image,
  AutoConfig
} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.3.3';

class SmolVLMInference {
    constructor(config) {
      // Model configuration
      this.modelId = "HuggingFaceTB/SmolVLM-256M-Instruct";
      this.config = {
        text_config: {
          num_key_value_heads: config.text_config.num_key_value_heads,
          head_dim: config.text_config.head_dim,
          num_hidden_layers: config.text_config.num_hidden_layers,
          eos_token_id: config.text_config.eos_token_id
        },
        image_token_id: config.image_token_id
      };
      
      // Initialize sessions and processor
      this.visionSession = null;
      this.embedSession = null;
      this.decoderSession = null;
      this.processor = null;
      
      // Model parameters from config
      this.numKeyValueHeads = this.config.text_config.num_key_value_heads;
      this.headDim = this.config.text_config.head_dim;
      this.numHiddenLayers = this.config.text_config.num_hidden_layers;
      this.eosTokenId = this.config.text_config.eos_token_id;
      this.imageTokenId = this.config.image_token_id;
    }
  
    // Initialize ONNX sessions
    async loadModels() {
      try {
        console.log("Loading ONNX models...");
        
        // Load all three models in parallel
        [this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
          ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'] }),
          ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'] }),
          ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'] })
        ]);
        
        console.log("Models loaded successfully!");
        return true;
      } catch (error) {
        console.error("Error loading models:", error);
        return false;
      }
    }
  
    // Simplified token decoder
    decodeTokens(tokens) {
      // This is a very simplified decoder
      return tokens.map(t => String.fromCharCode(97 + (Number(t) % 26))).join("");
    }

    async officialPreproc(imageUrl, question){

      const image1 = await load_image(imageUrl);

      // Load processor and model
      const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";

      const processor = await AutoProcessor.from_pretrained(model_id);

      const messages = [
          {
              role: "user",
              content: [
                  { type: "image" },
                  { type: "text", text: question },
              ],
          },
      ];
      const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
      const inputs = await processor(text, [image1], {
        do_image_splitting: false,
      });

      return inputs;
    }
  
    // Main inference function
    async generateText(imageUrl, question, maxNewTokens = 1024) {
      try {

        const officialInputProcessing = await this.officialPreproc(imageUrl, question);
        
        // Prepare decoder inputs
        const batchSize = 1;
        let pastKeyValues = {};
        for (let layer = 0; layer < this.numHiddenLayers; layer++) {
          for (let kv of ['key', 'value']) {
            pastKeyValues[`past_key_values.${layer}.${kv}`] = new ort.Tensor(
              'float32', 
              new Float32Array(0), 
              [batchSize, this.numKeyValueHeads, 0, this.headDim]
            );
          }
        }
        
        let imageFeatures = null;
        let inputIds = officialInputProcessing.input_ids;
        let attentionMask = officialInputProcessing.attention_mask;
        
        // Calculate position IDs
        let positionIds = this.calculatePositionIds(attentionMask);
        
        // Generation loop
        let generatedTokens = [];
        let outputText = "";
        
        console.log("Starting generation...");
        
        for (let i = 0; i < maxNewTokens; i++) {
          // Get token embeddings
          const inputIdsArray = this.getTensorData(inputIds);
          const embedFeed = { 'input_ids': inputIds };
          const embedResult = await this.embedSession.run(embedFeed);
          const inputsEmbeds = embedResult.inputs_embeds; // the embed_tokens graph's output is named 'inputs_embeds'
          
          // Process image if needed
          if (imageFeatures === null) {
            const visionFeed = {
              'pixel_values': officialInputProcessing.pixel_values,
              'pixel_attention_mask': officialInputProcessing.pixel_attention_mask
            };
            
            const visionResult = await this.visionSession.run(visionFeed);
            imageFeatures = visionResult.image_features;
            
          }
          
          // Run decoder model
          const decoderFeeds = {
            'inputs_embeds': inputsEmbeds,
            'attention_mask': attentionMask,
            'position_ids': positionIds,
            ...pastKeyValues
          };
          
          const decoderResults = await this.decoderSession.run(decoderFeeds);
          const logits = decoderResults.logits;
          const presentKeyValues = decoderResults.present_key_values || [];
          
          // Get next token (argmax of last logits)
          const nextToken = this.getNextToken(logits);
          
          // Update for next iteration
          inputIds = new ort.Tensor('int64', new BigInt64Array([BigInt(nextToken)]), [1, 1]);
          attentionMask = new ort.Tensor('int64', new BigInt64Array([1n]), [1, 1]);
          positionIds = new ort.Tensor('int64', new BigInt64Array([BigInt(this.getTensorData(positionIds)[0] + BigInt(1))]), [1, 1]);
          
          // Update past key values
          // TODO: feed the decoder's present key/value outputs (typically named
          // present.{layer}.{key|value}) back into pastKeyValues, and merge
          // imageFeatures into inputsEmbeds at the image-token positions;
          // neither is done here yet, which is likely why the generated tokens make no sense.
          
          // Add token to generated sequence
          generatedTokens.push(nextToken);
          
          // Decode token and add to output text
          const tokenText = this.decodeTokens([nextToken]);
          outputText += tokenText;
          
          // Optional streaming output
          if (i % 5 === 0) {
            console.log("Generation progress:", outputText);
          }
          
          // Check for EOS token
          if (nextToken === this.eosTokenId) {
            break;
          }
        }
        
        console.log("Generation complete!");
        return outputText;
      } catch (error) {
        console.error("Error in generation:", error);
        return "An error occurred during text generation.";
      }
    }
  
    // Helper to calculate position IDs from attention mask
    calculatePositionIds(attentionMask) {
      const attentionArray = this.getTensorData(attentionMask);
      const positionArray = new BigInt64Array(attentionArray.length);
      
      let position = 0n;
      for (let i = 0; i < attentionArray.length; i++) {
        if (attentionArray[i] === 1n) {
          positionArray[i] = BigInt(position);
          position++;
        } else {
          positionArray[i] = 0n;
        }
      }
      
      return new ort.Tensor('int64', positionArray, attentionMask.dims);
    }
  
    // Helper to get next token from logits
    getNextToken(logits) {
      // Get the last token's logits
      const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));
      
      // Find the index of the maximum value (argmax)
      let maxIndex = 0;
      let maxValue = lastLogits[0];
      
      for (let i = 1; i < lastLogits.length; i++) {
        if (lastLogits[i] > maxValue) {
          maxValue = lastLogits[i];
          maxIndex = i;
        }
      }
      
      return maxIndex;
    }
  
    // Helper to get tensor data as array
    getTensorData(tensor) {
      return tensor.data;
    }
  }
  
  // Usage example
  async function runSmolVLM() {

    let model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
    const config = await AutoConfig.from_pretrained(model_id);
    const inferenceEngine = new SmolVLMInference(config);
    
    // Step 1: Load models
    const modelsLoaded = await inferenceEngine.loadModels();
    if (!modelsLoaded) {
      console.error("Failed to load models");
      return;
    }
    
    // Step 2: Run inference
    const imageUrl = "./Statue-of-Liberty-Island-New-York-Bay.jpg";
    const question = "Can you describe this image?";
    
    console.log("Running inference on image:", imageUrl);
    console.log("Question:", question);
    
    const result = await inferenceEngine.generateText(imageUrl, question);
    
    // Step 3: Show results
    console.log("Generated text:");
    console.log(result);
    
    // Display in UI if needed
    if (document.getElementById('result')) {
      document.getElementById('result').textContent = result;
    }
  }

// Add this at the bottom of your smolvlm.js file
export { SmolVLMInference, runSmolVLM };

<!DOCTYPE html>
<html>
<head>
  <title>SmolVLM Demo</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/onnxruntime-web/1.20.1/ort.webgpu.min.js"></script>
  <script type="module" src="smolvlm.js"></script>
</head>
<body>
    <h1>SmolVLM Image Captioning</h1>
    <button id="runButton">Run Model</button>
    <div id="result"></div>

    <script type="module">
    // Import the function from your module
    import { runSmolVLM } from './smolvlm.js';
    
    // Add event listener to button
    document.getElementById('runButton').addEventListener('click', async () => {
        try {
            await runSmolVLM();
        } catch (error) {
            console.error("Error running SmolVLM:", error);
            document.getElementById('result').textContent = "Error: " + error.message;
        }
    });
    </script>
</body>
</html>

sbrzz avatar Feb 25 '25 13:02 sbrzz

One "optimization" which transformers.js adds is to use preferredOutputLocation to keep the kv cache on GPU between forward passes: https://onnxruntime.ai/docs/api/js/interfaces/InferenceSession.SessionOptions.html#preferredOutputLocation

Maybe try add that to your sample code to see whether this is the cause of the issue?

xenova avatar Feb 25 '25 14:02 xenova

@xenova I suppose you're talking about this: https://github.com/huggingface/transformers.js/blob/c2ab81af062d32ad46e892e7ea5c554ca14117de/src/models.js#L297

I tried to globally set gpu-buffer like this:


// Load all three models in parallel
[this.visionSession, this.embedSession, this.decoderSession] = await Promise.all([
ort.InferenceSession.create('./vision_encoder_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
ort.InferenceSession.create('./embed_tokens_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
ort.InferenceSession.create('./decoder_model_merged_q4.onnx', { executionProviders: ['webgpu'], preferredOutputLocation: "gpu-buffer" }),
]);

An error is raised because I try to access the logits data while it is on the GPU rather than the CPU; it happens here:

const lastLogits = Array.from(this.getTensorData(logits).slice(-logits.dims[2]));

Is the same issue hitting transformers.js?

Error in generation: Error: The data is not on CPU. Use `getData()` to download GPU data to CPU, or use `texture` or `gpuBuffer` property to access the GPU data directly.
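
For what it's worth, preferredOutputLocation also accepts an object keyed by output name, so only the KV cache has to stay on the GPU while the logits remain on the CPU. A minimal sketch, assuming the merged decoder's cache outputs follow the usual present.{layer}.{key|value} naming:

// Sketch only: keep just the KV-cache outputs on the GPU.
const numHiddenLayers = config.text_config.num_hidden_layers;
const kvCacheLocation = {};
for (let layer = 0; layer < numHiddenLayers; layer++) {
  kvCacheLocation[`present.${layer}.key`] = "gpu-buffer";
  kvCacheLocation[`present.${layer}.value`] = "gpu-buffer";
}

const decoderSession = await ort.InferenceSession.create('./decoder_model_merged_q4.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation: kvCacheLocation, // logits are not listed, so they stay on the CPU
});

// If a tensor does end up on the GPU, download it before reading it:
const logitsData = logits.location === 'gpu-buffer' ? await logits.getData() : logits.data;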

sbrzz avatar Feb 25 '25 15:02 sbrzz

FYI, a working example of SmolVLM with transformers.js and ONNX can be found here: https://www.gradients.zone/blog/multimodal-llms-at-the-edge/

sbrzz avatar Mar 14 '25 02:03 sbrzz