
[WIP] 🚀🚀🚀 Transformers.js V3 🚀🚀🚀

Open xenova opened this issue 1 year ago • 36 comments

In preparation for Transformers.js v3, I'm compiling a list of issues/features which will be fixed/included in the release.

  • [x] WebGPU support (upgrade onnxruntime-web to 1.17.0).
    • closes #20
    • closes #73
    • closes #79
    • closes #100
    • closes #119
    • closes #298
    • closes #377
    • closes #405
    • closes #505
    • closes #533
  • [ ] Fix logging/disable warnings (onnxruntime-web → 1.17.0). Closes:
    • closes #270
    • closes #529
  • [ ] Fix WASM backend for large models (onnxruntime-web → 1.17.0). Closes:
    • closes #499
  • [x] Deno support (upgrade sharp.js to 0.33.x). Closes:
    • closes #78
    • closes #541
  • [x] CommonJS compatibility. Closes #152
  • [ ] Skip the local model check when running in-browser, unless explicitly set by the user. This is an issue experienced by many beginners: requests made to localhost redirect to an error page, but the dev server incorrectly returns status code 200 (a sketch of the current workaround appears after this list). Closes
    • closes #328
    • closes #366
  • [ ] Improve unit test suite and allow local testing. Closes #491
  • [ ] Create model conversion space. Closes
    • closes #222
    • closes #274
  • [ ] Upgrade conversion script dependency versions (+fixes sentence-transformers conversions). Closes
    • closes #230
  • [ ] Versioned documentation, so that users still on v2 will be able to access the correct documentation.
  • [ ] Improve PretrainedModel, PretrainedTokenizer, and Processor types. In a similar way to how the pipeline API has conditional types, we'll add the same for the other classes accessible by users.
  • [ ] Consistency issues:
    • [x] topk -> top_k parameter.
    • [x] Tensor transpose -> permute
    • [ ] Like the Python library, encapsulate multimodal preprocessing in the AutoProcessor class, which combines the image processor and tokenizer
  • [ ] Improved model caching of revisions, and purge the cache if model files are out-of-date. Ideally, this should be gated behind a setting, to avoid unnecessary network calls to check versions when the files are already present in the cache.
  • [ ] Improve pipeline fallback errors
    • closes #314
  • [x] WebNN support
    • reference: https://github.com/xenova/transformers.js/pull/608
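
Until the "skip local model check" item above lands, the usual workaround is to disable local model loading explicitly so the library never probes localhost. A minimal sketch, assuming the existing env.allowLocalModels setting from v2 still applies on the v3 branch:

import { env, pipeline } from '@xenova/transformers';

// Skip the local model check entirely; always fetch from the Hugging Face Hub.
// This avoids the "dev server returns 200 for missing files" problem described above.
env.allowLocalModels = false;

const classifier = await pipeline('sentiment-analysis', 'Xenova/distilbert-base-uncased-finetuned-sst-2-english');
console.log(await classifier('Transformers.js v3 looks great!'));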

How to use WebGPU

First, install the development branch

npm install xenova/transformers.js#v3

Then specify the device parameter when loading the model. Here's example code to get started. Please note that this is still a WORK IN PROGRESS, so the following usage may change before release.

import { pipeline } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    device: 'webgpu',
    dtype: 'fp32', // or 'fp16'
});

// Generate embeddings
const sentences = ['That is a happy person', 'That is a very happy person'];
const output = await extractor(sentences, { pooling: 'mean', normalize: true });
console.log(output.tolist());

xenova avatar Jan 27 '24 17:01 xenova

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Hey! This is great. Is this already in alpha?

Huguet57 avatar Jan 31 '24 00:01 Huguet57

Team, is there any tentative timeline for releasing this v3 alpha?

kishorekaruppusamy avatar Feb 06 '24 11:02 kishorekaruppusamy

I can't wait anymore :) Please update me when it will be released!

jhpassion0621 avatar Feb 12 '24 17:02 jhpassion0621

@xenova Can I test the v3 alpha via NPM? When I try to run it, I get this issue (screenshot attached).

jhpassion0621 avatar Feb 14 '24 10:02 jhpassion0621

@xenova Can I test the v3 alpha via NPM? When I try to run it, I get this issue (screenshot attached).

Use this commit to resolve the issue: https://github.com/kishorekaruppusamy/transformers.js/commit/7af8ef1e5c37f3052ed3a8e38938595702836f09

kishorekaruppusamy avatar Feb 14 '24 11:02 kishorekaruppusamy

Thanks for your reply @kishorekaruppusamy. I tried your branch and ran into other issues (screenshot attached). Please give me your advice!

jhpassion0621 avatar Feb 15 '24 08:02 jhpassion0621

Thanks for your reply @kishorekaruppusamy. I tried your branch and ran into other issues (screenshot attached). Please give me your advice!

Change this URL to point to the local dist directory inside your build: https://github.com/kishorekaruppusamy/transformers.js/blob/V3_BRANCH_WEBGPU_BUG_FIX/src/backends/onnx.js#L144

kishorekaruppusamy avatar Feb 15 '24 08:02 kishorekaruppusamy
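
For anyone else hitting the WASM-path problem, a minimal sketch of overriding the location from user code instead of editing onnx.js, assuming env.backends.onnx.wasm.wasmPaths behaves on the v3 branch as it does in v2:

import { env } from '@xenova/transformers';

// Serve the onnxruntime-web files (the .wasm binaries and their JS loaders) from your own
// build output, and point the runtime at that directory instead of the hardcoded CDN URL.
env.backends.onnx.wasm.wasmPaths = '/dist/'; // hypothetical path; adjust to wherever your bundler copies the files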

Thanks @kishorekaruppusamy. I downloaded the latest WASM from onnxruntime and added it to a local directory, but I got the same issue (screenshot attached).

I realized transformers.js v3 uses onnxruntime 1.16.3, so I built the WASM with onnxruntime 1.16.3 and tested again, but got the same issue.

Please advise. Thanks

jhpassion0621 avatar Feb 15 '24 13:02 jhpassion0621

@xenova it looks like #596 is part of this release?! I think that means onnx_data files will be supported?

If true, I'm stoked!

Beyond upgrading ort to 1.17, are there other changes needed to support models with onnx_data files? Happy to try to lend a hand if possible

NawarA avatar Mar 06 '24 22:03 NawarA

Hi everyone! Today we released our first WebGPU x Transformers.js demo: The WebGPU Embedding Benchmark (online demo). If you'd like to help with testing, please run the benchmark and share your results! Thanks!


xenova avatar Mar 09 '24 01:03 xenova

@xenova can this benchmark pick GPU 1 instead of GPU 0, for laptops with a dGPU?

khmyznikov avatar Mar 11 '24 18:03 khmyznikov

@xenova can this benchmark pick GPU 1 instead of GPU 0, for laptops with a dGPU?

Not currently, but this is being worked on here: https://github.com/microsoft/onnxruntime/pull/19857. We will add support here once ready.

xenova avatar Mar 11 '24 22:03 xenova

@beaufortfrancois - I've added the source code for the video background removal demo. On my device, I get ~20fps w/ WebGPU support (w/ fp32 since fp16 is broken). Here's a screen recording (which drops my fps to ~14):

https://github.com/xenova/transformers.js/assets/26504141/3d462173-2a23-4a53-a146-a6dbb0e335c7

  • Model used: https://huggingface.co/Xenova/modnet (~4 years old, and it clearly struggles on hands moving quickly). I will try on more up-to-date models soon.
  • Video tested: https://www.youtube.com/watch?v=NXpdyAWLDas
  • Online demo: https://huggingface.co/spaces/Xenova/webgpu-video-background-removal

xenova avatar Mar 13 '24 18:03 xenova

@beaufortfrancois - I've added the source code for the video background removal demo. On my device, I get ~20fps w/ WebGPU support (w/ fp32 since fp16 is broken). Here's a screen recording (which drops my fps to ~14):

You rock. Thanks! It's a cool demo! 👍

I've been wondering how we could improve it:

  • I've noticed you read the current frame of the video on the main thread. Would it help to move the entire demo to a web worker?
  • output[0].mul(255).to('uint8') takes some non-negligible time to run. Is there a faster path?
  • How much do you expect fp16 to improve perf? In https://developer.chrome.com/blog/new-in-webgpu-120#support_for_16-bit_floating-point_values_in_wgsl, we've noticed on an Apple M1 Pro device that the f16 implementation of Llama2 7B models used in the WebLLM chat demo is significantly faster than the f32 implementation, with a 28% improvement in prefill speed and a 41% improvement in decoding speed.
  • A way to feed a GPUExternalTexture to the model as an input could also come handy.

beaufortfrancois avatar Mar 14 '24 13:03 beaufortfrancois
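
On the web-worker question, the general pattern is to build the pipeline once inside a module worker and post inputs/results across. A minimal sketch, reusing the feature-extraction pipeline from the earlier example rather than the demo's actual model code, with hypothetical file names worker.js / main.js:

// worker.js
import { pipeline } from '@xenova/transformers';

// Load the model once, off the main thread.
const extractorPromise = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', { device: 'webgpu' });

self.onmessage = async (event) => {
    const extractor = await extractorPromise;
    const output = await extractor(event.data, { pooling: 'mean', normalize: true });
    self.postMessage(output.tolist());
};

// main.js
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
worker.onmessage = (event) => console.log(event.data);
worker.postMessage(['That is a happy person']);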

device: 'webgpu',

For some environments it would be better as a list, because not all execution providers support all operators. For my use case, I give a list of EPs ordered by priority and let onnxruntime fall back automatically. For example: ['nnapi', 'xnnpack', 'cpu'] for Android / ['qnn', 'dml', 'xnnpack', 'cpu'] for Windows ARM64 (custom build).

hans00 avatar Mar 18 '24 09:03 hans00

UPDATE: Looks like some kernels are not supported for quant operations :/

I tested the WebGPU version on https://huggingface.co/Xenova/wav2vec2-bert-CV16-en with the v3 changes. The (quantized) model loads without errors, but running transcription throws an error with the message: An error occurred during model execution: "Error: [WebGPU] Kernel "[Split] /wav2vec2_bert/encoder/layers.0/conv_module/glu/Split" failed. Error: no GPU data for output: 0".

[E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Split node. Name:'/wav2vec2_bert/encoder/layers.0/conv_module/glu/Split' Status Message: Failed to run JSEP kernel

Is this a quantization error or an onnxruntime error?

Logs: localhost-1710758687772.log. Env: Windows, Chrome 122, Nvidia GeForce 3090

young-developer avatar Mar 18 '24 10:03 young-developer

@young-developer Thanks for the report. I will cc @guschmue for this unsupported operator. It may already be fixed in the dev branch of onnxruntime-web.

@hans00 For more advanced use-cases, you can update the session options directly with session_options: {...} in the model options.

xenova avatar Mar 18 '24 16:03 xenova
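
A minimal sketch of what that could look like, assuming session_options is forwarded more or less as-is to onnxruntime-web's InferenceSession options (the exact fields honoured may change before release):

import { pipeline } from '@xenova/transformers';

const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
    device: 'webgpu',
    session_options: {
        // Ordered list of execution providers, so the runtime can fall back
        // when an operator isn't supported by the first one.
        executionProviders: ['webgpu', 'wasm'],
        logSeverityLevel: 3, // errors only
    },
});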

FYI @xenova I was able to load the model in fp32 and got the same error. I also tried loading it in fp16, but it throws an error that the input is (float) instead of (float16), so I assume the inputs need to be converted to fp16 too.

young-developer avatar Mar 20 '24 09:03 young-developer

Exciting news 🥳 We've got Musicgen working! Example usage:

import { AutoTokenizer, MusicgenForConditionalGeneration } from '@xenova/transformers';

// Load tokenizer and model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/musicgen-small');
const model = await MusicgenForConditionalGeneration.from_pretrained(
  'Xenova/musicgen-small', { dtype: 'fp32' }
);

// Prepare text input
const prompt = '80s pop track with bassy drums and synth';
const inputs = tokenizer(prompt);

// Generate audio
const audio_values = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: true,
  guidance_scale: 3,
});

// (Optional) Write the output to a WAV file
import wavefile from 'wavefile';
import fs from 'fs';

const wav = new wavefile.WaveFile();
wav.fromScratch(1, model.config.audio_encoder.sampling_rate, '32f', audio_values.data);
fs.writeFileSync('musicgen_out.wav', wav.toBuffer());

Samples:

https://github.com/xenova/transformers.js/assets/26504141/558855d3-5f30-41fa-85c4-0c530bdabd98

https://github.com/xenova/transformers.js/assets/26504141/80e8785c-ffdf-4181-9fb0-010f69c72a06

https://github.com/xenova/transformers.js/assets/26504141/1b66801d-43c5-49cd-9ae4-fea2ca3f23d8

xenova avatar Apr 06 '24 21:04 xenova

Would it be helpful if I created an example for MusicGen? (Based on your example code, but as a small stand-alone HTML page.)

flatsiedatsie avatar Apr 07 '24 12:04 flatsiedatsie

@xenova There is a new version 1.17.3 of onnxruntime-web. I tested it with wav2vec and there is a new error, so it looks like progress 😄

young-developer avatar Apr 11 '24 12:04 young-developer

Segment Anything Encoder now works with WebGPU: up to 8x faster! (online demo)

https://github.com/xenova/transformers.js/assets/26504141/340463a5-69d9-47c6-b392-429ca9a9b205

xenova avatar Apr 21 '24 00:04 xenova

Phi-3 WebGPU support is now working! Demo: https://huggingface.co/spaces/Xenova/experimental-phi3-webgpu

https://github.com/xenova/transformers.js/assets/26504141/6c42e61b-f381-4835-bf63-f37cc752a16b

xenova avatar May 08 '24 15:05 xenova

Does anyone have a guide for how to get this bundled into a script, akin to a JSDelivr URL? Here's what I tried:

// index.js
export * from 'transformers.js'; // Adjust if the import path differs
npm install  xenova/transformers.js#v3
npm install rollup @rollup/plugin-node-resolve rollup-plugin-terser --save-dev
// rollup.config.js
import resolve from '@rollup/plugin-node-resolve';
import { terser } from 'rollup-plugin-terser';

export default {
  input: 'index.js',
  output: {
    file: 'bundle.js',
    format: 'esm',
    sourcemap: true
  },
  plugins: [
    resolve({
      browser: true,
    }),
    terser()
  ]
};

And in package.json:

"scripts": {
  "build": "rollup -c"
}

And then:

npm run build

And that produced a bundle.js, but it was looking for webgpu.proxy.min.js on jsDelivr, which doesn't exist where it was looking. I tried manually adjusting the URL in the bundle to point to the ort.webgpu.min.js file, but no luck (I also tried esm/ort.webgpu.min.js). I'm guessing there are some tricky things, due to the dynamic nature of backend loading, that bundlers struggle to automatically pick up.

@xenova Alternatively, I wonder if you'd be able to do some v3 alpha/prealpha releases via github tags so that jsdelivr picks them up? Since there's no way (IIUC) to simply reference a branch via jsdelivr (due to immutability requirement I assume).

josephrocca avatar May 11 '24 12:05 josephrocca

The latest commits add support for Moondream2, a small vision language model by @vikhyat designed to run efficiently on edge devices.

Try it out yourself with the live demo: https://huggingface.co/spaces/Xenova/experimental-moondream-webgpu

https://github.com/xenova/transformers.js/assets/26504141/933651c2-8a32-42e4-b349-57ab66cd1b47

Usage:

import { AutoProcessor, AutoTokenizer, Moondream1ForConditionalGeneration, RawImage } from '@xenova/transformers';

// Load processor, tokenizer and model
const model_id = 'Xenova/moondream2';
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const model = await Moondream1ForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'fp32'
        vision_encoder: 'fp16', // or 'q8'
        decoder_model_merged: 'q4', // or 'q4f16' or 'q8'
    },
    device: 'webgpu',
});

// Prepare text inputs
const prompt = 'Describe this image.';
const text = `<image>\n\nQuestion: ${prompt}\n\nAnswer:`;
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/vikhyatk/moondream1/resolve/main/assets/demo-1.jpg';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const output = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
});
const decoded = tokenizer.batch_decode(output, { skip_special_tokens: false });
console.log(decoded);
// [
//     '<|endoftext|><image>\n\n' +
//     'Question: Describe this image.\n\n' +
//     'Answer: A hand is holding a white book titled "The Little Book of Deep Learning" against a backdrop of a balcony with a railing and a view of a building and trees.<|endoftext|>'
// ]

xenova avatar May 17 '24 13:05 xenova

VLMs now support PKV caching. Demo: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu

https://github.com/xenova/transformers.js/assets/26504141/b8b10e8b-22c7-4942-a846-05bfa472a4e7

Example code
import { AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration, RawImage } from '@xenova/transformers';

// Load tokenizer, processor and model
const model_id = 'Xenova/nanoLLaVA';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'fp32' or 'q8'
        vision_encoder: 'fp16', // or 'fp32' or 'q8'
        decoder_model_merged: 'q4', // or 'q8'
    },
    // device: 'webgpu',
});

// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
    { role: 'system', content: 'Answer the question.' },
    { role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const { past_key_values, sequences } = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
    return_dict_in_generate: true,
});

// Decode output
const answer = tokenizer.decode(
    sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(answer);
// The text reads "Small but mighty".

const new_messages = [
    ...messages,
    { role: 'assistant', content: answer },
    { role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);

// Generate another response
const output = await model.generate({
    ...new_text_inputs,
    past_key_values,
    do_sample: false,
    max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
    output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(new_answer);
// The context of the image is that of a playful and humorous illustration of a mouse holding a weightlifting bar. The text "Small but mighty" is a playful reference to the mouse's size and strength.

xenova avatar May 18 '24 23:05 xenova

@xenova For some models, the performance may be a blocker. Since model downloads can be quite large, I wonder if there should be a way for web developers to know their machine performance class for running a model without downloading it completely first.

I believe this would involve running the model code with zeroed-out weights, which would still require buffer allocations but would allow the web app to catch out-of-memory errors and the like. The model architecture would still be needed to generate shaders, but this would be much smaller than the model weights.

Essentially, knowing the model architecture and testing with empty weights would allow for assessing performance capability without downloading the full model.

I thought I could use from_config for that but I wonder now if this should be a built-in V3 feature. What are your thoughts?

beaufortfrancois avatar Jun 04 '24 12:06 beaufortfrancois

@beaufortfrancois That would be amazing to have! Although, it's probably best suited as a feature request for onnxruntime-web. The way one could do it is to use the external data format to save models into two parts: graph-only (<1MB usually) and weights, and then initialize an empty session from the graph without loading the weights. @guschmue might have additional insights.

xenova avatar Jun 05 '24 17:06 xenova

Thank you @xenova for your support ❤️

@guschmue What are your thoughts on https://github.com/xenova/transformers.js/pull/545#issuecomment-2150547374? I'm happy to file a feature request in https://github.com/microsoft/onnxruntime

beaufortfrancois avatar Jun 07 '24 07:06 beaufortfrancois