transformers.js
Error using Xenova/nanoLLaVA in pipeline
System Info
Using:
- Node v21.7.1
- Mac M1
Environment/Platform
- [X] Website/web-app
- [X] Server-side (e.g., Node.js, Deno, Bun)
Description
https://huggingface.co/Xenova/nanoLLaVA
The new nanoLLaVA model threw this error:
Unknown model class "llava", attempting to construct from base class.
Model type for 'llava' not found, assuming encoder-only architecture.
Error: Could not locate file: "https://huggingface.co/Xenova/nanoLLaVA/resolve/main/onnx/model_quantized.onnx".
Reproduction
Use Xenova/nanoLLaVA like this:
const featureExtractor = await transformers.pipeline('image-feature-extraction', 'Xenova/nanoLLaVA')
package.json
"@xenova/transformers": "^2.17.1",
I appreciate your enthusiasm for testing the model out, since I only added it a few hours ago... but I'm still adding support for it to the library! I will let you know when it is supported.
Brilliant, thank you very much!
I'm closely watching this feature, and if you link a PR for it, I can learn from the work and help maintain the code!
You can follow along in the v3 branch: https://github.com/xenova/transformers.js/pull/545
Here's some example code which should work:
import { AutoTokenizer, AutoProcessor, RawImage, LlavaForConditionalGeneration } from '@xenova/transformers';

// Load tokenizer, processor and model
const model_id = 'Xenova/nanoLLaVA';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16',
        vision_encoder: 'q8', // or 'fp16'
        decoder_model_merged: 'q4', // or 'q8'
    },
});

// Prepare text inputs
const prompt = 'Describe this image in detail';
const messages = [
    { role: 'user', content: `<image>\n${prompt}` },
];
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text, { padding: true });

// Prepare vision inputs
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const inputs = { ...text_inputs, ...vision_inputs };
const output = await model.generate({
    ...inputs,
    do_sample: false,
    max_new_tokens: 64,
});

// Decode output
const decoded = tokenizer.batch_decode(output, { skip_special_tokens: false });
console.log('decoded', decoded);
Note that this may change in the future, and I'll update the model card once I've done some more testing.
The model card has been updated with example code 👍 https://huggingface.co/Xenova/nanoLLaVA
We also put an online demo out for you to try: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu
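If you want to try something similar locally on the GPU, a rough sketch (assuming the v3 branch, which adds a device option; this is not part of the released 2.x API) is to pass device: 'webgpu' when loading the model:

import { LlavaForConditionalGeneration } from '@xenova/transformers';

// Sketch only: assumes the v3 branch with WebGPU support.
const model = await LlavaForConditionalGeneration.from_pretrained('Xenova/nanoLLaVA', {
    dtype: {
        embed_tokens: 'fp16',
        vision_encoder: 'fp16',
        decoder_model_merged: 'q4',
    },
    device: 'webgpu', // run the ONNX sessions on the GPU via WebGPU
});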
Example videos:
https://github.com/xenova/transformers.js/assets/26504141/3f70437a-8943-44e4-87f0-795df90327f2
https://github.com/xenova/transformers.js/assets/26504141/10c0b4c1-2738-4dbc-ad2f-115f7248dd84