
Support Janus model

eaidova opened this issue 10 months ago • 7 comments

What does this PR do?

Conversion requires a fix on the optimum side: https://github.com/huggingface/optimum/pull/2179

Model loading requires installing the janus package before conversion (and for preprocessing):

pip install "git+https://github.com/deepseek-ai/Janus.git"
from io import BytesIO
from pathlib import Path

import requests
from janus.models import VLChatProcessor
from PIL import Image
from transformers import TextStreamer

from optimum.intel.openvino import OVModelForVisualCausalLM

model_id = "deepseek-ai/Janus-Pro-1B"

# convert the original checkpoint to OpenVINO (if needed) and load it
model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)

# the Janus processor handles chat templating and image preprocessing
processor = VLChatProcessor.from_pretrained(model_id)

Multimodal understanding

input_prompt = "Describe image in details"
image_path = Path("cat_in_box.png")

# download the sample image on the first run
if not image_path.exists():
    response = requests.get(
        "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
    )
    image = Image.open(BytesIO(response.content)).convert("RGB")
    image.save(image_path)

image = Image.open(image_path)

inputs = model.preprocess_inputs(input_prompt, image, processor)
# stream the decoded answer as it is generated, hiding the prompt and special tokens
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(**inputs, streamer=streamer, max_new_tokens=100, do_sample=False)

Answer:

The image shows a gray tabby cat lying inside an open cardboard box on a light-colored carpet. The cat is lying on its back with its belly exposed, legs up in the air, and its tail curled around its body. The background includes a beige couch and a bright, airy room with natural light streaming in, creating a cozy and relaxed atmosphere.

Text-to-image generation

image_gen_prompt = "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors."

# parallel_size controls how many images are generated in a single call
images = model.generate_image(processor, image_gen_prompt, parallel_size=1)

images[0].save("fox.png")

Generated image: fox.png

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you make sure to update the documentation with your changes?
  • [ ] Did you write any new necessary tests?

eaidova · Feb 04 '25

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

I wonder why we need to keep the VLChatProcessor instance outside the model class, and whether we can move it inside?

AlexKoff88 · Feb 06 '25

I wonder why we need to keep the VLChatProcessor instance outside the model class, and whether we can move it inside?

Not sure that I understand your question. This is the standard preprocessing/postprocessing component for transformers-based models (like tokenizers, feature extractors, image processors, etc.); it is usually an independent object (except in the diffusers case). For VLM models it may be helpful to move it closer to the model, as preprocessing becomes more complicated and more tightly coupled. So possibly we can consider keeping processors inside for other models as well (it may help align the result of save_pretrained with optimum-cli, which also saves processors and tokenizers when they are available).

eaidova · Feb 06 '25

model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)

processor = VLChatProcessor.from_pretrained(model_id)

To clarify, I am just looking at the code in the PR description and wondering why it could not look like this:

model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True)
...
inputs = model.preprocess_inputs(input_prompt, image)
streamer = TextStreamer(model.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, streamer=streamer, max_new_tokens=100, do_sample=False)
...
images = model.generate_image(image_gen_prompt, parallel_size=1)

So, the processor is loaded inside the model and hidden from the user, but it can still be accessed via model.processor.

But from what I understand, your implementation is aligned with diffusers, right?
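
For illustration, a minimal sketch of what this proposal could look like (the helper name load_model_with_processor and the model.processor attribute are assumptions for illustration, not the PR's actual API):

from janus.models import VLChatProcessor

from optimum.intel.openvino import OVModelForVisualCausalLM

def load_model_with_processor(model_id, **kwargs):
    # hypothetical helper: load the processor together with the model and
    # attach it, so callers no longer manage it as a separate object
    model = OVModelForVisualCausalLM.from_pretrained(model_id, **kwargs)
    model.processor = VLChatProcessor.from_pretrained(model_id)
    return model

model = load_model_with_processor("deepseek-ai/Janus-Pro-1B", trust_remote_code=True)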

AlexKoff88 · Feb 06 '25

@IlyasMoutawwakil @echarlaix could you please take a look?

eaidova · Feb 12 '25

Apologies for the delay @eaidova!! We need to make sure we are compatible with https://github.com/huggingface/transformers/pull/36053 before merging so that we don't have any issues when transformers v4.50 is out.

echarlaix · Feb 26 '25

Apologies for the delay @eaidova!! We need to make sure we are compatible with huggingface/transformers#36053 before merging so that we don't have any issues when transformers v4.50 is out.

I carefully examined the referenced PR. It seems that it requires reconverting the model from the original Janus modeling to the transformers implementation (the original models provided by DeepSeek will not work in transformers out of the box, and dedicated converted models will probably be uploaded separately after the conversion script lands). My PR works with the original Janus models, so there should be no issues, because the model_type in the configs differs from what we use. Enabling these transformers-compatible models will require additional effort to add conversion configs and a model class for inference, but that is not a blocker for this PR and can be supported once the transformers PR is merged.
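
For reference, the model_type difference can be checked directly from the checkpoint configs (a sketch reading the raw config file, since the original checkpoints rely on remote code):

import json

from huggingface_hub import hf_hub_download

# the original DeepSeek checkpoints declare a different model_type than the
# transformers-converted community repos, so the two export paths do not collide
cfg_path = hf_hub_download("deepseek-ai/Janus-Pro-1B", "config.json")
with open(cfg_path) as f:
    print(json.load(f)["model_type"])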

So my suggestion is to resolve the conflicts and merge this PR if there are no additional issues.

eaidova · Mar 27 '25

@IlyasMoutawwakil could you please take a look? Tests failed due to a max-retries issue, not related to the added model.

eaidova · Mar 31 '25

@echarlaix, I understand your position regarding remote code and agree with it, but I would also like to note that besides the long wait for review, it brings a lot of inconvenience for users:

  1. the code in transformers is compatible only with the community model versions converted from the original checkpoints (https://huggingface.co/deepseek-community), while the original models (https://huggingface.co/collections/deepseek-ai/janus-6711d145e2b73d369adfd3cc) still exist, work only with remote code, and would require reconversion
  2. it is not possible to support a released transformers version without using features that are currently available only on the transformers main branch (no released package supports them; I am speaking about the changes to generate method parameters for images, so it will be hard to support by just copying and overriding transformers methods). Extra time would be required to wait for the next transformers release, not to mention how long this PR has already been under review.

From my point of view, it would be better to integrate the remote-code model for now and deprecate it in the next release (once a transformers version that officially supports Janus is released) in favor of the official transformers code (models live their own independent lives on the Hugging Face Hub anyway).

eaidova · Apr 24 '25

@eaidova I agree with your point: waiting for the transformers integration + release could delay this, and it would only be compatible with the subset of models covered by that integration. In my opinion we should stop adding any official new export to optimum-intel for model-specific modeling (needing trust_remote_code=True) that requires a modeling patch, as many issues could arise for these exports that we have no control over.

For the cases where we do need it, what do you think about adding it in an example export script instead, specifying which model types the export targets? Another option could be to push it to a dedicated branch; what do you think of that option? Also, concerning Janus support, it is already available through the v4.51.3-Janus-preview tag, which we can use for testing, and we can add official support once v4.52 is released (~2 weeks). This PR has been open for a while, apologies for that. Let me know if you'd like me to take care of the Janus export using the v4.51.3-Janus-preview tag!
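
For reference, installing transformers from that preview tag uses standard pip VCS syntax (the exact command is not from the thread, shown here as a sketch):

pip install "git+https://github.com/huggingface/transformers.git@v4.51.3-Janus-preview"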

echarlaix · Apr 25 '25

@echarlaix the problem is not in testing; the problem is in importing functionality that has not been officially released. It makes the code messy and requires adding a lot of availability checks (a tag does not create a specific transformers package version, and even a 4.52.0.dev version does not guarantee that the code is available).
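
A minimal sketch of the kind of availability guard this forces (JanusForConditionalGeneration follows the class name in the transformers integration and should be treated as an assumption):

# the import itself has to be probed: neither a version check nor a dev
# version string reliably indicates that the Janus code is present
try:
    from transformers import JanusForConditionalGeneration  # assumed class name

    _TRANSFORMERS_JANUS_AVAILABLE = True
except ImportError:
    _TRANSFORMERS_JANUS_AVAILABLE = False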

Integrating the Janus model into transformers brings a lot of changes around the generate method, which is a shared transformers component. I do not want to waste your time or mine by making this PR larger, and I cannot predict how long the next round of review would take....

eaidova · Apr 25 '25