[New Model]: Florence-2
The model to consider.
https://huggingface.co/microsoft/Florence-2-base
The closest model vllm already supports.
phi-3v, it's a VLM
What's your difficulty of supporting the model you want?
No response
@DarkLight1337 Anyone working on this?
No, but please wait for #5852 and #5276 to land first as they involve significant API changes for devs. In the meantime, you can take a look at this guide to get an idea of how to implement a new model.
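For reference, the registration step in that guide roughly follows the pattern below. This is a minimal sketch only: the Florence2ForConditionalGeneration class and its import path are placeholders for an implementation that does not exist yet.

# Minimal sketch of registering a new architecture with vLLM.
# The class name and import path below are placeholders.
from vllm import ModelRegistry
from my_models.florence2 import Florence2ForConditionalGeneration  # hypothetical

# Map the architecture string from the HF config's "architectures" field
# to the implementing class so the engine can resolve it.
ModelRegistry.register_model("Florence2ForConditionalGeneration",
                             Florence2ForConditionalGeneration)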
Thanks, checking the guide and the previous PRs of adding phi3-vision, also #5276
Both https://github.com/vllm-project/vllm/pull/5852 and https://github.com/vllm-project/vllm/pull/5276 are merged. Do you still have plans to work on this, @chandeldivyam?
@fcakyon Thanks for the reminder, it actually slipped my mind. Yes, I need Florence-2 for a project I'm working on. As an alternative for quick prototyping, I created a Flask server, but it's not the ideal solution. I will pick this up next week. Thanks!
Are you working on something that would need it?
@chandeldivyam Yes, I also need such a solution for my work. I'm trying to utilize https://github.com/Lightning-AI/LitServe since I only have a little experience with the vllm-project.
@fcakyon have you looked into any benchmarking for litserve? Also, I think using vllm would make sense if there are ton of parallel requests right?
@chandeldivyam Would be great to see florence-2 in vllm.
Hey @chandeldivyam, Is there a PR already to track the progress on Florence-2? Would be great to have Florence-2 with vllm 😀
Since there's been no update on this issue, this week I referred to the guide here and looked at how to add Phi3-vision to vLLM. I implemented the registry, but I ran into the following issue:
File "/app/vllm/entrypoints/llm.py", line 177, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/vllm/engine/llm_engine.py", line 541, in from_engine_args
engine = cls(
^^^^
File "/app/vllm/engine/llm_engine.py", line 302, in __init__
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/app/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/app/vllm/executor/gpu_executor.py", line 38, in _init_executor
self.driver_worker = self._create_worker()
^^^^^^^^^^^^^^^^^^^^^
File "/app/vllm/executor/gpu_executor.py", line 105, in _create_worker
return create_worker(**self._get_create_worker_kwargs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/vllm/executor/gpu_executor.py", line 24, in create_worker
wrapper.init_worker(**kwargs)
File "/app/vllm/worker/worker_base.py", line 449, in init_worker
self.worker = worker_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/vllm/worker/worker.py", line 101, in __init__
self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
^^^^^^^^^^^^^^^^^
File "/app/vllm/worker/enc_dec_model_runner.py", line 115, in __init__
assert_enc_dec_mr_supported_scenario(self)
File "/app/vllm/worker/utils.py", line 43, in assert_enc_dec_mr_supported_scenario
raise NotImplementedError(
NotImplementedError: Multimodal is not currently supported with encoder/decoder models.
This error indicates that the Florence2 configuration has is_encoder_decoder: true, but the current EncoderDecoderModelRunner does not support multimodal inputs. I think this will be hard to work around, and we really need this support. Can anyone give advice or suggest what to do next?
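For anyone reproducing this, the flag can be inspected straight from the Hugging Face config. A small sketch using transformers; the True value is what was observed here:

from transformers import AutoConfig

# Florence-2 ships custom modeling code, so trust_remote_code is required.
config = AutoConfig.from_pretrained("microsoft/Florence-2-base",
                                    trust_remote_code=True)
# Reported as True in this thread, which is why vLLM selects the
# encoder/decoder model runner and then hits the NotImplementedError above.
print(config.is_encoder_decoder)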
If only the language part of the model is using encoder-decoder (i.e. there is no cross-attention between text and visual features), then you can try implementing only the language part in vLLM first.
@DarkLight1337, thanks for your comment. I think I understand, and it seems feasible. Since Florence2 only uses the encoder-decoder for the language part, specifically in the Florence2LanguageModel class, I can implement the language part and the vision part (DaViT) separately, then combine them later. I just need to properly organize the roughly 2,800 lines of the original modeling_florence.py file.
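Roughly, the split could look like the sketch below. All class and method names are illustrative, not the actual vLLM interfaces:

import torch
import torch.nn as nn

# Illustrative sketch of the proposed split: a DaViT vision tower, the
# encoder-decoder Florence2LanguageModel, and a wrapper that merges
# projected image embeddings with the text embeddings.
class Florence2Sketch(nn.Module):
    def __init__(self, vision_tower: nn.Module, projector: nn.Module,
                 language_model: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower      # DaViT backbone
        self.projector = projector            # vision features -> text hidden size
        self.language_model = language_model  # Florence2LanguageModel (BART-style)

    def forward(self, input_ids: torch.Tensor, pixel_values: torch.Tensor):
        # Encode the image and project into the language embedding space.
        image_embeds = self.projector(self.vision_tower(pixel_values))
        # Prepend image embeddings to the embedded text prompt, then run the
        # encoder-decoder language model on the combined sequence.
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)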
Hey, what's the update on this one? How do I run Florence-2 using vLLM?
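It is not supported yet at this point, but once support lands, usage would presumably follow vLLM's standard multimodal offline-inference pattern. A sketch under that assumption, not something that runs today:

from vllm import LLM, SamplingParams
from PIL import Image

# Sketch only: assumes Florence-2 support has landed and that it uses the
# same multimodal input interface as other VLMs in vLLM.
llm = LLM(model="microsoft/Florence-2-base", trust_remote_code=True)
image = Image.open("example.jpg")
outputs = llm.generate(
    {
        "prompt": "<CAPTION>",  # one of Florence-2's task prompts
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)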
+1
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Hey all, any progress on this?
cc @Isotr0py
Oh, I totally forgot about this... 😅 Let me port the ViT for the Florence models to finish this.
wow @Isotr0py @DarkLight1337 thank you for such a fast reaction