[Model] Support NVLM-D
Implement the NVLM-D model.
FIX #9040 FIX #9041
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the ready label to the PR
- Enable auto-merge.
🚀
I've fixed the errors up to, but not including, merging the multimodal embeddings. We probably need additional logic to handle tile tagging.
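For context, NVLM-D precedes each image tile's placeholder tokens with a text-based tile tag. A minimal sketch of that interleaving, assuming the InternVL-style <IMG_CONTEXT> placeholder and a fixed number of context tokens per tile (this helper is illustrative, not the code in this PR):

```python
IMG_CONTEXT = "<IMG_CONTEXT>"

def build_image_prompt(num_local_tiles: int, tokens_per_tile: int) -> str:
    """Interleave NVLM-D style tile tags with per-tile placeholder tokens."""
    parts = []
    for i in range(1, num_local_tiles + 1):
        # Each high-resolution tile is preceded by a numbered text tag.
        parts.append(f"<tile_{i}>" + IMG_CONTEXT * tokens_per_tile)
    # The downscaled global view gets its own tag (ordering assumed here).
    parts.append("<tile_global_thumbnail>" + IMG_CONTEXT * tokens_per_tile)
    return "".join(parts)

# e.g. 6 local tiles plus the thumbnail, 256 context tokens per tile
print(build_image_prompt(num_local_tiles=6, tokens_per_tile=256)[:60])
```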
(EDIT: this was resolved by the latest commits.) I had a failure when trying to load the model weights:
vllm serve nvidia/NVLM-D-72B --tensor-parallel-size 4
...
File "/home/mgoin/code/vllm/vllm/model_executor/model_loader/loader.py", line 403, in load_model
model.load_weights(self._get_all_weights(model_config, model))
File "/home/mgoin/code/vllm/vllm/model_executor/models/internvl.py", line 564, in load_weights
self.vision_model.load_weights(weights_group["vision_model"])
File "/home/mgoin/code/vllm/vllm/model_executor/models/intern_vit.py", line 366, in load_weights
weight_loader(param, loaded_weight)
File "/home/mgoin/code/vllm/vllm/model_executor/model_loader/weight_utils.py", line 537, in default_weight_loader
assert param.size() == loaded_weight.size(), (
AssertionError: Attempted to load weight (torch.Size([12288, 3200])) into parameter (torch.Size([9600, 3200]))
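A shape mismatch like this usually means the checkpoint lays out a parameter differently from what the vLLM module expects (for example, fused vs. split attention projections). A hedged diagnostic sketch for spotting which keys differ before writing a load_weights remap (this function is illustrative and not part of the PR):

```python
import torch
from safetensors import safe_open

def report_shape_mismatches(model: torch.nn.Module, st_path: str) -> None:
    """Print checkpoint keys whose shapes differ from the model's parameters."""
    params = dict(model.named_parameters())
    with safe_open(st_path, framework="pt") as f:
        for key in f.keys():
            ckpt_shape = tuple(f.get_slice(key).get_shape())
            if key in params and tuple(params[key].shape) != ckpt_shape:
                print(f"{key}: checkpoint {ckpt_shape} vs model {tuple(params[key].shape)}")
```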
(UPDATE) Now I see an error during model initialization:
vllm serve nvidia/NVLM-D-72B --tensor-parallel-size 4 --enforce-eager --max-num-seqs 16
...
Traceback (most recent call last):
File "/home/mgoin/code/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
File "/home/mgoin/code/vllm/vllm/worker/model_runner.py", line 1644, in execute_model
hidden_or_intermediate_states = model_executable(
File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mgoin/code/vllm/vllm/model_executor/models/internvl.py", line 533, in forward
inputs_embeds = merge_multimodal_embeddings(
File "/home/mgoin/code/vllm/vllm/model_executor/models/utils.py", line 169, in merge_multimodal_embeddings
mask = (input_ids == placeholder_token_id)
RuntimeError: The size of tensor a (98304) must match the size of tensor b (4) at non-singleton dimension 0
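For reference, a conceptual sketch of what merge_multimodal_embeddings does: it builds a boolean mask over the flattened input_ids and scatters the flattened vision embeddings into those positions. The error above comes from the comparison receiving a tensor of image token IDs (size 4) instead of a single placeholder token ID. This is a simplified illustration, not the exact vLLM implementation:

```python
import torch

def merge_multimodal_embeddings(input_ids: torch.Tensor,
                                inputs_embeds: torch.Tensor,
                                vision_embeds: torch.Tensor,
                                placeholder_token_id: int) -> torch.Tensor:
    # placeholder_token_id must be a scalar; passing a tensor of image token
    # IDs here produces exactly the size mismatch seen in the traceback.
    mask = input_ids == placeholder_token_id
    flat = vision_embeds.reshape(-1, inputs_embeds.shape[-1])
    # Every placeholder position must receive exactly one vision embedding.
    assert int(mask.sum()) == flat.shape[0]
    inputs_embeds[mask] = flat.to(dtype=inputs_embeds.dtype)
    return inputs_embeds
```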
To aid debugging, I made an FP8 model checkpoint: https://huggingface.co/nm-testing/NVLM-D-72B-FP8-dynamic
OK I'm able to use the model in online serving now. The outputs seem reasonable.
Yup it sounds reasonable to me 😸 Nice work!
vllm serve nm-testing/NVLM-D-72B-FP8-dynamic --tensor-parallel-size 4 --enforce-eager --max-num-seqs 16
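For anyone who wants to try the online path, a hedged example of querying the OpenAI-compatible endpoint started by that command (the image URL and sampling settings below are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="nm-testing/NVLM-D-72B-FP8-dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL: swap in any publicly reachable image.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```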
Now I just have to set up the offline examples...
@DarkLight1337 I plugged it into the existing run_internvl example within offline_inference_vision_language.py
This was the output, which seems reasonable:
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.33it/s, est. speed input: 4667.61 toks/s, output: 84.87 toks/s]
The image features a tall, slender structure that resembles a communications tower, partially obscured by the branches and blossoms of cherry trees in full bloom. The structure has a white and light blue color scheme and is topped with an antenna. The cherry blossoms, with their delicate pink flowers, frame the tower, creating a picturesque
The image features a tall, white tower with a distinctive design, partially obscured by cherry blossom trees in full bloom. The tower is likely a telecommunications or observation tower, characterized by its lattice structure and observation deck near the top. The cherry blossoms, with their delicate pink flowers, frame the tower, creating a picturesque scene
The image features a tall, white tower with a distinctive design, partially obscured by cherry blossom trees in full bloom. The cherry blossoms, with their delicate pink flowers, create a beautiful contrast against the blue sky. The tower's structure is intricate, with a combination of straight and curved lines, and it appears to be
The image shows a tall building with a spire, surrounded by cherry blossom trees in full bloom. The building is white and has a modern architectural style, with a distinctive spire that tapers off at the top. The cherry blossom trees are in the foreground, with their pink and white flowers creating a beautiful contrast against
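For reference, a minimal offline sketch along the lines of that run_internvl example (the max_model_len, TP size, sampling settings, and chat-template usage are assumptions for illustration, not the exact values in the PR):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "nvidia/NVLM-D-72B"
llm = LLM(model=model_name,
          trust_remote_code=True,
          max_model_len=8192,        # assumed; tune for your hardware
          tensor_parallel_size=4)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

question = "What is the content of this image?"
messages = [{"role": "user", "content": f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```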
That is odd; I am getting completely nonsensical results on my end.
@mgoin Can you check whether the multi-image example also works?
If I set num_prompts=1 then I don't get this problem.
It seems to be an issue on the machine I am using to test the model. I can't run any models with both TP>1 and max_num_seqs>1 there.
Update: Thanks @ywang96 for helping test this!
Hi, I'm not able to build the new vLLM. Has anyone tried building this PR from source?
What error are you running into specifically?
It shows a problem with numpy not being installed, even though I have already completed the installation. I used pip install -e .
Does this also happen on the main branch? This sounds similar to #8851.
Yeah, I think it's the same issue, but I'm using Python 3.10 with numpy 1.26.4 on Ubuntu. I have already read #8851 but haven't seen a solution there yet.
I suggest you provide more details in that issue then, since it's not specific to this PR.
Ah yes, I will do that. Thanks!
When I use the code from this branch, the host memory used before the model finishes loading onto the GPUs is quite large; it costs me more than 1000 GiB of RAM. Has anyone else hit this problem, or should I pass any special config to vllm serve? For those who can load it without any problems, could you share some information about your server setup?
I get a similar issue on my end. The memory usage scales with the number of GPUs used; when loading onto 4 GPUs, it only uses around 500 GiB of CPU memory. That being said, I haven't tried loading such a large model in vLLM before, so I'm not sure whether this is normal.
I believe there is some problem, since there is no clear explanation for why the required memory is so high. I experienced this with Meta Llama 3.2 Vision too, but as I remember I had to set some parameters to fix it. Maybe the issue is with the vision part.
@youkaichao is this something to be expected when using TP/PP?
Usually, if the model is in safetensors format, the memory cost will not increase when using TP/PP. I'm not sure if it is related to https://github.com/vllm-project/vllm/pull/9160
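The reason safetensors usually keeps host memory flat is that the file can be memory-mapped and each rank materializes only the slices it needs. A toy illustration of that partial loading (this is not vLLM's actual loader; the path, key, and sharding scheme are placeholders):

```python
from safetensors import safe_open

def load_weight_shard(path: str, key: str, rank: int, world_size: int):
    """Read only this rank's shard (along dim 0) of a tensor from a safetensors file."""
    with safe_open(path, framework="pt") as f:
        tensor_slice = f.get_slice(key)
        rows = tensor_slice.get_shape()[0]
        shard = rows // world_size
        # Only this slice is materialized in host memory.
        return tensor_slice[rank * shard:(rank + 1) * shard]

# weight = load_weight_shard("path/to/shard.safetensors",
#                            "model.layers.0.mlp.gate_proj.weight",
#                            rank=0, world_size=4)
```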
@anonymousz97 can you try out #9160 and see if it helps reduce the memory usage?
When is the next release that incorporates this PR expected? I don't want to pull a nightly version.
We normally release an update around every 2 weeks, so the next release should be soon. Also see #9200
I can now load it without any problems after trying out #9160. Thanks, confirmed working! @DarkLight1337