[Model] Support NVLM-D
Implement the NVLM-D model.
FIX #9040 FIX #9041
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the ready label to the PR
- Enable auto-merge.
🚀
I've fixed the errors up to, but not including, merging the multimodal embeddings. We probably need additional logic to handle tile tagging.
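For context, NVLM-D precedes each image tile's placeholder tokens with a text-based tile tag. A minimal sketch of that interleaving, assuming the InternVL-style <IMG_CONTEXT> placeholder and a fixed number of context tokens per tile (this helper is illustrative, not the code in this PR):

```python
IMG_CONTEXT = "<IMG_CONTEXT>"

def build_image_prompt(num_local_tiles: int, tokens_per_tile: int) -> str:
    """Interleave NVLM-D style tile tags with per-tile placeholder tokens."""
    parts = []
    for i in range(1, num_local_tiles + 1):
        # Each high-resolution tile is preceded by a numbered text tag.
        parts.append(f"<tile_{i}>" + IMG_CONTEXT * tokens_per_tile)
    # The downscaled global view gets its own tag (ordering assumed here).
    parts.append("<tile_global_thumbnail>" + IMG_CONTEXT * tokens_per_tile)
    return "".join(parts)

# e.g. 6 local tiles plus the thumbnail, 256 context tokens per tile
print(build_image_prompt(num_local_tiles=6, tokens_per_tile=256)[:60])
```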
(EDIT: this was resolved by the latest commits.) I had a failure when trying to load the model weights:
vllm serve nvidia/NVLM-D-72B --tensor-parallel-size 4
...
File "/home/mgoin/code/vllm/vllm/model_executor/model_loader/loader.py", line 403, in load_model
model.load_weights(self._get_all_weights(model_config, model))
File "/home/mgoin/code/vllm/vllm/model_executor/models/internvl.py", line 564, in load_weights
self.vision_model.load_weights(weights_group["vision_model"])
File "/home/mgoin/code/vllm/vllm/model_executor/models/intern_vit.py", line 366, in load_weights
weight_loader(param, loaded_weight)
File "/home/mgoin/code/vllm/vllm/model_executor/model_loader/weight_utils.py", line 537, in default_weight_loader
assert param.size() == loaded_weight.size(), (
AssertionError: Attempted to load weight (torch.Size([12288, 3200])) into parameter (torch.Size([9600, 3200]))
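A shape mismatch like this usually means the checkpoint lays out a parameter differently from what the vLLM module expects (for example, fused vs. split attention projections). A hedged diagnostic sketch for spotting which keys differ before writing a load_weights remap (this function is illustrative and not part of the PR):

```python
import torch
from safetensors import safe_open

def report_shape_mismatches(model: torch.nn.Module, st_path: str) -> None:
    """Print checkpoint keys whose shapes differ from the model's parameters."""
    params = dict(model.named_parameters())
    with safe_open(st_path, framework="pt") as f:
        for key in f.keys():
            ckpt_shape = tuple(f.get_slice(key).get_shape())
            if key in params and tuple(params[key].shape) != ckpt_shape:
                print(f"{key}: checkpoint {ckpt_shape} vs model {tuple(params[key].shape)}")
```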
(UPDATE) Now I see an error during model initialization:
vllm serve nvidia/NVLM-D-72B --tensor-parallel-size 4 --enforce-eager --max-num-seqs 16
...
Traceback (most recent call last):
File "/home/mgoin/code/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
File "/home/mgoin/code/vllm/vllm/worker/model_runner.py", line 1644, in execute_model
hidden_or_intermediate_states = model_executable(
File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mgoin/code/vllm/vllm/model_executor/models/internvl.py", line 533, in forward
inputs_embeds = merge_multimodal_embeddings(
File "/home/mgoin/code/vllm/vllm/model_executor/models/utils.py", line 169, in merge_multimodal_embeddings
mask = (input_ids == placeholder_token_id)
RuntimeError: The size of tensor a (98304) must match the size of tensor b (4) at non-singleton dimension 0
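For reference, a conceptual sketch of what merge_multimodal_embeddings does: it builds a boolean mask over the flattened input_ids and scatters the flattened vision embeddings into those positions. The error above comes from the comparison receiving a tensor of image token IDs (size 4) instead of a single placeholder token ID. This is a simplified illustration, not the exact vLLM implementation:

```python
import torch

def merge_multimodal_embeddings(input_ids: torch.Tensor,
                                inputs_embeds: torch.Tensor,
                                vision_embeds: torch.Tensor,
                                placeholder_token_id: int) -> torch.Tensor:
    # placeholder_token_id must be a scalar; passing a tensor of image token
    # IDs here produces exactly the size mismatch seen in the traceback.
    mask = input_ids == placeholder_token_id
    flat = vision_embeds.reshape(-1, inputs_embeds.shape[-1])
    # Every placeholder position must receive exactly one vision embedding.
    assert int(mask.sum()) == flat.shape[0]
    inputs_embeds[mask] = flat.to(dtype=inputs_embeds.dtype)
    return inputs_embeds
```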
To aid debugging, I made an FP8 model checkpoint: https://huggingface.co/nm-testing/NVLM-D-72B-FP8-dynamic
OK I'm able to use the model in online serving now. The outputs seem reasonable.
Yup it sounds reasonable to me 😸 Nice work!
vllm serve nm-testing/NVLM-D-72B-FP8-dynamic --tensor-parallel-size 4 --enforce-eager --max-num-seqs 16
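For anyone who wants to try the online path, a hedged example of querying the OpenAI-compatible endpoint started by that command (the image URL and sampling settings below are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="nm-testing/NVLM-D-72B-FP8-dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL: swap in any publicly reachable image.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```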
Now I just have to set up the offline examples...
@DarkLight1337 I plugged it into the existing run_internvl example within offline_inference_vision_language.py
This was the output, which seems reasonable:
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.33it/s, est. speed input: 4667.61 toks/s, output: 84.87 toks/s]
The image features a tall, slender structure that resembles a communications tower, partially obscured by the branches and blossoms of cherry trees in full bloom. The structure has a white and light blue color scheme and is topped with an antenna. The cherry blossoms, with their delicate pink flowers, frame the tower, creating a picturesque
The image features a tall, white tower with a distinctive design, partially obscured by cherry blossom trees in full bloom. The tower is likely a telecommunications or observation tower, characterized by its lattice structure and observation deck near the top. The cherry blossoms, with their delicate pink flowers, frame the tower, creating a picturesque scene
The image features a tall, white tower with a distinctive design, partially obscured by cherry blossom trees in full bloom. The cherry blossoms, with their delicate pink flowers, create a beautiful contrast against the blue sky. The tower's structure is intricate, with a combination of straight and curved lines, and it appears to be
The image shows a tall building with a spire, surrounded by cherry blossom trees in full bloom. The building is white and has a modern architectural style, with a distinctive spire that tapers off at the top. The cherry blossom trees are in the foreground, with their pink and white flowers creating a beautiful contrast against
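For reference, a minimal offline sketch along the lines of that run_internvl example (the max_model_len, TP size, sampling settings, and chat-template usage are assumptions for illustration, not the exact values in the PR):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "nvidia/NVLM-D-72B"
llm = LLM(model=model_name,
          trust_remote_code=True,
          max_model_len=8192,        # assumed; tune for your hardware
          tensor_parallel_size=4)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

question = "What is the content of this image?"
messages = [{"role": "user", "content": f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```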
That is odd; I am getting completely nonsensical results on my end.
@mgoin Can you check whether the multi-image example also works?
If I set num_prompts=1 then I don't get this problem.
It seems to be an issue on the machine I am using to test the model. I can't run any models with both TP>1 and max_num_seqs>1 there.
Update: Thanks @ywang96 for helping test this!
Hi, I'm not able to build the new vLLM. Has anyone tried building this PR from source?
What error are you running into specifically?
It shows a problem with numpy not being installed, even though I have already completed the installation. I used pip install -e .
Does this also happen on the main branch? This sounds similar to #8851.
Yeah, I think it's the same issue, but I'm using Python 3.10 with numpy 1.26.4 on Ubuntu. I have already read #8851 but haven't seen a solution there yet.
I suggest you provide more details in that issue then, since it's not specific to this PR.
Ah yes, I will do that. Thanks!
When I use the code from this branch, the host memory used before the model finishes loading onto the GPUs is quite large; it costs me more than 1000 GiB of RAM. Has anyone else hit this problem, or should I pass any special config to vllm serve? For those who can load it without any problems, could you share some information about your server setup?
I get a similar issue on my end. The memory usage scales with the number of GPUs used; when loading onto 4 GPUs, it only uses around 500 GiB of CPU memory. That being said, I haven't tried loading such a large model in vLLM before, so I'm not sure whether this is normal.
I believe there is some problem, since there is no clear explanation for why the required memory is so high. I experienced this with Meta Llama 3.2 Vision too, but as I remember I had to set some parameters to fix it. Maybe the issue is with the vision part.
@youkaichao is this something to be expected when using TP/PP?
Usually, if the model is in safetensors format, the memory cost will not increase when using TP/PP. I'm not sure if it is related to https://github.com/vllm-project/vllm/pull/9160
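The reason safetensors usually keeps host memory flat is that the file can be memory-mapped and each rank materializes only the slices it needs. A toy illustration of that partial loading (this is not vLLM's actual loader; the path, key, and sharding scheme are placeholders):

```python
from safetensors import safe_open

def load_weight_shard(path: str, key: str, rank: int, world_size: int):
    """Read only this rank's shard (along dim 0) of a tensor from a safetensors file."""
    with safe_open(path, framework="pt") as f:
        tensor_slice = f.get_slice(key)
        rows = tensor_slice.get_shape()[0]
        shard = rows // world_size
        # Only this slice is materialized in host memory.
        return tensor_slice[rank * shard:(rank + 1) * shard]

# weight = load_weight_shard("path/to/shard.safetensors",
#                            "model.layers.0.mlp.gate_proj.weight",
#                            rank=0, world_size=4)
```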
@anonymousz97 can you try out #9160 and see if it helps reduce the memory usage?
When is the next release that incorporates this PR expected? I don't want to pull a nightly version.
We normally release an update around every 2 weeks, so the next release should be soon. Also see #9200
I can now load it without any problems after trying out #9160. Thanks, confirmed working! @DarkLight1337