Ask-Anything
Issues with Running video_chat2 on Multi-GPU Setup with Nvidia Titan Xp
Hi,
I'm currently attempting to run the video_chat2 model on a multi-GPU setup consisting of 8 Nvidia Titan Xp GPUs, each with 12GiB of memory. I'm using the mvbench.ipynb notebook from the Ask-Anything/video_chat2 repository for this purpose.
To ensure the model loads on my GPUs, I've enabled the low_resource option in config.json. Additionally, I've specified device_map="auto" during the initialization of the llama_model in videochat2_it.py. The relevant code snippet is as follows:
if self.low_resource:
    self.llama_model = LlamaForCausalLM.from_pretrained(
        llama_model_path,
        load_in_8bit=True,
        device_map="auto",
        torch_dtype=torch.float16,
    )
However, when I execute the code, I encounter multiple errors originating from the following lines:
seg_embs = [model.llama_model.base_model.model.model.embed_tokens(seg_t).cpu() for seg_t in seg_tokens] # get_context_emb
outputs = model.llama_model.generate()
Could you provide some guidance or suggestions on how to effectively perform inference with sharded models in this multi-GPU environment?
Thank you for your incredible work.
Thank you for your interest in our work. Could you provide more error information? We haven't attempted to load the model in shards before. We've successfully run it on a graphics card with at least 16GB of VRAM. We're happy to research this issue together if you can provide the error information from these lines.
I'm not sure whether all of these errors are reproducible in the unmodified repository, because I've modified demo.py and mvbench.py to run inference without gradio. The changes are not significant, but for reference I'm attaching my code file: infer.txt
In any case, enabling low_resource does return multiple device-mismatch errors inside llama_model. The locations of the issue are listed below; a few more locations may show up beyond these examples.
- https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/conversation.py#L237-L239
- https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/models/blip2/modeling_llama.py#L74
- https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/models/blip2/modeling_llama.py#L275
Here is one of the errors I get when the low_resource flag is enabled.
Traceback (most recent call last):
File "infer.py", line 88, in <module>
main_results = main()
File "infer.py", line 78, in main
results = ask_questions(chat, chat_state, img_list, questions)
File "infer.py", line 59, in ask_questions
llm_message, _, chat_state = chat.answer(conv=chat_state, img_list=img_list, max_new_tokens=1000, num_beams=num_beams, temperature=temperature)
File "/data1/doyi/Ask-Anything/video_chat2/conversation.py", line 65, in answer
outputs = self.model.llama_model.generate(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/peft/peft_model.py", line 731, in generate
outputs = self.base_model.generate(**kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/generation/utils.py", line 1525, in generate
return self.sample(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/generation/utils.py", line 2622, in sample
outputs = self(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 674, in forward
outputs = self.model(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 563, in forward
layer_outputs = decoder_layer(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 273, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 178, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/peft/tuners/lora.py", line 710, in forward
self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
I found a way to perform inference using about 9GiB of GPU memory by enabling low_resource. Here is how I modified the code:
- changing "cuda:0" to self.device at https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/conversation.py#L233
- changing torch_dtype to bfloat16:
self.llama_model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
)
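For the embedding call in get_context_emb (the line quoted at the top of this issue), a device-agnostic version could look roughly like this. This is only a sketch; the exact attribute path depends on which LoRA/8-bit wrappers are active:
# Sketch: look up the embedding table once and move each token segment onto
# the device where that table actually lives, instead of assuming one GPU.
embed_tokens = model.llama_model.base_model.model.model.embed_tokens
seg_embs = [
    embed_tokens(seg_t.to(embed_tokens.weight.device)).cpu()
    for seg_t in seg_tokens
]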
What I'm struggling with now is that I want to run inference without low_resource's int8 model loading, using llama-7b.
self.llama_model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    device_map={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB", 4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"},
)
After setting device_map this way, from_pretrained raises a ValueError:
model = VideoChat2_it(config=cfg.model)
File "/data1/doyi/Ask-Anything/video_chat2/models/videochat2_it.py", line 142, in __init__
self.llama_model = LlamaForCausalLM.from_pretrained(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
) = cls._load_pretrained_model(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 797, in _load_state_dict_into_meta_model
raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: model.embed_tokens.weight doesn't have any device set.
@ddoron9 For the first error, I think this may be caused by the hard-coding in https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/conversation.py#L233
Changing cuda:0 to self.device may solve this problem, like this:
https://github.com/OpenGVLab/Ask-Anything/blob/d57c30fb86d41e510ba6b8fbeb5a532f3f3aaa69/video_chat2/conversation.py#L233
Please make this fix and try again.
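A rough illustration of the kind of change being suggested (paraphrased, not the exact repository line; tokenized_segment is a placeholder name):
# Before: the token ids are pinned to the first GPU
seg_t = tokenized_segment.to("cuda:0")
# After: follow whatever device this Chat/model instance was built with
seg_t = tokenized_segment.to(self.device)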
@ddoron9 For the second question, the ValueError indicates that model.embed_tokens.weight doesn't have a device set, suggesting that during the model loading process certain parameters (such as embed_tokens.weight) aren't assigned to any device.
I'm not very familiar with this, but passing device_map={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB", 4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"} doesn't seem to work; that argument is not where you put hardware information for your devices. From my research on huggingface, the keys of this map should be parameter/module names from llama_model, and the values should indicate the device where you intend to place them. You should make sure that every part of the model has a designated device in the device_map, which may involve examining the structure of the LlamaForCausalLM model and how its parameters are initialized and loaded. For example, you can set
{'embed_tokens.weight': 0,
 'embed_tokens.bias': 1,
 'encoder': "cpu"}
Perhaps device_map="auto" can also solve this problem.
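If the goal is just to cap how much memory each GPU (and the CPU) may use, rather than hand-placing every module, the usual pattern with transformers/accelerate is to combine device_map="auto" with a separate max_memory argument. A minimal sketch, reusing the per-device limits from above (the values are only illustrative):
self.llama_model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # let accelerate compute a per-module placement
    max_memory={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB",
                4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"},
)
Because device_map="auto" assigns an explicit device to every module, this should avoid the "doesn't have any device set" ValueError.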
Have you solved this problem? I'm facing a similar issue -- I can perform inference by setting the flag 'low_resource = true' in the config file, but I always get "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" when I set 'low_resource = false'. The above fixes do not work.
@Coronal-Halo Hi, it seems that for @ddoron9 the error occurs only when low_resource=True, and your case is exactly the opposite. low_resource=False is our default setting, and our program runs normally with it. I want to know whether the fixes you tried include the one in https://github.com/OpenGVLab/Ask-Anything/issues/115#issuecomment-1916149322. If possible, could you provide more error information?
Thanks for your reply. I solved this problem by trying every combination of putting variables to cpu vs. gpu.
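(For anyone hitting the same cuda:0 vs cpu mismatch, the general pattern behind this kind of fix is to move each input onto the device of the module that consumes it rather than assuming a fixed device. A minimal illustration, not the exact edits made here; module and inputs are placeholder names:)
# Move an input tensor onto whatever device the consuming module's weights live on.
target_device = next(module.parameters()).device
inputs = inputs.to(target_device)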
Hi, I tried to load the model on dual 4090s and still faced the same error after applying the changes. I looked into the debugger and realized that the input tensor's device is switched automatically by a pre-forward hook, which I believe is installed by huggingface accelerate when device_map="auto" is set. Here are the steps to reproduce the same behavior:
1. Check the hf_device_map after loading the model and find the index of the layer where the device number changes (see the sketch after these steps):
...
'model.layers.15': 0,
'model.layers.16': 1,
...
In my case, it is model.layers.16.
2. Set a conditional breakpoint at https://github.com/OpenGVLab/Ask-Anything/blob/fedc48692dc05ea778bda8eaa978eb0bd85c572d/video_chat2/models/blip2/modeling_llama.py#L565-L572
3. Add hidden_states.device to the watch list and step into the function. The device changes from device(type='cuda', index=0) to device(type='cuda', index=1). Besides, I checked the device of self.input_layernorm.weight and it is located on cuda:0. Therefore, it raises the error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
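For step 1, a minimal sketch of how one might locate that split point programmatically, assuming the model was loaded through accelerate with device_map="auto" so it exposes hf_device_map:
# Walk the accelerate placement map and report where consecutive modules land on
# different devices -- that boundary is where hidden_states gets moved between GPUs.
prev_name, prev_dev = None, None
for name, dev in model.llama_model.hf_device_map.items():
    if prev_dev is not None and dev != prev_dev:
        print(f"placement changes between {prev_name} ({prev_dev}) and {name} ({dev})")
    prev_name, prev_dev = name, dev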
I did not encounter a similar issue when I loaded v1 with the same settings for inference. Is it because v1 uses the original llama while v2 doesn't? Is there any workaround or fix here? Thanks.
> Thanks for your reply. I solved this problem by trying every combination of putting variables to cpu vs. gpu.
Can you provide a specific example of the allocation code? I’m currently facing this issue and looking for a solution.