Ask-Anything
Issues with Running video_chat2 on Multi-GPU Setup with Nvidia Titan Xp
Hi,
I'm currently attempting to run the video_chat2 model on a multi-GPU setup consisting of 8 Nvidia Titan Xp GPUs, each with 12GiB of memory. I'm using the mvbench.ipynb notebook from the Ask-Anything/video_chat2 repository for this purpose.
To ensure the model loads on my GPUs, I've enabled the low_resource option in config.json. Additionally, I've specified device_map="auto" during the initialization of the llama_model in videochat2_it.py. The relevant code snippet is as follows:
if self.low_resource:
    self.llama_model = LlamaForCausalLM.from_pretrained(
        llama_model_path,
        load_in_8bit=True,
        device_map="auto",
        torch_dtype=torch.float16,
    )
However, when I execute the code, I encounter multiple errors originating from the following lines:
seg_embs = [model.llama_model.base_model.model.model.embed_tokens(seg_t).cpu() for seg_t in seg_tokens] # get_context_emb
outputs = model.llama_model.generate()
Could you provide some guidance or suggestions on how to effectively perform inference with sharded models in this multi-GPU environment?
Thank you for your incredible work.
Thank you for your interest in our work. Could you provide more error information? We haven't attempted to load the model in shards before. We've successfully run it on a graphics card with at least 16GB of VRAM. We're happy to research this issue together if you can provide the error information from these lines.
I'm not sure whether all of these errors are reproducible in the unmodified repository, because I've modified demo.py and mvbench.py to run inference without gradio. The changes are not significant, but for reference I'm attaching my code file: infer.txt
In any case, enabling low_resource does return multiple device-mismatch errors inside llama_model. The locations of the issue are listed below; a few more locations may show up beyond these examples.
- https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/conversation.py#L237-L239
- https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/models/blip2/modeling_llama.py#L74
- https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/models/blip2/modeling_llama.py#L275
Here is one of the errors I get when the low_resource flag is enabled.
Traceback (most recent call last):
File "infer.py", line 88, in <module>
main_results = main()
File "infer.py", line 78, in main
results = ask_questions(chat, chat_state, img_list, questions)
File "infer.py", line 59, in ask_questions
llm_message, _, chat_state = chat.answer(conv=chat_state, img_list=img_list, max_new_tokens=1000, num_beams=num_beams, temperature=temperature)
File "/data1/doyi/Ask-Anything/video_chat2/conversation.py", line 65, in answer
outputs = self.model.llama_model.generate(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/peft/peft_model.py", line 731, in generate
outputs = self.base_model.generate(**kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/generation/utils.py", line 1525, in generate
return self.sample(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/generation/utils.py", line 2622, in sample
outputs = self(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 674, in forward
outputs = self.model(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 563, in forward
layer_outputs = decoder_layer(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 273, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/data1/doyi/Ask-Anything/video_chat2/models/blip2/modeling_llama.py", line 178, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/peft/tuners/lora.py", line 710, in forward
self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
I found a way to perform inference using about 9GiB of GPU memory by enabling low_resource. Here is how I modified the code:
- changing "cuda:0" to self.device at https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/conversation.py#L233
- changing torch_dtype to bfloat16:
self.llama_model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
)
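For the embedding call in get_context_emb (the line quoted at the top of this issue), a device-agnostic version could look roughly like this. This is only a sketch; the exact attribute path depends on which LoRA/8-bit wrappers are active:
# Sketch: look up the embedding table once and move each token segment onto
# the device where that table actually lives, instead of assuming one GPU.
embed_tokens = model.llama_model.base_model.model.model.embed_tokens
seg_embs = [
    embed_tokens(seg_t.to(embed_tokens.weight.device)).cpu()
    for seg_t in seg_tokens
]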
What I'm struggling with now is that I want to run inference without low_resource's int8 model loading, using llama-7b.
self.llama_model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    device_map={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB", 4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"},
)
After setting device_map this way, from_pretrained raises a ValueError:
model = VideoChat2_it(config=cfg.model)
File "/data1/doyi/Ask-Anything/video_chat2/models/videochat2_it.py", line 142, in __init__
self.llama_model = LlamaForCausalLM.from_pretrained(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
) = cls._load_pretrained_model(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/data1/doyi/Ask-Anything/venv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 797, in _load_state_dict_into_meta_model
raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: model.embed_tokens.weight doesn't have any device set.
@ddoron9 For the first error, I think this may be caused by the hard-coding in https://github.com/OpenGVLab/Ask-Anything/blob/389d886eda4e857c02b77f9d94403d76f3826b45/video_chat2/conversation.py#L233
Changing cuda:0 to self.device may solve this problem, like this:
https://github.com/OpenGVLab/Ask-Anything/blob/d57c30fb86d41e510ba6b8fbeb5a532f3f3aaa69/video_chat2/conversation.py#L233
Please make this fix and try again.
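A rough illustration of the kind of change being suggested (paraphrased, not the exact repository line; tokenized_segment is a placeholder name):
# Before: the token ids are pinned to the first GPU
seg_t = tokenized_segment.to("cuda:0")
# After: follow whatever device this Chat/model instance was built with
seg_t = tokenized_segment.to(self.device)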
@ddoron9 For the second question, the ValueError indicates that model.embed_tokens.weight doesn't have a device set, suggesting that during the model loading process certain parameters (such as embed_tokens.weight) aren't assigned to any device.
I'm not very familiar with this, but passing device_map={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB", 4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"} doesn't seem to work; that argument is not where you put hardware information for your devices. From my research on huggingface, the keys of this map should be parameter/module names from llama_model, and the values should indicate the device where you intend to place them. You should make sure that every part of the model has a designated device in the device_map, which may involve examining the structure of the LlamaForCausalLM model and how its parameters are initialized and loaded. For example, you can set
{'embed_tokens.weight': 0,
 'embed_tokens.bias': 1,
 'encoder': "cpu"}
Perhaps device_map="auto" can also solve this problem.
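If the goal is just to cap how much memory each GPU (and the CPU) may use, rather than hand-placing every module, the usual pattern with transformers/accelerate is to combine device_map="auto" with a separate max_memory argument. A minimal sketch, reusing the per-device limits from above (the values are only illustrative):
self.llama_model = LlamaForCausalLM.from_pretrained(
    llama_model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # let accelerate compute a per-module placement
    max_memory={0: "4GiB", 1: "12GiB", 2: "12GiB", 3: "12GiB",
                4: "12GiB", 5: "12GiB", 6: "12GiB", "cpu": "30GiB"},
)
Because device_map="auto" assigns an explicit device to every module, this should avoid the "doesn't have any device set" ValueError.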
Have you solved this problem? I'm facing a similar issue -- I can perform inference by setting the flag 'low_resource = true' in the config file, but I always get "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" when I set 'low_resource = false'. The above fixes do not work.
@Coronal-Halo Hi, it seems that for @ddoron9 the error occurs only when low_resource=True, and your case is exactly the opposite. low_resource=False is our default setting, and our program runs normally with it. I want to know whether the fixes you tried include the one in https://github.com/OpenGVLab/Ask-Anything/issues/115#issuecomment-1916149322. If possible, could you provide more error information?
Thanks for your reply. I solved this problem by trying every combination of putting variables to cpu vs. gpu.
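(For anyone hitting the same cuda:0 vs cpu mismatch, the general pattern behind this kind of fix is to move each input onto the device of the module that consumes it rather than assuming a fixed device. A minimal illustration, not the exact edits made here; module and inputs are placeholder names:)
# Move an input tensor onto whatever device the consuming module's weights live on.
target_device = next(module.parameters()).device
inputs = inputs.to(target_device)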
Hi, I tried to load the model on dual 4090s and still faced the same error after applying the changes. I looked into the debugger and realized that the input tensor's device is switched automatically by a pre-forward hook, which I believe is installed by huggingface accelerate when device_map="auto" is set. Here are the steps to reproduce the same behavior:
1. Check the hf_device_map after loading the model and find the index of the layer where the device number changes (see the sketch after these steps):
...
'model.layers.15': 0,
'model.layers.16': 1,
...
In my case, it is model.layers.16.
2. Set a conditional breakpoint at https://github.com/OpenGVLab/Ask-Anything/blob/fedc48692dc05ea778bda8eaa978eb0bd85c572d/video_chat2/models/blip2/modeling_llama.py#L565-L572
3. Add hidden_states.device to the watch list and step into the function. The device changes from device(type='cuda', index=0) to device(type='cuda', index=1). Besides, I checked the device of self.input_layernorm.weight and it is located on cuda:0. Therefore, it raises the error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
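For step 1, a minimal sketch of how one might locate that split point programmatically, assuming the model was loaded through accelerate with device_map="auto" so it exposes hf_device_map:
# Walk the accelerate placement map and report where consecutive modules land on
# different devices -- that boundary is where hidden_states gets moved between GPUs.
prev_name, prev_dev = None, None
for name, dev in model.llama_model.hf_device_map.items():
    if prev_dev is not None and dev != prev_dev:
        print(f"placement changes between {prev_name} ({prev_dev}) and {name} ({dev})")
    prev_name, prev_dev = name, dev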
I did not encounter a similar issue when I loaded v1 with the same settings for inference. Is it because v1 uses the original llama while v2 doesn't? Is there any workaround or fix here? Thanks.
> Thanks for your reply. I solved this problem by trying every combination of putting variables to cpu vs. gpu.
Can you provide a specific example of the allocation code? I’m currently facing this issue and looking for a solution.