Video-LLaVA
Distributed Inference Doesn't Work
I followed the instructions in the README to install the conda environment and run the video inference sample code. I get the following error:
line 289, in prepare_inputs_labels_for_multimodal
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:6! (when checking argument for argument tensors in method wrapper_CUDA_cat)
The code runs successfully when I limit the visible cuda devices to a single GPU, e.g. CUDA_VISIBLE_DEVICES=0.
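For context, the failure is just `torch.cat` refusing to concatenate tensors that live on different GPUs. A minimal standalone illustration (not Video-LLaVA code, assuming at least two GPUs are visible):

```python
import torch

# Minimal illustration of the failure mode: torch.cat cannot concatenate
# tensors that live on different GPUs.
a = torch.randn(4, 8, device="cuda:0")
b = torch.randn(4, 8, device="cuda:1")

try:
    torch.cat([a, b])
except RuntimeError as err:
    print(err)  # Expected all tensors to be on the same device ...

# Moving everything onto one device first makes the concatenation work.
merged = torch.cat([t.to("cuda:0") for t in (a, b)])
print(merged.device)  # cuda:0
```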
Hello @dfan, what GPU are you using? I'm facing the same problem as you. Have you solved it?
OK guys, I fixed the same problem by passing the flag
--device "cuda:0"
It might help you. Or restrict the visible devices:
CUDA_VISIBLE_DEVICES=0 python your_script.py
Thx ~
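If you prefer to do it from inside the script, the same single-GPU workaround can be set programmatically; a small sketch, assuming it runs before torch touches CUDA:

```python
import os

# Hide every GPU except GPU 0 *before* torch initializes CUDA, so all
# tensors end up on the same single device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after setting the variable on purpose
print(torch.cuda.device_count())  # 1
```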
@dfan , were you able to run it with multiple GPUs? I also need distributed inference.
@dfan, I've fixed it in https://github.com/PKU-YuanGroup/Video-LLaVA/pull/145. We no longer need to restrict inference to a single device (e.g., cuda:0). With this PR, we can distribute the inference across as many GPUs as we want (e.g., cuda:0,1 for GPUs 0 and 1, or cuda for all available GPUs).
@LinB203, please check this.
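For anyone who wants the general idea: the PR moves the intermediate tensors onto a common device inside prepare_inputs_labels_for_multimodal so that the model itself can stay sharded across GPUs. A rough sketch of the sharded loading side using Hugging Face's device_map (the checkpoint id and loader below are only illustrative; Video-LLaVA is normally loaded through the repo's own helper):

```python
import torch
from transformers import AutoModelForCausalLM

# Sharded loading: device_map="auto" lets accelerate place different layers
# on different GPUs instead of forcing everything onto cuda:0.
# The checkpoint id below is only an example.
model = AutoModelForCausalLM.from_pretrained(
    "LanguageBind/Video-LLaVA-7B",
    torch_dtype=torch.float16,
    device_map="auto",
)
print(model.hf_device_map)  # which layers landed on which GPU
```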
@shouborno, I'm having the same issue. Just to make sure I understand your fix: doesn't it just write everything onto one device again? How can I use all available GPUs?
It should allow using all your GPUs. For example, if your VRAM runs out of allocatable memory with a single GPU, that shouldn't happen with this LLaVA distributed inference fix.
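A quick way to check that the model actually got sharded (rather than everything landing on one GPU again) is to look at per-device memory after loading; an illustrative snippet:

```python
import torch

# After loading with a device map, each visible GPU should hold part of
# the weights; print the per-GPU allocation to verify the sharding.
for i in range(torch.cuda.device_count()):
    used = torch.cuda.memory_allocated(i) / 1024**3
    total = torch.cuda.get_device_properties(i).total_memory / 1024**3
    print(f"cuda:{i}: {used:.1f} GiB allocated of {total:.1f} GiB")
```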
After implementing your proposed fix:

```python
[...]
cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))
# Move every embedding chunk onto the model's device before concatenating.
cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
cur_new_labels = torch.cat(cur_new_labels)
```
I get the following error instead:

```
File ~/anaconda3/envs/languagebind/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~/anaconda3/envs/languagebind/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.

File ~/anaconda3/envs/languagebind/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:346, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    343     attn_weights = attn_weights + attention_mask
    345 # upcast attention to fp32
--> 346 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
    347 attn_output = torch.matmul(attn_weights, value_states)
    349 if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.41 GiB. GPU 0 has a total capacity of 14.58 GiB of which 940.50 MiB is free.
```
Running it without your fix throws the 'expected' error:
line 289, in prepare_inputs_labels_for_multimodal
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:6! (when checking argument for argument tensors in method wrapper_CUDA_cat)
@dandre0102, it might depend on your data and the number of tokens. For example, for my use case, with a total of 48 GB of GPU memory (cuda:0,1), I need 4-bit quantization. With cuda:0 or cuda:1 on their own (24 GB), it runs out of memory even with 4-bit quantization.
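For reference, this is roughly the loading setup I mean; a sketch assuming the transformers/bitsandbytes path (the checkpoint id is only illustrative, not necessarily how the repo loads it):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit weights + device_map="auto": the quantized model is spread over
# both 24 GB GPUs (cuda:0 and cuda:1) instead of overflowing one of them.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "LanguageBind/Video-LLaVA-7B",  # illustrative checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)
```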