LLaVA
[Usage] llava-v1.6-mistral-7b will load in the demo, but llava-v1.6-34b will not.
Describe the issue
Issue: When starting a worker with the 34B version of the 1.6 model, the worker crashes on the first inference. I've verified that the mistral-7b version works and I can run the demo with it; this only happens with the 34B:
Command:
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ~/models/liuhaotian_llava-v1.6-34b/
Log:
[2024-01-31 22:40:12,336] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-01-31 22:40:12 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='/home/iceman/models/liuhaotian_llava-v1.6-34b/', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False) 2024-01-31 22:40:12 | INFO | model_worker | Loading the model liuhaotian_llava-v1.6-34b on worker b95d53 ... You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors. Loading checkpoint shards: 0%| | 0/15 [00:00<?, ?it/s]Loading checkpoint shards: 7%|███▊ | 1/15 [00:01<00:24, 1.78s/it]Loading checkpoint shards: 13%|███████▌ | 2/15 [00:03<00:23, 1.78s/it]Loading checkpoint shards: 20%|███████████▍ | 3/15 [00:05<00:21, 1.82s/it]Loading checkpoint shards: 27%|███████████████▏ | 4/15 [00:07<00:19, 1.81s/it]Loading checkpoint shards: 33%|███████████████████ | 5/15 [00:08<00:17, 1.80s/it]Loading checkpoint shards: 40%|██████████████████████▊ | 6/15 [00:10<00:16, 1.83s/it]Loading checkpoint shards: 47%|██████████████████████████▌ | 7/15 [00:12<00:14, 1.81s/it]Loading checkpoint shards: 53%|██████████████████████████████▍ | 8/15 [00:14<00:12, 1.80s/it]Loading checkpoint shards: 60%|██████████████████████████████████▏ | 9/15 [00:16<00:10, 1.82s/it]Loading checkpoint shards: 67%|█████████████████████████████████████▎ | 10/15 [00:18<00:09, 1.81s/it]Loading checkpoint shards: 73%|█████████████████████████████████████████ | 11/15 [00:19<00:07, 1.80s/it]Loading checkpoint shards: 80%|████████████████████████████████████████████▊ | 12/15 [00:21<00:05, 1.82s/it]Loading checkpoint shards: 87%|████████████████████████████████████████████████▌ | 13/15 [00:23<00:03, 1.81s/it]Loading checkpoint shards: 93%|████████████████████████████████████████████████████▎ | 14/15 [00:25<00:01, 1.80s/it]Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 15/15 [00:26<00:00, 1.49s/it]Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 15/15 [00:26<00:00, 1.74s/it]2024-01-31 22:40:42 | ERROR | stderr |
2024-01-31 22:40:43 | INFO | model_worker | Register to controller
2024-01-31 22:40:43 | ERROR | stderr | INFO: Started server process [7458]
2024-01-31 22:40:43 | ERROR | stderr | INFO: Waiting for application startup.
2024-01-31 22:40:43 | ERROR | stderr | INFO: Application startup complete.
2024-01-31 22:40:43 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
2024-01-31 22:40:50 | INFO | stdout | INFO: 127.0.0.1:39398 - "POST /worker_get_status HTTP/1.1" 200 OK
2024-01-31 22:40:54 | INFO | model_worker | Send heart beat. Models: ['liuhaotian_llava-v1.6-34b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1
2024-01-31 22:40:54 | INFO | stdout | INFO: 127.0.0.1:39402 - "POST /worker_generate_stream HTTP/1.1" 200 OK
2024-01-31 22:40:54 | ERROR | stderr | Exception in thread Thread-3 (generate):
2024-01-31 22:40:54 | ERROR | stderr | Traceback (most recent call last):
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-01-31 22:40:54 | ERROR | stderr | self.run()
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/threading.py", line 953, in run
2024-01-31 22:40:54 | ERROR | stderr | self._target(*self._args, **self._kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-01-31 22:40:54 | ERROR | stderr | return func(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/model/language_model/llava_llama.py", line 125, in generate
2024-01-31 22:40:54 | ERROR | stderr | ) = self.prepare_inputs_labels_for_multimodal(
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/model/llava_arch.py", line 157, in prepare_inputs_labels_for_multimodal
2024-01-31 22:40:54 | ERROR | stderr | image_features = self.encode_images(concat_images)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/model/llava_arch.py", line 141, in encode_images
2024-01-31 22:40:54 | ERROR | stderr | image_features = self.get_model().get_vision_tower()(images)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-31 22:40:54 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-31 22:40:54 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-01-31 22:40:54 | ERROR | stderr | return func(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/model/multimodal_encoder/clip_encoder.py", line 50, in forward
2024-01-31 22:40:54 | ERROR | stderr | image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-31 22:40:54 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-31 22:40:54 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 917, in forward
2024-01-31 22:40:54 | ERROR | stderr | return self.vision_model(
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-31 22:40:54 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-31 22:40:54 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 841, in forward
2024-01-31 22:40:54 | ERROR | stderr | hidden_states = self.embeddings(pixel_values)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-31 22:40:54 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-31 22:40:54 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 182, in forward
2024-01-31 22:40:54 | ERROR | stderr | patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-31 22:40:54 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-31 22:40:54 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
2024-01-31 22:40:54 | ERROR | stderr | return self._conv_forward(input, self.weight, self.bias)
2024-01-31 22:40:54 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
2024-01-31 22:40:54 | ERROR | stderr | return F.conv2d(input, weight, bias, self.stride,
2024-01-31 22:40:54 | ERROR | stderr | RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)
Since the error complains about tensors on two CUDA devices (this is a 2x6000 workstation), I tried running with CUDA_VISIBLE_DEVICES=0 to restrict it to a single card, but that doesn't work either: the worker never launches successfully, crashing hard before it communicates with the gradio process:
Command:
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ~/models/liuhaotian_llava-v1.6-34b/
Log:
[2024-01-31 22:46:17,323] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-01-31 22:46:17 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='/home/iceman/models/liuhaotian_llava-v1.6-34b/', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False)
2024-01-31 22:46:17 | INFO | model_worker | Loading the model liuhaotian_llava-v1.6-34b on worker f19483 ...
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|████████████████| 15/15 [00:24<00:00, 1.66s/it]
2024-01-31 22:46:46 | ERROR | stderr |
2024-01-31 22:46:46 | ERROR | stderr | Traceback (most recent call last):
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2024-01-31 22:46:46 | ERROR | stderr | return _run_code(code, main_globals, None,
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/miniconda3/envs/llava/lib/python3.10/runpy.py", line 86, in _run_code
2024-01-31 22:46:46 | ERROR | stderr | exec(code, run_globals)
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/serve/model_worker.py", line 277, in <module>
2024-01-31 22:46:46 | ERROR | stderr | worker = ModelWorker(args.controller_address,
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/serve/model_worker.py", line 65, in __init__
2024-01-31 22:46:46 | ERROR | stderr | self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/src/LLaVA/llava/model/builder.py", line 151, in load_pretrained_model
2024-01-31 22:46:46 | ERROR | stderr | vision_tower.to(device=device, dtype=torch.float16)
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
2024-01-31 22:46:46 | ERROR | stderr | return self._apply(convert)
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2024-01-31 22:46:46 | ERROR | stderr | module._apply(fn)
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2024-01-31 22:46:46 | ERROR | stderr | module._apply(fn)
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
2024-01-31 22:46:46 | ERROR | stderr | module._apply(fn)
2024-01-31 22:46:46 | ERROR | stderr | [Previous line repeated 1 more time]
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
2024-01-31 22:46:46 | ERROR | stderr | param_applied = fn(param)
2024-01-31 22:46:46 | ERROR | stderr | File "/home/iceman/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
2024-01-31 22:46:46 | ERROR | stderr | return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2024-01-31 22:46:46 | ERROR | stderr | NotImplementedError: Cannot copy out of meta tensor; no data!
This was run at commit c878cc3e66f75eb8227870be3d30268789913f82.
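In case it helps with debugging, here is my reading of the first traceback (just a sketch, not verified): under device_map="auto" the CLIP vision tower's patch-embedding conv can end up on a different GPU than the one clip_encoder.py moves the pixel tensor to, so the very first conv call fails. A hypothetical tweak to llava/model/multimodal_encoder/clip_encoder.py that sends the images to wherever the conv weights actually landed would look roughly like this (module path taken from the traceback, otherwise untested):

# Hypothetical change inside CLIPVisionTower.forward (clip_encoder.py), untested:
# route the pixel tensor to the device where the patch-embedding weights live,
# rather than self.device, which can disagree when the model is sharded.
tower_device = self.vision_tower.vision_model.embeddings.patch_embedding.weight.device
image_forward_outs = self.vision_tower(
    images.to(device=tower_device, dtype=self.dtype),
    output_hidden_states=True,
)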
Same problem
Same issue #1039
Bumping the VRAM to 80GB appears to have resolved it for me. Possibly an OOM error?
That would explain why I get the error when I restrict CUDA visibility to a single 48GB card, but it doesn't solve the main problem: two 48GB cards should(tm) provide enough VRAM, and the main bug here is that it isn't splitting between the cards.
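For reference, this is roughly how I'd check how the weights are actually being split (a sketch using the same builder call the worker uses and the path from my log above; untested):

from llava.model.builder import load_pretrained_model

# Load the same way model_worker does (device_map defaults to "auto") and then
# inspect how accelerate distributed the modules across the two cards.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="/home/iceman/models/liuhaotian_llava-v1.6-34b/",
    model_base=None,
    model_name="liuhaotian_llava-v1.6-34b",
)

# hf_device_map is attached by transformers/accelerate when a device_map is used.
for module_name, device in getattr(model, "hf_device_map", {}).items():
    print(module_name, "->", device)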
Same issue here. Were you able to fix that? @levi @iceman-p
I suspect this is related to device="auto" and low_cpu_mem_usage=True.
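A minimal illustration of the second failure, assuming that is indeed what's going on (this snippet is illustrative only, not taken from the LLaVA code): with low_cpu_mem_usage=True plus a device map, any weights accelerate has not materialized on a real device stay on the meta device, and calling .to() on such a module raises exactly the error in the second log:

import torch
import torch.nn as nn

# A parameter created on the "meta" device has shape/dtype metadata but no
# storage; this is how accelerate defers loading weights.
layer = nn.Linear(4, 4, device="meta")

try:
    # Same call pattern as builder.py's vision_tower.to(device=..., dtype=...).
    layer.to(device="cuda:0", dtype=torch.float16)
except NotImplementedError as err:
    print(err)  # NotImplementedError: Cannot copy out of meta tensor; no data!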
@iceman-p Hi, how did you load the 7B one? I'm having trouble loading it; I get https://github.com/haotian-liu/LLaVA/issues/1112