[Usage] Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory Aborted
Describe the issue
Issue: I can see the Gradio web interface, but as soon as I use the example prompt, the model worker crashes. I use WSL 2 and every setup step works; I just cannot get any response after I enter a prompt. All I see in Gradio is: "NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE."
System: Windows 10, WSL 2 (Ubuntu), pip 23.3.2, Python 3.10.12, RTX 3090, 64 GB RAM, Ryzen 9 3950X
Command:
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
Log:
2024-01-30 22:11:49 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='liuhaotian/llava-v1.5-13b', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False)
2024-01-30 22:11:49 | INFO | model_worker | Loading the model llava-v1.5-13b on worker c5ced3 ...
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|████████████████████████████████████▋ | 1/3 [00:06<00:12, 6.42s/it]
Loading checkpoint shards: 67%|█████████████████████████████████████████████████████████████████████████▎ | 2/3 [00:12<00:06, 6.47s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00, 5.07s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00, 5.45s/it]
2024-01-30 22:12:07 | ERROR | stderr |
2024-01-30 22:12:11 | INFO | model_worker | Register to controller
2024-01-30 22:12:11 | ERROR | stderr | INFO: Started server process [1348]
2024-01-30 22:12:11 | ERROR | stderr | INFO: Waiting for application startup.
2024-01-30 22:12:11 | ERROR | stderr | INFO: Application startup complete.
2024-01-30 22:12:11 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
2024-01-30 22:12:18 | INFO | stdout | INFO: 127.0.0.1:59030 - "POST /worker_get_status HTTP/1.1" 200 OK
2024-01-30 22:12:22 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1
2024-01-30 22:12:22 | INFO | stdout | INFO: 127.0.0.1:40950 - "POST /worker_generate_stream HTTP/1.1" 200 OK
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Aborted
Screenshots:
So I followed this comment: [Conda Pytorch (Pytorch channel) in WSL2 Ubuntu can't find libcudnn shared objects](https://github.com/pytorch/pytorch/issues/85773#issuecomment-1288033297)
And now I get this error and still no model output:
2024-01-30 22:43:08 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 3
2024-01-30 22:43:08 | INFO | stdout | INFO: 127.0.0.1:50792 - "POST /worker_generate_stream HTTP/1.1" 200 OK
2024-01-30 22:43:15 | ERROR | stderr | Exception in thread Thread-4 (generate):
2024-01-30 22:43:15 | ERROR | stderr | Traceback (most recent call last):
2024-01-30 22:43:15 | ERROR | stderr | File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-01-30 22:43:15 | ERROR | stderr | self.run()
2024-01-30 22:43:15 | ERROR | stderr | File "/usr/lib/python3.10/threading.py", line 953, in run
2024-01-30 22:43:15 | ERROR | stderr | self._target(*self._args, **self._kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-01-30 22:43:15 | ERROR | stderr | return func(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
2024-01-30 22:43:15 | ERROR | stderr | return self.sample(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2642, in sample
2024-01-30 22:43:15 | ERROR | stderr | outputs = self(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-30 22:43:15 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/mnt/a/KI/LLaVA/llava/model/language_model/llava_llama.py", line 88, in forward
2024-01-30 22:43:15 | ERROR | stderr | return super().forward(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
2024-01-30 22:43:15 | ERROR | stderr | outputs = self.model(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
2024-01-30 22:43:15 | ERROR | stderr | layer_outputs = decoder_layer(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-30 22:43:15 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 420, in forward
2024-01-30 22:43:15 | ERROR | stderr | hidden_states = self.post_attention_layernorm(hidden_states)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-30 22:43:15 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 89, in forward
2024-01-30 22:43:15 | ERROR | stderr | return self.weight * hidden_states.to(input_dtype)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 220, in _fn
2024-01-30 22:43:15 | ERROR | stderr | result = fn(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 130, in _fn
2024-01-30 22:43:15 | ERROR | stderr | result = fn(**bound.arguments)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_refs/__init__.py", line 926, in _ref
2024-01-30 22:43:15 | ERROR | stderr | return prim(a, b)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_refs/__init__.py", line 1532, in mul
2024-01-30 22:43:15 | ERROR | stderr | return prims.mul(a, b)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_ops.py", line 287, in __call__
2024-01-30 22:43:15 | ERROR | stderr | return self._op(*args, **kwargs or {})
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims/__init__.py", line 346, in _elementwise_meta
2024-01-30 22:43:15 | ERROR | stderr | utils.check_same_device(*args_, allow_cpu_scalar_tensors=True)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims_common/__init__.py", line 596, in check_same_device
2024-01-30 22:43:15 | ERROR | stderr | raise RuntimeError(msg)
2024-01-30 22:43:15 | ERROR | stderr | RuntimeError: Tensor on device cuda:0 is not on the expected device meta!
Doing the symlink trick just seemed to make my system fall back to the CPU, and the worker won't start if I turn on quantization. Instead, I added
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
to my .bashrc in WSL, and now I get to the point where the GPU starts doing something when I submit a query in the web UI. Unfortunately, after 5-10 seconds the Gradio client still spits out the NETWORK ERROR output and the GPU goes back down to 0%. None of the programs crash.
Here's a ticket on the WSL github with more details about the LD_LIBRARY_PATH fix: https://github.com/microsoft/WSL/issues/8587
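For reference, a quick way to confirm the LD_LIBRARY_PATH fix actually took effect is to check whether PyTorch can see the GPU from inside WSL. This is just a generic sanity check I'd suggest, not something specific to LLaVA:

```python
# Sanity check that PyTorch can find the WSL-provided CUDA driver.
# Run this in the same shell where LD_LIBRARY_PATH was exported.
import torch

print(torch.cuda.is_available())   # should be True once libcuda.so is found
print(torch.version.cuda)          # CUDA version this PyTorch build targets
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX 3090
```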
Edit: Turns out my other issues were likely VRAM-related; it works fine with a smaller model than the one I was trying.
Thanks, the VRAM tip helped! Running the smaller 7B model works on my machine. It looks like the RTX 3090's 24 GB of VRAM is still not enough for the 13B model.
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b
How much VRAM do I need for the 13B model, and how can I see/calculate the VRAM needed for a model to run?
You can turn on quantization to reduce the VRAM needed (this will reduce accuracy as well). I was testing it with 4-bit quantization, but the 13B model might fit with 8-bit on a 3090. The flags are --load-4bit and --load-8bit; just add one to the call that starts the worker, as in the example below. The details are in the README.
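For example (adapting the worker command from above; I haven't verified the exact memory usage of the 13B model at 8-bit myself):
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b --load-8bit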
I don't know exactly how much VRAM the individual models need, but the 34B one was a bit too much for my 4090 even at 4-bit.
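As a rough rule of thumb (my own back-of-the-envelope estimate: it only counts the weights and ignores the KV cache, activations, and framework overhead, so treat it as a lower bound), the weights alone need roughly parameter_count × bytes_per_parameter:

```python
# Rough lower-bound VRAM estimate for the model weights alone.
# Real usage is higher because of the KV cache, activations and overhead.
def weight_vram_gib(params_billion: float, bits_per_param: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024**3

for bits in (16, 8, 4):
    print(f"13B @ {bits}-bit: ~{weight_vram_gib(13, bits):.1f} GiB")
# 13B @ 16-bit: ~24.2 GiB  -> already tight on a 24 GB card before any overhead
# 13B @ 8-bit:  ~12.1 GiB
# 13B @ 4-bit:  ~6.1 GiB
```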
This is due to a VRAM OOM issue. We also recently added the option to enable flash attention for inference, which further reduces memory usage.
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.6-34b --load-4bit --use-flash-attn