[Usage] Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory Aborted
Describe the issue
Issue: I can see the Gradio web interface, but as soon as I use the example prompt, the model worker crashes. I use WSL 2 and every setup step works; I just cannot get any response after I enter a prompt. All I see in Gradio is: "NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE."
System: Windows 10, WSL 2 (Ubuntu), pip 23.3.2, Python 3.10.12, RTX 3090, 64 GB RAM, Ryzen 9 3950X
Command:
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
Log:
2024-01-30 22:11:49 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='liuhaotian/llava-v1.5-13b', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False)
2024-01-30 22:11:49 | INFO | model_worker | Loading the model llava-v1.5-13b on worker c5ced3 ...
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|████████████████████████████████████▋ | 1/3 [00:06<00:12, 6.42s/it]
Loading checkpoint shards: 67%|█████████████████████████████████████████████████████████████████████████▎ | 2/3 [00:12<00:06, 6.47s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00, 5.07s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00, 5.45s/it]
2024-01-30 22:12:07 | ERROR | stderr |
2024-01-30 22:12:11 | INFO | model_worker | Register to controller
2024-01-30 22:12:11 | ERROR | stderr | INFO: Started server process [1348]
2024-01-30 22:12:11 | ERROR | stderr | INFO: Waiting for application startup.
2024-01-30 22:12:11 | ERROR | stderr | INFO: Application startup complete.
2024-01-30 22:12:11 | ERROR | stderr | INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
2024-01-30 22:12:18 | INFO | stdout | INFO: 127.0.0.1:59030 - "POST /worker_get_status HTTP/1.1" 200 OK
2024-01-30 22:12:22 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1
2024-01-30 22:12:22 | INFO | stdout | INFO: 127.0.0.1:40950 - "POST /worker_generate_stream HTTP/1.1" 200 OK
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Aborted
Screenshots:
So I followed this comment: [Conda Pytorch (Pytorch channel) in WSL2 Ubuntu can't find libcudnn shared objects](https://github.com/pytorch/pytorch/issues/85773#issuecomment-1288033297)
And now I get this error and still no model output:
2024-01-30 22:43:08 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 3
2024-01-30 22:43:08 | INFO | stdout | INFO: 127.0.0.1:50792 - "POST /worker_generate_stream HTTP/1.1" 200 OK
2024-01-30 22:43:15 | ERROR | stderr | Exception in thread Thread-4 (generate):
2024-01-30 22:43:15 | ERROR | stderr | Traceback (most recent call last):
2024-01-30 22:43:15 | ERROR | stderr | File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-01-30 22:43:15 | ERROR | stderr | self.run()
2024-01-30 22:43:15 | ERROR | stderr | File "/usr/lib/python3.10/threading.py", line 953, in run
2024-01-30 22:43:15 | ERROR | stderr | self._target(*self._args, **self._kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-01-30 22:43:15 | ERROR | stderr | return func(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 1588, in generate
2024-01-30 22:43:15 | ERROR | stderr | return self.sample(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/generation/utils.py", line 2642, in sample
2024-01-30 22:43:15 | ERROR | stderr | outputs = self(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-30 22:43:15 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/mnt/a/KI/LLaVA/llava/model/language_model/llava_llama.py", line 88, in forward
2024-01-30 22:43:15 | ERROR | stderr | return super().forward(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
2024-01-30 22:43:15 | ERROR | stderr | outputs = self.model(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
2024-01-30 22:43:15 | ERROR | stderr | layer_outputs = decoder_layer(
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-30 22:43:15 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 420, in forward
2024-01-30 22:43:15 | ERROR | stderr | hidden_states = self.post_attention_layernorm(hidden_states)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
2024-01-30 22:43:15 | ERROR | stderr | return forward_call(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
2024-01-30 22:43:15 | ERROR | stderr | output = old_forward(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 89, in forward
2024-01-30 22:43:15 | ERROR | stderr | return self.weight * hidden_states.to(input_dtype)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 220, in _fn
2024-01-30 22:43:15 | ERROR | stderr | result = fn(*args, **kwargs)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims_common/wrappers.py", line 130, in _fn
2024-01-30 22:43:15 | ERROR | stderr | result = fn(**bound.arguments)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_refs/__init__.py", line 926, in _ref
2024-01-30 22:43:15 | ERROR | stderr | return prim(a, b)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_refs/__init__.py", line 1532, in mul
2024-01-30 22:43:15 | ERROR | stderr | return prims.mul(a, b)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_ops.py", line 287, in __call__
2024-01-30 22:43:15 | ERROR | stderr | return self._op(*args, **kwargs or {})
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims/__init__.py", line 346, in _elementwise_meta
2024-01-30 22:43:15 | ERROR | stderr | utils.check_same_device(*args_, allow_cpu_scalar_tensors=True)
2024-01-30 22:43:15 | ERROR | stderr | File "/home/tobias/.local/lib/python3.10/site-packages/torch/_prims_common/__init__.py", line 596, in check_same_device
2024-01-30 22:43:15 | ERROR | stderr | raise RuntimeError(msg)
2024-01-30 22:43:15 | ERROR | stderr | RuntimeError: Tensor on device cuda:0 is not on the expected device meta!
Doing the symlink trick just seemed to make my system fall back to the CPU, and the worker won't start if I turn on quantization. Instead, I added
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
to my .bashrc in WSL, and now I get to the point where the GPU starts doing something when I submit a query in the web UI. Unfortunately, after 5-10 seconds the Gradio client still spits out the NETWORK ERROR output and the GPU goes back down to 0%. None of the programs crash.
Here's a ticket on the WSL github with more details about the LD_LIBRARY_PATH fix: https://github.com/microsoft/WSL/issues/8587
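For reference, a quick way to confirm the LD_LIBRARY_PATH fix actually took effect is to check whether PyTorch can see the GPU from inside WSL. This is just a generic sanity check I'd suggest, not something specific to LLaVA:

```python
# Sanity check that PyTorch can find the WSL-provided CUDA driver.
# Run this in the same shell where LD_LIBRARY_PATH was exported.
import torch

print(torch.cuda.is_available())   # should be True once libcuda.so is found
print(torch.version.cuda)          # CUDA version this PyTorch build targets
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX 3090
```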
Edit: Turns out my other issues were likely VRAM-related; it works fine with a smaller model than the one I was trying.
Thanks, the VRAM tip helped! Running the smaller 7B model works on my machine. It looks like the RTX 3090's 24 GB of VRAM is still not enough for the 13B model.
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b
How much VRAM do I need for the 13B model, and how can I see/calculate the VRAM needed for a model to run?
You can turn on quantization to reduce the VRAM needed (this will reduce accuracy as well). I was testing it with 4-bit quantization, but the 13B model might fit with 8-bit on a 3090. The flags are --load-4bit and --load-8bit; just add one to the call that starts the worker, as in the example below. The details are in the README.
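For example (adapting the worker command from above; I haven't verified the exact memory usage of the 13B model at 8-bit myself):
python3 -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b --load-8bit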
I don't know exactly how much VRAM the individual models need, but the 34B one was a bit too much for my 4090 even at 4-bit.
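As a rough rule of thumb (my own back-of-the-envelope estimate: it only counts the weights and ignores the KV cache, activations, and framework overhead, so treat it as a lower bound), the weights alone need roughly parameter_count × bytes_per_parameter:

```python
# Rough lower-bound VRAM estimate for the model weights alone.
# Real usage is higher because of the KV cache, activations and overhead.
def weight_vram_gib(params_billion: float, bits_per_param: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1024**3

for bits in (16, 8, 4):
    print(f"13B @ {bits}-bit: ~{weight_vram_gib(13, bits):.1f} GiB")
# 13B @ 16-bit: ~24.2 GiB  -> already tight on a 24 GB card before any overhead
# 13B @ 8-bit:  ~12.1 GiB
# 13B @ 4-bit:  ~6.1 GiB
```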
This is due to a VRAM OOM issue. We also recently added the option to enable flash attention for inference, which further reduces memory usage.
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.6-34b --load-4bit --use-flash-attn