
BUG: cogvlm2-llama3-chinese-chat-19B model crashed -- Expected all tensors to be on the same device

majestichou opened this issue 1 year ago · 2 comments

Describe the bug

Due to network restrictions, I cannot use Xinference to pull models online. I downloaded the weights of cogvlm2-llama3-chinese-chat-19B to a local machine and then used Xinference (running in a Docker container) to register them as a custom model named cogvlm2-llama3-chinese-chat-19B-self. After that, I launched cogvlm2-llama3-chinese-chat-19B-self on 4 GPUs. The launch failed with the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!
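For background on why multi-GPU loading can produce this: when weights are dispatched with Accelerate's device_map="auto", sibling submodules can land on different GPUs, and any op that combines their outputs fails with exactly this error. A common Accelerate-level mitigation is to compute a device map that never splits a transformer block across devices. A minimal sketch under that assumption; the module class names below are guesses and should be checked against the model's modeling_cogvlm.py and visual.py:

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoModelForCausalLM

MODEL_PATH = "/root/models/cogvlm2-llama3-chinese-chat-19B"

# Build the module tree without allocating any weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
    )

# Keep each decoder/vision block whole on one GPU so residual adds
# inside a block never straddle devices. The class names here are
# assumptions to verify against the model's remote code.
device_map = infer_auto_device_map(
    model,
    max_memory={i: "20GiB" for i in range(4)},
    no_split_module_classes=["CogVLMDecoderLayer", "TransformerLayer"],
)

model = load_checkpoint_and_dispatch(model, MODEL_PATH, device_map=device_map)
```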

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. I use the Docker image xprobe/xinference:v0.12.0.
  2. Full stack trace of the error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 649, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 390, in forward
    images_features = self.encode_images(images)
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/modeling_cogvlm.py", line 362, in encode_images
    images_features = self.vision(images)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 130, in forward
    x = self.transformer(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 94, in forward
    hidden_states = layer_module(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/cogvlm2-llama3-chinese-chat-19B/visual.py", line 83, in forward
    output = mlp_input + mlp_output
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3!
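The last frame is the residual add inside a vision transformer block (visual.py line 83): the hidden states entering the block and the MLP output ended up on different GPUs. PyTorch raises this error for any elementwise op across devices; a standalone sketch that reproduces the same error class (nothing Xinference-specific; requires at least 4 visible GPUs, or adjust the device indices):

```python
import torch

# Any op combining tensors that live on different devices raises
# the same RuntimeError as in the traceback above.
a = torch.randn(4, device="cuda:2")
b = torch.randn(4, device="cuda:3")
c = a + b  # RuntimeError: Expected all tensors to be on the same device, ...
```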

  3. Reproduce steps:
     1. Download the cogvlm2-llama3-chinese-chat-19B model from Hugging Face to a local directory named /home/llm/image-model/.
     2. Start the container: docker run -d -v /home/llm/image-model:/root/models -p 9998:9997 --gpus all xprobe/xinference:v0.12.0 xinference-local -H 0.0.0.0
     3. Open http://localhost:9998/ui in a browser.
     4. Click Register Model, select the IMAGE MODEL tab, enter the model name "cogvlm2-llama3-chinese-chat-19B-self" and the path "/root/models/cogvlm2-llama3-chinese-chat-19B", then click Register Model.
     5. Launch the model cogvlm2-llama3-chinese-chat-19B-self (the same launch can be scripted; see the client sketch after this list).
  4. I used Dify to connect to the model served by Xinference and uploaded a picture to ask a question. The error occurred: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:3! (The same request can also be sent without Dify via the OpenAI-compatible API; see the second sketch after this list.)
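For reference, the launch in step 3.5 can be scripted against the container with the Xinference Python client. A sketch; the exact launch kwargs for a custom vision model may differ by Xinference version, so treat the arguments as assumptions to check against the docs:

```python
from xinference.client import Client

# The docker run above maps container port 9997 to host port 9998.
client = Client("http://localhost:9998")

# Assumption: n_gpu=4 spreads the weights across the four visible
# GPUs, which is the configuration that triggers the crash.
model_uid = client.launch_model(
    model_name="cogvlm2-llama3-chinese-chat-19B-self",
    n_gpu=4,
)
print("launched:", model_uid)
```

And a sketch that sends an image question directly to Xinference's OpenAI-compatible endpoint, to rule Dify out (the message format below is the OpenAI vision format, which I assume Xinference accepts for chat-vision models; test.jpg is a hypothetical local file):

```python
import base64
from openai import OpenAI

oai = OpenAI(base_url="http://localhost:9998/v1", api_key="not-needed")

# Encode a local test image as a base64 data URL.
with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = oai.chat.completions.create(
    model="cogvlm2-llama3-chinese-chat-19B-self",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this picture?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```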

Expected behavior

Xinference should be able to run cogvlm2-llama3-chinese-chat-19B inference across multiple GPUs.
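As a diagnostic (not a fix), launching on a single GPU should confirm that the weights and the model's remote code are fine and that the failure is specific to multi-GPU sharding. Note that a 19B model needs roughly 40 GB in bf16, so this only works on a card with ~80 GB of memory or with a quantized variant; the kwargs carry the same assumptions as the launch sketch above:

```python
from xinference.client import Client

client = Client("http://localhost:9998")

# Assumption: n_gpu=1 pins the whole model to one device, so no
# cross-device tensor op can occur. Requires a card large enough
# to hold the full model (about 40 GB in bf16 for 19B params).
model_uid = client.launch_model(
    model_name="cogvlm2-llama3-chinese-chat-19B-self",
    n_gpu=1,
)
```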


majestichou · Jun 12 '24

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] · Aug 06 '24

Same issue here!

CA-TT-AC · Oct 21 '24