Error at inference time when loading the model on multiple GPUs
Hello, deployment and inference work fine on a single A800, but multi-GPU inference fails with:

```
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /lightllm/lightllm/server/router/manager.py:88> exception=EOFError(ConnectionResetError(104, 'Connection reset by peer'))>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 268, in read
    buf = self.sock.recv(min(self.MAX_IO_CHUNK, count))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 227, in prefill_batch
    return await ans
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 187, in func
    await asyncio.to_thread(ans.wait)
  File "/opt/conda/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 277, in read
    raise EOFError(ex)
EOFError: [Errno 104] Connection reset by peer
```

What could be causing this?
Looks like something went wrong with the socket communication.
I'm also hitting an error with multi-GPU inference: chatglm2 fails, while llama2 seems fine.
```
/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:30: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
  torch.addmm(layer_weight.k_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.k_weight_, beta=1.0, alpha=1.0,
/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:33: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
  torch.addmm(layer_weight.v_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.v_weight_, beta=1.0, alpha=1.0,
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /lightllm/lightllm/server/router/manager.py:88> exception=

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 96, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 147, in forward
    logits = self.model.forward(**kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
    return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
  File "/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
    self._context_attention(input_embdings,
  File "/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 83, in _context_attention
    self._post_cache_kv(cache_k, cache_v, infer_state, layer_weight)
  File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 55, in _post_cache_kv
    self._copy_kv_to_mem_cache(cache_k, cache_v, infer_state.prefill_mem_index, mem_manager)
  File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 94, in copy_kv_to_mem_cache
    destindex_copy_kv(key_buffer, mem_index, mem_manager.key_buffer[self.layer_num])
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lightllm/lightllm/common/basemodel/triton_kernel/destindex_copy_kv.py", line 36, in destindex_copy_kv
    assert K.shape[1] == Out.shape[1] and K.shape[2] == Out.shape[2]
AssertionError
Traceback (most recent call last):
File "/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
await self._step()
File "/lightllm/lightllm/server/router/manager.py", line 112, in _step
await self._prefill_batch(self.running_batch)
File "/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 227, in prefill_batch
return await ans
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 189, in func
return ans.value
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async.py", line 108, in value
raise self._obj
_get_exception_class.<locals>.Derived:
```
@UncleFB The dimensions of the small chatglm2 model are a bit unusual, and multi-GPU is not supported for it yet. Besides, for a model of this size, single-GPU performance is already quite good.
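For anyone wondering what the AssertionError above means: `destindex_copy_kv` checks that the per-rank K/V tensor produced by a layer matches the head count and head dim of the preallocated KV-cache buffer. The snippet below is purely illustrative (the shapes and the tp split are hypothetical, not chatglm2's real configuration) of how a tensor-parallel split that the model's KV-head layout doesn't support ends up failing exactly that check.

```python
# Purely illustrative: how a tensor-parallel KV split that doesn't match the
# preallocated KV-cache buffer trips the check in destindex_copy_kv.
# All shapes below are hypothetical, not chatglm2's real configuration.
import torch

tp = 2                       # tensor-parallel world size
num_kv_heads = 2             # MQA-style models have very few KV heads
head_dim = 128
tokens = 6

# KV-cache buffer allocated as if KV heads were not split across GPUs
cache = torch.zeros(1024, num_kv_heads, head_dim)

# K actually produced on one rank after naively splitting the KV projection by tp
k = torch.zeros(tokens, num_kv_heads // tp, head_dim)

# The kernel asserts K.shape[1] == Out.shape[1] and K.shape[2] == Out.shape[2];
# here the head-count dimension disagrees (1 vs 2), so that assert would fire.
print(k.shape[1] == cache.shape[1] and k.shape[2] == cache.shape[2])  # -> False
```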
@llehtahw What do you make of this error?
It looks like the rpyc worker process crashed during prefill; either the exception was swallowed, or it was a segfault.
This happened once while running llama2 70b on 2 A800s for inference.

```
09-15 17:48:28: Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /app/lightllm-main/lightllm/server/router/manager.py:88> exception=EOFError('connection closed by peer')>
Traceback (most recent call last):
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 134, in _step
    await self._decode_batch(self.running_batch)
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 162, in _decode_batch
    ans = await asyncio.gather(*rets)
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 225, in decode_batch
    return await ans
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 178, in func
    await asyncio.to_thread(ans.wait)
  File "/root/miniconda3/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/root/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/async.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
EOFError: connection closed by peer
```
@ChristineSeven It looks like the inference backend died. A wild guess: the parameters may be misconfigured and the inference process got killed by running out of GPU memory. Of course, if this is a shared machine, it's even more likely that someone else killed your inference process.
Launch arguments: `--tp 2 --max_total_token_num 12000 --max_req_input_len 3000 --max_req_total_len 8192 --max_new_tokens 4096 --top_k 30 --top_p 0.85 --temperature 0.5 --do_sample True`. Given the parallelism, max_total_token_num could even be set larger. Based on this configuration, could a misconfigured parameter really be what crashed the inference process? This is the first time I've seen this problem. @hiworldwzj
@ChristineSeven With this configuration there shouldn't be a GPU-memory problem; something else is probably causing it. Could someone have done something on the machine in the background? In my experience, shared machines are very prone to this kind of collateral damage.
We have a scheduling system: GPUs that have been allocated are not assigned to anyone else, and everyone submits jobs through it. Unless one of the few people with direct login access did something on the machine, in theory there is no sharing problem.
@ChristineSeven OK, understood. There may be some other issue, but we'd need a way to reproduce it in order to track it down. Also, for long-running deployments I recommend triton 2.1.0; triton 2.0.0 has a memory-leak bug that can lead to crashes.
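For reference, a quick way to confirm which triton the serving environment actually picks up (a minimal sketch; it assumes the installed distribution is named `triton`, adjust the pin to your deployment):

```python
# Minimal sanity check of the triton version seen by the inference workers.
# Assumes the package is installed under the distribution name "triton".
from importlib.metadata import version

triton_version = version("triton")
print(f"triton {triton_version}")
if triton_version.startswith("2.0."):
    print("Warning: triton 2.0.x has the memory-leak issue mentioned above; "
          "consider upgrading to 2.1.0 for long-running deployments.")
```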
@hiworldwzj That could well be it. I have noticed GPU memory usage trending upward while the server runs.
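In case it helps others confirm the same trend, here is a rough monitoring sketch (illustrative only; it assumes the `pynvml` bindings, e.g. from the `nvidia-ml-py` package, are installed and just logs device-level memory over time):

```python
# Illustrative only: periodically log per-GPU memory usage to spot a leak trend.
# Assumes the pynvml bindings are available (e.g. pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used // (1024 ** 2) for h in handles]
        print(time.strftime("%H:%M:%S"), "used MiB per GPU:", used)
        time.sleep(60)  # sample once a minute; steadily rising values suggest a leak
finally:
    pynvml.nvmlShutdown()
```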
> (quoting the original report: single-A800 inference works, multi-GPU fails with `EOFError: [Errno 104] Connection reset by peer`)
Adding the `--shm-size` parameter when starting docker fixed it.
> (quoting the report above together with the reply that adding `--shm-size` when starting docker fixed it)
Right, NCCL multi-GPU communication does need a fairly large shm-size when the container is started. Thanks for the correction; it's been long enough that I'd forgotten about this environment constraint.
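As a quick sanity check before launching multi-GPU inference inside a container, something like the sketch below can verify that /dev/shm is actually large enough (rough illustration only; the 16 GiB threshold is an arbitrary placeholder, size it to your workload):

```python
# Rough sketch: warn if the container's /dev/shm looks too small for NCCL.
# The 16 GiB threshold below is only an illustrative placeholder.
import shutil

SHM_PATH = "/dev/shm"
MIN_BYTES = 16 * 1024 ** 3

total, used, free = shutil.disk_usage(SHM_PATH)
print(f"/dev/shm total: {total / 1024 ** 3:.1f} GiB, free: {free / 1024 ** 3:.1f} GiB")
if total < MIN_BYTES:
    print("Warning: /dev/shm is small; start the container with a larger --shm-size "
          "(e.g. docker run --shm-size=16g ...), otherwise NCCL's shared-memory "
          "transport may fail.")
```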
@hiworldwzj @wx971025 It doesn't look like a shm-size problem. I launched llama 70b with a 3T shm-size and this still happens.
```
10-12 22:27:39: Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /app/lightllm-main/lightllm/server/router/manager.py:88> exception=EOFError('connection closed by peer')>
Traceback (most recent call last):
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 134, in _step
    await self._decode_batch(self.running_batch)
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 162, in _decode_batch
    ans = await asyncio.gather(*rets)
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 225, in decode_batch
    return await ans
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 178, in func
    await asyncio.to_thread(ans.wait)
  File "/root/miniconda3/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/root/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/async.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
```
@ChristineSeven Sorry, in my case it really was caused by the default shm-size being too small. I'd suggest running `free` to check memory usage.
@wx971025 Right, and the same symptom can have different causes. At the moment, with two GPUs, the process on one GPU has died while the other one looks roughly like this:

```
|   7  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
| N/A   31C    P0    84W / 400W |  78522MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
```
Of the shm-size, roughly 258 GB is in use.
Looking at the source, it seems this failure point was anticipated; there's a comment `# raise if exception` here: https://github.com/ModelTC/lightllm/blob/main/lightllm/server/router/model_infer/model_rpc.py#L204 @llehtahw @hiworldwzj In what scenario did you originally run into this problem?
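For context, my understanding of the pattern around that comment (a simplified sketch of generic rpyc usage, not the actual lightllm code; `remote_prefill` and the connection setup are made up): the router calls the worker through an rpyc async proxy, waits for the AsyncResult in a thread, and then touching `.value` re-raises whatever the worker raised. If the worker process itself dies, the wait instead surfaces as an EOFError / connection reset like the ones in this thread.

```python
# Simplified sketch of the rpyc async call pattern (not the lightllm source).
# `remote_prefill` is a hypothetical exposed method on the worker service.
import asyncio
import rpyc


async def call_remote_prefill(conn: rpyc.Connection, batch_id: int):
    async_prefill = rpyc.async_(conn.root.remote_prefill)
    ans = async_prefill(batch_id)
    # Block in a worker thread until the remote call finishes; if the remote
    # process crashes, this is where EOFError("connection closed by peer") or
    # ConnectionResetError shows up.
    await asyncio.to_thread(ans.wait)
    # "raise if exception": accessing .value re-raises the remote exception
    # (the "Remote Traceback" seen above) if the call failed on the worker.
    return ans.value
```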
@hiworldwzj Any suggestions for how to fix it?
> (quoting the earlier chatglm2 multi-GPU report and its `destindex_copy_kv` AssertionError traceback)
Has anyone managed to solve this problem?
@CXH19940504 Because chatglm2's architecture is a bit unusual, multi-GPU is not supported for it yet; single GPU should work normally.
@CXH19940504 I looked at the chatglm2 code today and it does seem there may be a real problem; give me some time to confirm and fix it.
@hiworldwzj Has the chatglm2 multi-GPU problem been fixed? On an 8x 3090 machine the model loads successfully with two GPUs (but errors out at inference time), while loading with 4 or 8 GPUs fails outright.
The launch command I used:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 python -m lightllm.server.api_server --model_dir XXX/chatglm2-6b --tp 8 --max_total_token_num 121060 --max_req_total_len 4096 --tokenizer_mode auto --trust_remote_code
```
@chaizhongming chatglm2 can currently only run on a single GPU. A recent fix (not yet merged) adds support for running on two GPUs; supporting more GPUs needs further adaptation, so please wait for an update. That said, for a model of chatglm2's size, one or two GPUs is already the most cost-effective setup.