Error at inference time when loading the model on multiple GPUs
Hello, deployment and inference work fine on a single A800, but multi-GPU inference fails with:

```
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /lightllm/lightllm/server/router/manager.py:88> exception=EOFError(ConnectionResetError(104, 'Connection reset by peer'))>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 268, in read
    buf = self.sock.recv(min(self.MAX_IO_CHUNK, count))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 227, in prefill_batch
    return await ans
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 187, in func
    await asyncio.to_thread(ans.wait)
  File "/opt/conda/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 277, in read
    raise EOFError(ex)
EOFError: [Errno 104] Connection reset by peer
```

What could be causing this?
Looks like something went wrong with the socket communication.
I'm also hitting an error with multi-GPU inference: chatglm2 fails, while llama2 seems fine.
```
/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:30: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
  torch.addmm(layer_weight.k_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.k_weight_, beta=1.0, alpha=1.0,
/lightllm/lightllm/models/chatglm2/layer_infer/transformer_layer_infer.py:33: UserWarning: An output with one or more elements was resized since it had shape [6, 128], which does not match the required output shape [6, 256]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/aten/src/ATen/native/Resize.cpp:26.)
  torch.addmm(layer_weight.v_bias_, input_emb.view(-1, self.embed_dim_), layer_weight.v_weight_, beta=1.0, alpha=1.0,
Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /lightllm/lightllm/server/router/manager.py:88> exception=

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 359, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/opt/conda/lib/python3.9/site-packages/rpyc/core/protocol.py", line 837, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 96, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 147, in forward
    logits = self.model.forward(**kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lightllm/lightllm/common/basemodel/basemodel.py", line 125, in forward
    return self._prefill(batch_size, total_token_num, max_len_in_batch, input_ids, b_loc, b_start_loc, b_seq_len)
  File "/lightllm/lightllm/common/basemodel/basemodel.py", line 149, in _prefill
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/lightllm/lightllm/common/basemodel/basemodel.py", line 189, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 129, in context_forward
    self._context_attention(input_embdings,
  File "/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/lightllm/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 83, in _context_attention
    self._post_cache_kv(cache_k, cache_v, infer_state, layer_weight)
  File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 55, in _post_cache_kv
    self._copy_kv_to_mem_cache(cache_k, cache_v, infer_state.prefill_mem_index, mem_manager)
  File "/lightllm/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 94, in copy_kv_to_mem_cache
    destindex_copy_kv(key_buffer, mem_index, mem_manager.key_buffer[self.layer_num])
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lightllm/lightllm/common/basemodel/triton_kernel/destindex_copy_kv.py", line 36, in destindex_copy_kv
    assert K.shape[1] == Out.shape[1] and K.shape[2] == Out.shape[2]
AssertionError
Traceback (most recent call last):
File "/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
await self._step()
File "/lightllm/lightllm/server/router/manager.py", line 112, in _step
await self._prefill_batch(self.running_batch)
File "/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
ans = await asyncio.gather(*rets)
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 227, in prefill_batch
return await ans
File "/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 189, in func
return ans.value
File "/opt/conda/lib/python3.9/site-packages/rpyc/core/async.py", line 108, in value
raise self._obj
_get_exception_class.<locals>.Derived:
```
@UncleFB The dimensions of the small chatglm2 model are a bit unusual, and multi-GPU is not supported for it yet. Besides, for a model of this size, single-GPU performance is already quite good.
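For anyone wondering what the AssertionError above means: `destindex_copy_kv` checks that the per-rank K/V tensor produced by a layer matches the head count and head dim of the preallocated KV-cache buffer. The snippet below is purely illustrative (the shapes and the tp split are hypothetical, not chatglm2's real configuration) of how a tensor-parallel split that the model's KV-head layout doesn't support ends up failing exactly that check.

```python
# Purely illustrative: how a tensor-parallel KV split that doesn't match the
# preallocated KV-cache buffer trips the check in destindex_copy_kv.
# All shapes below are hypothetical, not chatglm2's real configuration.
import torch

tp = 2                       # tensor-parallel world size
num_kv_heads = 2             # MQA-style models have very few KV heads
head_dim = 128
tokens = 6

# KV-cache buffer allocated as if KV heads were not split across GPUs
cache = torch.zeros(1024, num_kv_heads, head_dim)

# K actually produced on one rank after naively splitting the KV projection by tp
k = torch.zeros(tokens, num_kv_heads // tp, head_dim)

# The kernel asserts K.shape[1] == Out.shape[1] and K.shape[2] == Out.shape[2];
# here the head-count dimension disagrees (1 vs 2), so that assert would fire.
print(k.shape[1] == cache.shape[1] and k.shape[2] == cache.shape[2])  # -> False
```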
@llehtahw What do you make of this error?
It looks like the rpyc worker process crashed during prefill; either the exception was swallowed, or it was a segfault.
This happened once while running llama2 70b on 2 A800s for inference.

```
09-15 17:48:28: Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /app/lightllm-main/lightllm/server/router/manager.py:88> exception=EOFError('connection closed by peer')>
Traceback (most recent call last):
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 134, in _step
    await self._decode_batch(self.running_batch)
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 162, in _decode_batch
    ans = await asyncio.gather(*rets)
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 225, in decode_batch
    return await ans
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 178, in func
    await asyncio.to_thread(ans.wait)
  File "/root/miniconda3/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/root/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/async.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
EOFError: connection closed by peer
```
@ChristineSeven It looks like the inference backend died. A wild guess: the parameters may be misconfigured and the inference process got killed by running out of GPU memory. Of course, if this is a shared machine, it's even more likely that someone else killed your inference process.
Launch arguments: `--tp 2 --max_total_token_num 12000 --max_req_input_len 3000 --max_req_total_len 8192 --max_new_tokens 4096 --top_k 30 --top_p 0.85 --temperature 0.5 --do_sample True`. Given the parallelism, max_total_token_num could even be set larger. Based on this configuration, could a misconfigured parameter really be what crashed the inference process? This is the first time I've seen this problem. @hiworldwzj
@ChristineSeven With this configuration there shouldn't be a GPU-memory problem; something else is probably causing it. Could someone have done something on the machine in the background? In my experience, shared machines are very prone to this kind of collateral damage.
We have a scheduling system: GPUs that have been allocated are not assigned to anyone else, and everyone submits jobs through it. Unless one of the few people with direct login access did something on the machine, in theory there is no sharing problem.
@ChristineSeven OK, understood. There may be some other issue, but we'd need a way to reproduce it in order to track it down. Also, for long-running deployments I recommend triton 2.1.0; triton 2.0.0 has a memory-leak bug that can lead to crashes.
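For reference, a quick way to confirm which triton the serving environment actually picks up (a minimal sketch; it assumes the installed distribution is named `triton`, adjust the pin to your deployment):

```python
# Minimal sanity check of the triton version seen by the inference workers.
# Assumes the package is installed under the distribution name "triton".
from importlib.metadata import version

triton_version = version("triton")
print(f"triton {triton_version}")
if triton_version.startswith("2.0."):
    print("Warning: triton 2.0.x has the memory-leak issue mentioned above; "
          "consider upgrading to 2.1.0 for long-running deployments.")
```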
@hiworldwzj That could well be it. I have noticed GPU memory usage trending upward while the server runs.
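In case it helps others confirm the same trend, here is a rough monitoring sketch (illustrative only; it assumes the `pynvml` bindings, e.g. from the `nvidia-ml-py` package, are installed and just logs device-level memory over time):

```python
# Illustrative only: periodically log per-GPU memory usage to spot a leak trend.
# Assumes the pynvml bindings are available (e.g. pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        used = [pynvml.nvmlDeviceGetMemoryInfo(h).used // (1024 ** 2) for h in handles]
        print(time.strftime("%H:%M:%S"), "used MiB per GPU:", used)
        time.sleep(60)  # sample once a minute; steadily rising values suggest a leak
finally:
    pynvml.nvmlShutdown()
```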
> (quoting the original report: single-A800 inference works, multi-GPU fails with `EOFError: [Errno 104] Connection reset by peer`)
Adding the `--shm-size` parameter when starting docker fixed it.
> (quoting the report above together with the reply that adding `--shm-size` when starting docker fixed it)
Right, NCCL multi-GPU communication does need a fairly large shm-size when the container is started. Thanks for the correction; it's been long enough that I'd forgotten about this environment constraint.
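As a quick sanity check before launching multi-GPU inference inside a container, something like the sketch below can verify that /dev/shm is actually large enough (rough illustration only; the 16 GiB threshold is an arbitrary placeholder, size it to your workload):

```python
# Rough sketch: warn if the container's /dev/shm looks too small for NCCL.
# The 16 GiB threshold below is only an illustrative placeholder.
import shutil

SHM_PATH = "/dev/shm"
MIN_BYTES = 16 * 1024 ** 3

total, used, free = shutil.disk_usage(SHM_PATH)
print(f"/dev/shm total: {total / 1024 ** 3:.1f} GiB, free: {free / 1024 ** 3:.1f} GiB")
if total < MIN_BYTES:
    print("Warning: /dev/shm is small; start the container with a larger --shm-size "
          "(e.g. docker run --shm-size=16g ...), otherwise NCCL's shared-memory "
          "transport may fail.")
```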
@hiworldwzj @wx971025 It doesn't look like a shm-size problem. I launched llama 70b with a 3T shm-size and this still happens.
```
10-12 22:27:39: Task exception was never retrieved
future: <Task finished name='Task-6' coro=<RouterManager.loop_for_fwd() done, defined at /app/lightllm-main/lightllm/server/router/manager.py:88> exception=EOFError('connection closed by peer')>
Traceback (most recent call last):
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 134, in _step
    await self._decode_batch(self.running_batch)
  File "/app/lightllm-main/lightllm/server/router/manager.py", line 162, in _decode_batch
    ans = await asyncio.gather(*rets)
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 225, in decode_batch
    return await ans
  File "/app/lightllm-main/lightllm/server/router/model_infer/model_rpc.py", line 178, in func
    await asyncio.to_thread(ans.wait)
  File "/root/miniconda3/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/root/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/async.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/root/miniconda3/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
```
@ChristineSeven Sorry, in my case it really was caused by the default shm-size being too small. I'd suggest running `free` to check memory usage.
@wx971025 Right, and the same symptom can have different causes. At the moment, with two GPUs, the process on one GPU has died while the other one looks roughly like this:

```
|   7  NVIDIA A800-SXM...  On   | 00000000:D3:00.0 Off |                    0 |
| N/A   31C    P0    84W / 400W |  78522MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
```
Of the shm-size, roughly 258 GB is in use.
Looking at the source, it seems this failure point was anticipated; there's a comment `# raise if exception` here: https://github.com/ModelTC/lightllm/blob/main/lightllm/server/router/model_infer/model_rpc.py#L204 @llehtahw @hiworldwzj In what scenario did you originally run into this problem?
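For context, my understanding of the pattern around that comment (a simplified sketch of generic rpyc usage, not the actual lightllm code; `remote_prefill` and the connection setup are made up): the router calls the worker through an rpyc async proxy, waits for the AsyncResult in a thread, and then touching `.value` re-raises whatever the worker raised. If the worker process itself dies, the wait instead surfaces as an EOFError / connection reset like the ones in this thread.

```python
# Simplified sketch of the rpyc async call pattern (not the lightllm source).
# `remote_prefill` is a hypothetical exposed method on the worker service.
import asyncio
import rpyc


async def call_remote_prefill(conn: rpyc.Connection, batch_id: int):
    async_prefill = rpyc.async_(conn.root.remote_prefill)
    ans = async_prefill(batch_id)
    # Block in a worker thread until the remote call finishes; if the remote
    # process crashes, this is where EOFError("connection closed by peer") or
    # ConnectionResetError shows up.
    await asyncio.to_thread(ans.wait)
    # "raise if exception": accessing .value re-raises the remote exception
    # (the "Remote Traceback" seen above) if the call failed on the worker.
    return ans.value
```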
@hiworldwzj Any suggestions for how to fix it?
> (quoting the earlier chatglm2 multi-GPU report and its `destindex_copy_kv` AssertionError traceback)
Has anyone managed to solve this problem?
@CXH19940504 Because chatglm2's architecture is a bit unusual, multi-GPU is not supported for it yet; single GPU should work normally.
@CXH19940504 I looked at the chatglm2 code today and it does seem there may be a real problem; give me some time to confirm and fix it.
@hiworldwzj Has the chatglm2 multi-GPU problem been fixed? On an 8x 3090 machine the model loads successfully with two GPUs (but errors out at inference time), while loading with 4 or 8 GPUs fails outright.
The launch command I used:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 python -m lightllm.server.api_server --model_dir XXX/chatglm2-6b --tp 8 --max_total_token_num 121060 --max_req_total_len 4096 --tokenizer_mode auto --trust_remote_code
```
@chaizhongming chatglm2 can currently only run on a single GPU. A recent fix (not yet merged) adds support for running on two GPUs; supporting more GPUs needs further adaptation, so please wait for an update. That said, for a model of chatglm2's size, one or two GPUs is already the most cost-effective setup.