DefTruth

Results: 256 comments by DefTruth

@yucornetto hi~ would you like to review this PR?

Maybe related to https://github.com/vllm-project/vllm/pull/5207

@youkaichao Close; it seems the latest vLLM (up to #5410) has fixed this problem. (TPOT 45ms v0.4.2 -> 39ms v0.5, eager mode)

```bash
[I][2024-06-11 16:31:36][ 1/20][ 1/20 5%] session:0 turn:0 req:0...
```

> In tensorrt_llm_backend, when we launch several servers by MPI with world_size > 1, only rank 0 (the main process) will receive/return requests. Other ranks will skip this step and...
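The rank-0 dispatch pattern described in the quote can be sketched roughly as follows (a hypothetical pure-Python simulation with threads standing in for MPI ranks; `WORLD_SIZE`, `worker`, and the queues are illustrative names, not tensorrt_llm_backend APIs):

```python
import queue
import threading

WORLD_SIZE = 3
# One queue per rank; stands in for an MPI broadcast channel.
bcast_queues = [queue.Queue() for _ in range(WORLD_SIZE)]
results = {}

def worker(rank):
    if rank == 0:
        # Only the main process (rank 0) receives the client request...
        request = {"prompt": "hello"}
        # ...then broadcasts it to every rank, itself included.
        for q in bcast_queues:
            q.put(request)
    # All ranks (0 included) pick up the broadcast request and run their
    # shard of the model forward; a string length stands in for that here.
    req = bcast_queues[rank].get()
    results[rank] = len(req["prompt"])

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))
```

Only rank 0 touches the request socket; all other ranks block on the broadcast, which is why they "skip this step" in the quote.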

Same problem here; it occasionally hangs at this point:

```bash
File "/root/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
```

Tracing shows that accelerate's send_to_device function never returns.

@lvhan028 This feels like a serious bug. With VL models I frequently hit this intermittent hang: no error is raised, the call simply hangs and never returns. It looks like a collective-communication deadlock between accelerate and lmdeploy, because requests are issued asynchronously and the ViT inference actually overlaps with the LLM inference in a pipeline. Traced log:

```bash
-- Stack for thread 23201439544896 ---
File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args,...
```
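For diagnosing this kind of silent hang, dumping every thread's Python stack from inside the process is often enough to locate the blocked call; a minimal sketch using the standard-library `faulthandler` (the `worker` function here is a hypothetical stand-in for the blocked `send_to_device` call):

```python
import faulthandler
import tempfile
import threading
import time

def worker():
    # Stand-in for a call that blocks indefinitely (e.g. send_to_device).
    time.sleep(5)

t = threading.Thread(target=worker, daemon=True)
t.start()
time.sleep(0.1)  # let the thread reach its blocking call

# Dump the stack of every live thread to a file and inspect it.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

print("worker" in dump)  # the blocked thread's frames are visible in the dump
```

`faulthandler.dump_traceback_later(timeout, ...)` can also be registered up front as a watchdog, so a dump is produced automatically whenever the process stalls past the timeout.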

Could the vision part use gloo (CPU) communication instead? That way it would not conflict with the NCCL backend and would not hang.
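The gloo idea can be sketched with `torch.distributed` (a single-process sketch with `world_size=1` so it runs standalone; in a real deployment each TP rank would join with its own rank and a larger world size, typically via `dist.new_group(backend="gloo")` alongside the default NCCL group):

```python
import os
import torch
import torch.distributed as dist

# Hypothetical single-process setup; real code would get these from the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29531")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Vision features stay on CPU, so this collective never touches NCCL and
# cannot interleave with (or deadlock against) the LLM's GPU collectives.
vision_feats = torch.ones(4)
dist.all_reduce(vision_feats)  # identity with world_size=1
print(vision_feats.tolist())

dist.destroy_process_group()
```

Keeping the vision collectives on a separate CPU backend means the two pipelines no longer share a NCCL communicator, which removes the ordering constraint that can deadlock overlapping ViT/LLM inference.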

Yes, that is currently the only viable implementation. But it still has a problem when passing input embeddings: once you enable a penalty such as repetition_penalty, transformers only considers the output ids when penalizing, whereas trtllm runs inference from input ids and applies the penalty over input ids + output ids together. As a result, with the penalty enabled on both sides, the outputs cannot be aligned with trtllm's.
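The divergence is easy to see with a small numeric sketch (plain Python; `apply_repetition_penalty` mirrors the usual convention of dividing positive logits by the penalty and multiplying negative ones, but it is an illustrative helper, not either library's actual API):

```python
def apply_repetition_penalty(logits, penalized_ids, penalty=1.2):
    # Usual convention: positive logits are divided by the penalty,
    # negative logits are multiplied by it.
    out = list(logits)
    for i in set(penalized_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.0, 1.0, 0.5, -1.0]
input_ids = [0, 1]   # tokens from the prompt
output_ids = [2]     # tokens generated so far

# transformers-style: only previously generated ids are penalized.
hf_style = apply_repetition_penalty(logits, output_ids)
# trtllm-style: prompt ids and generated ids are penalized together.
trt_style = apply_repetition_penalty(logits, input_ids + output_ids)

print(hf_style == trt_style)  # the two conventions yield different logits
```

Token 2 is penalized identically under both conventions, but tokens 0 and 1 (the prompt ids) are only penalized on the trtllm side, so the sampled distributions, and hence the outputs, drift apart.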