zhaotyer
**Description** When I use tritonserver 22.02 for dynamic batch inference, a core dump occasionally occurs on the first inference after the model has loaded successfully. **Triton Information** nvcr.io/nvidia/tritonserver:22.02-py3 Are...
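A minimal repro harness for that first-request-after-load case might look like the sketch below, assuming the standard tritonclient HTTP API; the model name, tensor names, shape, and datatype are placeholders to adapt to your config.pbtxt:

```python
# Sketch: fire one inference immediately after the model reports READY.
# Model/tensor names, shape, and dtype are assumptions, not from the issue.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

MODEL = "my_model"  # hypothetical model name
assert client.is_model_ready(MODEL), "model not loaded yet"

inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")  # assumed input spec
inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

result = client.infer(MODEL, inputs=[inp])
print(result.as_numpy("OUTPUT0"))  # assumed output tensor name
```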
I tried to integrate MII into tritonserver but encountered some problems. Below is part of my code:

```python
class TritonPythonModel:
    def initialize(self, args):
        import mii
        from transformers import AutoTokenizer

        tensor_parallel_size...
```
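For context, a fuller initialize/execute pair along those lines might look like the sketch below. It assumes the mii.pipeline API from deepspeed-mii 0.2.x and Triton's Python-backend utilities; the tensor names PROMPT/COMPLETION and the model path are invented for illustration:

```python
# Sketch of an MII-backed Triton Python model (deepspeed-mii 0.2.x pipeline API).
# Tensor names PROMPT/COMPLETION and the model path are illustrative assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        import mii

        # mii.pipeline loads the model in-process.
        self.pipe = mii.pipeline("meta-llama/Llama-2-7b-chat-hf")

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt_tensor = pb_utils.get_input_tensor_by_name(request, "PROMPT")
            prompts = [p.decode("utf-8") for p in prompt_tensor.as_numpy().flatten()]

            # Each returned item carries the generated text.
            outputs = self.pipe(prompts, max_new_tokens=256)
            texts = np.array(
                [o.generated_text.encode("utf-8") for o in outputs], dtype=object
            )

            out = pb_utils.Tensor("COMPLETION", texts)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```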
Test environment: 1*A100*80G | vllm==0.2.6+cu118 | deepspeed-mii==0.2.0 | Llama-2-7b-chat-hf

Benchmark script: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii

Test result: [test-result figure omitted]

Why is the performance lower than vLLM?
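For reference, the persistent-deployment path that benchmark exercises can be driven roughly like this with deepspeed-mii 0.2.x; the prompt and generation parameters are placeholders, not the benchmark's settings:

```python
# Rough serve-and-query round trip with deepspeed-mii 0.2.x.
# Prompt and generation parameters are placeholders.
import mii

client = mii.serve("meta-llama/Llama-2-7b-chat-hf", tensor_parallel=1)
responses = client.generate(["What is the capital of France?"], max_new_tokens=64)
print(responses[0].generated_text)

client.terminate_server()
```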
### Your current environment

```text
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build...
```
### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- ...
### Proposal to improve performance

_No response_

### Report of performance regression

_No response_

### Misc discussion on performance

vLLM command:

```
python3 -m vllm.entrypoints.openai.api_server --model ${model_path} --port 8108 --max-model-len 6500...
```
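Assuming that server comes up on port 8108, the OpenAI-compatible completions endpoint it exposes can be queried like this; the model name, prompt, and sampling parameters are placeholders:

```python
# Query the OpenAI-compatible completions endpoint exposed by the command above.
# Model name, prompt, and sampling parameters are placeholders.
import requests

resp = requests.post(
    "http://localhost:8108/v1/completions",
    json={
        "model": "Llama-2-7b-chat-hf",  # must match the served model name
        "prompt": "Say hello in one sentence.",
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```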
### Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build...
```