nguyenhoangthuan99
I have successfully converted the OpenPose 25-keypoint pose model (version 1.6.0) to TensorRT, but when I run inference in Python the post-processing is very slow. I used post-processing...
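A common fix is to move the heatmap peak extraction, usually the hottest part of OpenPose post-processing, out of per-pixel Python loops into native code. Below is a minimal C++ sketch of that step under stated assumptions: a single row-major H×W confidence map and a caller-chosen threshold, neither taken from the original post.

```cpp
#include <vector>

// One detected body-part candidate: pixel position and confidence score.
struct Peak { int x; int y; float score; };

// Scan a row-major H x W confidence map and keep every interior pixel that
// exceeds `threshold` and is strictly greater than its 4 direct neighbours.
// This non-maximum-suppression pass is what dominates runtime when written
// as per-pixel loops in Python.
std::vector<Peak> find_peaks(const float* map, int h, int w, float threshold) {
    std::vector<Peak> peaks;
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            float v = map[y * w + x];
            if (v > threshold &&
                v > map[y * w + x - 1] && v > map[y * w + x + 1] &&
                v > map[(y - 1) * w + x] && v > map[(y + 1) * w + x]) {
                peaks.push_back({x, y, v});
            }
        }
    }
    return peaks;
}
```

The same logic can also stay in Python if vectorized (e.g. comparing the map against shifted copies of itself), but the native version avoids the interpreter overhead entirely.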
Currently the example servers for cortex.llamacpp and cortex.tensorrt-llm get the following results with an average context length of 400:

- cortex.llamacpp: 850 tokens/s
- cortex.tensorrt-llm: 1450 tokens/s

We need to benchmark...
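For reference, a throughput number like those above is just generated tokens divided by wall-clock seconds. The harness below is a self-contained sketch of that measurement; `run_generation` is a hypothetical stand-in for one request against the engine under test, not an existing cortex API.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for one request against the engine under test;
// here it just sleeps so the example runs standalone.
int run_generation(const char* /*prompt*/, int max_tokens) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return max_tokens;
}

// Time one request and report throughput the same way the numbers above do:
// tokens generated divided by elapsed wall-clock seconds.
double tokens_per_second(const char* prompt, int max_tokens) {
    auto start = std::chrono::steady_clock::now();
    int n_tokens = run_generation(prompt, max_tokens);
    auto end = std::chrono::steady_clock::now();
    return n_tokens / std::chrono::duration<double>(end - start).count();
}

int main() {
    std::printf("throughput: %.1f tokens/s\n", tokens_per_second("hello", 128));
}
```

A real benchmark would average over many requests at the target context length (400 here) rather than timing a single call.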
# Add Multi-GPU Support for LlamaCpp Engine

## Description

We need to implement multi-GPU support for our LlamaCpp wrapper engine to improve performance and allow users to utilize multiple GPUs...
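At the llama.cpp layer this is mostly a matter of passing the right model parameters. The sketch below shows the relevant knobs in the llama.cpp C API; the field and enum names match recent `llama.h` versions but have been renamed across releases, so check the header the engine actually vendors.

```cpp
#include "llama.h"

// Sketch: load a model with its layers distributed across all visible GPUs.
llama_model* load_multi_gpu(const char* model_path) {
    llama_model_params params = llama_model_default_params();
    params.n_gpu_layers = 999;                    // offload every layer to GPU
    params.split_mode   = LLAMA_SPLIT_MODE_LAYER; // spread layers across GPUs
    params.main_gpu     = 0;                      // GPU for small/intermediate tensors
    // params.tensor_split takes per-GPU proportions, e.g. for two GPUs:
    //   static const float split[] = {0.6f, 0.4f};
    //   params.tensor_split = split;
    return llama_load_model_from_file(model_path, params);
}
```

The wrapper engine would expose `split_mode`, `main_gpu`, and `tensor_split` as user-facing configuration rather than hard-coding them as above.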
We need to create unit tests for completed tickets. For now we will use [GTest](https://github.com/google/googletest) to write the unit tests. Unit tests can run locally and will be added to the CI pipeline. When building in debug mode...
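A minimal GTest file wired up the way the ticket describes might look like the sketch below. `Tokenize` is a toy stand-in for real engine code, not an existing API; linking against `gtest_main` supplies `main()`.

```cpp
#include <gtest/gtest.h>
#include <sstream>
#include <string>
#include <vector>

// Toy function standing in for the real engine code under test.
std::vector<std::string> Tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> out;
    for (std::string tok; in >> tok;) out.push_back(tok);
    return out;
}

TEST(TokenizerTest, EmptyInputYieldsNoTokens) {
    EXPECT_TRUE(Tokenize("").empty());
}

TEST(TokenizerTest, SplitsOnWhitespace) {
    EXPECT_EQ(Tokenize("hello world").size(), 2u);
}
```

In CI the binary can run under `ctest` (CMake's `gtest_discover_tests`) or be invoked directly; either way a non-zero exit code fails the pipeline.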
### Feature request

The current codebase only supports bf16/fp16 training, while we typically apply quantization (int8, int4, fp8, fp4) during model serving to reduce VRAM usage while still maintaining accuracy...
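To make the VRAM argument concrete, here is a minimal per-tensor absmax int8 quantize/dequantize sketch, one of the simpler schemes among the int8/int4/fp8/fp4 options the request lists. Storing one byte per weight plus a shared float scale roughly halves weight memory versus fp16; int4 and the fp8/fp4 formats push the same idea further at some accuracy cost.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Per-tensor absmax quantization: scale = max|x| / 127, q = round(x / scale).
// Each weight shrinks from 2 bytes (fp16) to 1 byte (int8).
struct QuantizedTensor {
    std::vector<int8_t> q;
    float scale;
};

QuantizedTensor quantize_int8(const std::vector<float>& x) {
    float absmax = 0.0f;
    for (float v : x) absmax = std::max(absmax, std::fabs(v));
    QuantizedTensor out;
    out.scale = absmax > 0.0f ? absmax / 127.0f : 1.0f;
    out.q.reserve(x.size());
    for (float v : x)
        out.q.push_back(static_cast<int8_t>(std::lround(v / out.scale)));
    return out;
}

// Recover an approximate float value for element i.
float dequantize(const QuantizedTensor& t, std::size_t i) {
    return t.q[i] * t.scale;
}
```

Supporting this during training (rather than only at serving time) additionally requires keeping gradients and optimizer state in higher precision, which is the harder part of the request.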