tensorrtllm_backend
The Triton TensorRT-LLM Backend
### System Info Ryzen 5950x, Ubuntu 22.04, 2 RTX 3090s, main branch ### Who can help? @byshiue @sch ### Information - [X] The official example scripts - [ ] My...
### System Info Linux devserver-ei 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux  nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2023 NVIDIA Corporation...
### System Info CPU - x86_64, Intel(R) Xeon(R) CPU @ 2.20GHz CPU memory - 1.3TB GPUs - Nvidia A100 80GB git commit ID of the TensorRT-LLM backend: e432c6a0cc85f9790365067e7e3175e1b2ce3559 TRT-LLM version:...
### System Info 2 × 4 L40S GPUs, loading Llama-2-70B as a single model (tensorrt_llm), using image nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3 ### Who can help? _No response_ ### Information - [X] The official example scripts -...
### System Info - x86_64 CPU - A100 80GB * 2 - Docker container set up following https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#using-the-tensorrt-llm-backend ### Who can help? @byshiue @schetlur-nv ### Information - [X] The...
Building the Dockerfile fails: the command `curl -o /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/$arch/cuda-keyring_1.0-1_all.deb` cannot find the file, and opening the URL returns a 404. Branch: master. Dockerfile: dockerfile/Dockerfile.triton.trt_llm_backend
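The 404 is consistent with NVIDIA having superseded the `cuda-keyring_1.0-1` package in its repos with a newer keyring release. A minimal sketch of a possible workaround for that Dockerfile step, assuming `cuda-keyring_1.1-1_all.deb` is the currently published filename (verify by browsing the repo URL before relying on it):

```bash
# Assumption: cuda-keyring_1.0-1_all.deb was replaced by cuda-keyring_1.1-1_all.deb
# in NVIDIA's ubuntu2204 repo; check the directory listing to confirm the filename.
arch=x86_64   # match whatever $arch the Dockerfile resolves to
curl -fL -o /tmp/cuda-keyring.deb \
  "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/${arch}/cuda-keyring_1.1-1_all.deb"
dpkg -i /tmp/cuda-keyring.deb
```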
Does this repo support Orion-14B-xx, such as [Orion-14B-LongChat](https://huggingface.co/OrionStarAI/Orion-14B-LongChat)?
### System Info **System** CPU architecture X86_64 (EC2 G5.12xlarge) CPU/Host memory size 192GB, 8GB Swap GPU properties GPU name A10Gx4 GPU memory size 24GBx4 (96GB) **Libraries** TensorRT-LLM main TensorRT-LLM [https://github.com/NVIDIA/TensorRT-LLM/commit/0ab9d17a59c284d2de36889832fe9fc7c8697604](https://github.com/NVIDIA/TensorRT-LLM/commit/0ab9d17a59c284d2de36889832fe9fc7c8697604)...
In the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) build.py there is `parser.add_argument('--max_batch_size', type=int, default=10)`. However, when `client/inflight_batcher_llm_client.py` calls Triton, it sends gRPC requests concurrently, and Triton accepts and returns all of them. How does it...
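One way to observe how concurrent requests interact with the engine's `max_batch_size` is to launch several copies of the client in parallel: with the in-flight batcher, the server packs active requests up to `max_batch_size` per iteration and typically queues the rest rather than rejecting them. A minimal sketch, assuming the client script exposes `--request-output-len` and `--tokenizer-dir` flags (these names are assumptions; check `--help` in your checkout):

```bash
# Hedged sketch: run 16 clients concurrently against one Triton server.
# Flag names are assumptions; verify with:
#   python3 client/inflight_batcher_llm_client.py --help
for i in $(seq 1 16); do
  python3 client/inflight_batcher_llm_client.py \
      --request-output-len 64 \
      --tokenizer-dir /path/to/tokenizer &
done
wait   # requests beyond max_batch_size are typically queued server-side
```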
Build the engine for the Baichuan2-13B-Chat model using the following command (TensorRT-LLM version v0.7.1):
python build.py --model_version v2_13b \
    --model_dir /code/tensorrt_llm/checkpoint/hf_bf16 \
    --dtype float16 \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    ...