TensorRT-LLM
TensorRT-LLM Requests
Hi all, this issue will track the feature requests you've made to TensorRT-LLM & provide a place to see what TRT-LLM is currently working on.
Last update: Jan 14th, 2024
🚀 = in development
Models
Decoder Only
- [ ] 🚀 Zephyr-7B - #157
- [ ] DeciLM-7B - #853
- [x] ChatGLM 3 - #180, #270
- [x] Mistral-7B - #49
- [x] Mixtral-8x7B - #616
Encoder / Encoder-Decoder
- [ ] DeBERTa - #174
- [ ] RoBERTa - #124
- [x] 🚀 BART, mBART - #285, #360
- [x] FLAN-T5 - #251, #285, #310
Multi-Modal
- [x] BLIP2 + T5 - #310, #531
- [x] LLaVA - #641
- [x] Qwen-VL - #728
- [x] Generic Vision Encoder + LLM Support - #641, #310
- [x] BLIP2
- [x] Whisper - #323
Other
- [ ] YaRN - #792
- [ ] Expert Caching - #849
- [x] LoRA - #68
- [x] Mixtral - #616
Features & Optimizations
- [x] Context Chunking - #317
- [x] Speculative Decoding - #169, #224, #226 (implementation done; documentation in progress)
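For readers new to the feature: speculative decoding has a small draft model propose several tokens cheaply, which the target model then verifies, keeping the longest agreeing prefix plus one correction token. A minimal greedy sketch, assuming hypothetical `target`/`draft` next-token callables (not TRT-LLM's actual API):

```python
def speculative_step(target, draft, prefix, k=4):
    """One draft-then-verify step of greedy speculative decoding (sketch).

    `draft` and `target` are hypothetical callables mapping a token
    list to the next greedy token id; real systems verify all k
    proposals in a single batched target-model forward pass.
    """
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Verify phase: accept proposals while the target model agrees.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        expect = target(ctx)
        if expect == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expect)   # replace the first mismatch with target's token
            break
    else:
        accepted.append(target(ctx))  # all k accepted: emit one bonus token
    return accepted
```

The payoff is that every step emits between 1 and k+1 tokens for roughly one target-model pass, which is why draft-model quality matters so much.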
KV Cache
- [x] Reuse KV Cache - #292, #620
- [x] Attention Sinks (StreamingLLM, H2O) - #104
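For context on #104: attention sinks (StreamingLLM) keep the KV entries of the first few tokens alongside a sliding window of the most recent ones, which stabilizes generation on very long streams. A toy eviction sketch, with the cache as a plain Python list (real implementations operate on paged GPU buffers):

```python
def evict_kv(cache, n_sink=4, window=1020):
    """StreamingLLM-style KV eviction (sketch, not TRT-LLM's implementation).

    Keeps the first n_sink "attention sink" entries plus the most
    recent `window` entries; everything in between is dropped.
    """
    if len(cache) <= n_sink + window:
        return cache  # still fits, nothing to evict
    return cache[:n_sink] + cache[-window:]
```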
Quantization
- [ ] StarCoder INT8 SQ - #324
- [x] Qwen INT4 - #345
- [x] INT8 Weight only - #110
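For context on #110: INT8 weight-only quantization stores weights as 8-bit integers with a per-channel scale and dequantizes on the fly inside the matmul, roughly halving weight memory versus FP16. A minimal symmetric per-row sketch (illustrative only, nothing like TRT-LLM's fused kernels):

```python
def quantize_int8(row):
    """Symmetric per-row INT8 quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid a zero scale
    q = [round(w / scale) for w in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP weights; real kernels do this inside the GEMM."""
    return [scale * v for v in q]
```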
Sampling
- [ ] 🚀 Support `frequency_penalty` - #275
- [ ] Logit Manipulators - #241
- [x] Combine `repetition` & `presence` penalties - #274
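For reference, the three penalties touch the logits of previously generated tokens differently: `frequency_penalty` subtracts proportionally to how often a token appeared, `presence_penalty` is a flat subtraction once it has appeared at all, and `repetition_penalty` is multiplicative. A hypothetical pure-Python sketch (OpenAI/HF-style semantics, not TRT-LLM's implementation):

```python
from collections import Counter

def apply_penalties(logits, generated_ids, presence_penalty=0.0,
                    frequency_penalty=0.0, repetition_penalty=1.0):
    """Penalize logits of already-generated tokens (sketch)."""
    logits = list(logits)
    for tok, n in Counter(generated_ids).items():
        # frequency penalty scales with the token's occurrence count
        logits[tok] -= frequency_penalty * n
        # presence penalty is a flat hit for any token seen at least once
        logits[tok] -= presence_penalty
        # repetition penalty divides positive logits, multiplies negative ones
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    return logits
```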
Workflow
Front-ends
- [ ] OpenAI compatible API - #334
- [ ] Flag for end-of-stream - #240
- [ ] Load from Buffer - #144
- [x] Paged KV Cache Utilization Metric - #512
- [x] Log Probabilities - #238
- [x] Return only new tokens - #227
Integrations
- [ ] 🚀 LlamaIndex
- [ ] 🚀 LangChain
- [ ] Mojo - #556
Usage / Installation
- [x] pip install - #790
Platform Support
- [ ] Jetson - #62, #488, #619
- [ ] V100, T4 MHA - #320
Please add CohereAI!!
CohereForAI/c4ai-command-r-plus
Llama 3 would be great (both 8B and 70B): https://github.com/NVIDIA/TensorRT-LLM/issues/1470
Maybe quantized to 8 or even 4 bit.
Currently Llama 3 throws a bunch of errors when converting to TensorRT-LLM.
Any idea about the support for Llama 3?
Phi-3-mini should be amazing! Such a small 3.8B model could run quantized on many GPUs, with as little as 4GB VRAM.
- Paper: https://arxiv.org/abs/2404.14219
- Model weights: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3
+1 for Phi-3
+1 for Command R Plus!
CohereForAI/c4ai-command-r-plus
hello @ncomly-nvidia, I am a student interested in the project! I want to ask if there are any good-first-issue feature requests under Features & Optimizations at the moment? 🤣
+1 for OpenBMB/MiniCPM-V-2
Any news on support for the Jetson platform? Thanks in advance.
Requesting support for Meta's M4T v2 model, similar to how Whisper support is provided.
How is it going for Jetson AGX? It would be nice if everything were compatible before the Jetson Thor launch.
LLaMa 3.2 multimodal vision models anytime soon?