
TensorRT-LLM Requests

Open ncomly-nvidia opened this issue 1 year ago • 12 comments

Hi all, this issue will track the feature requests you've made to TensorRT-LLM & provide a place to see what TRT-LLM is currently working on.

Last update: Jan 14th, 2024 🚀 = in development

Models

Decoder Only

  • [ ] 🚀 Zephyr-7B - #157
  • [ ] DeciLM-7B - #853
  • [x] ChatGLM 3 - #180, #270
  • [x] Mistral-7B - #49
  • [x] Mixtral-8x7B - #616

Encoder / Encoder-Decoder

  • [ ] DeBERTa - #174
  • [ ] RoBERTa - #124
  • [x] 🚀 BART, mBART - #285, #360
  • [x] FLAN-T5 - #251, #285, #310

Multi-Modal

  • [x] BLIP2 + T5 - #310, #531
  • [x] LLaVA - #641
  • [x] Qwen-VL - #728
  • [x] Generic Vision Encoder + LLM Support - #641, #310
  • [x] BLIP2
  • [x] Whisper - #323

Other

  • [ ] YaRN - #792
  • [ ] Expert Caching - #849
  • [x] LoRA - #68
  • [x] Mixtral - #616

Features & Optimizations

  • [x] Context Chunking - #317
  • [x] Speculative Decoding - #169, #224, #226 (implementation done; documentation in progress)
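For readers unfamiliar with the technique tracked above: speculative decoding pairs a cheap draft model with the large target model — the draft proposes several tokens, and the target verifies them in a single pass, keeping the longest agreeing prefix. A minimal greedy-acceptance sketch (toy models as plain functions; this is illustrative, not the TRT-LLM API):

```python
def speculative_step(target, draft, prefix, k=4):
    """One round of greedy speculative decoding.

    `target` and `draft` are stand-in models: callables mapping a token
    sequence to the next greedy token. The draft proposes k tokens; the
    target accepts them left to right until the first mismatch, where it
    substitutes its own token. If every drafted token is accepted, the
    target emits one extra "bonus" token, so a perfect draft yields k+1
    new tokens per target pass.
    """
    # Draft proposes k tokens autoregressively.
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft(proposal))
    drafted = proposal[len(prefix):]

    # Target verifies the drafted tokens.
    accepted = list(prefix)
    for tok in drafted:
        expected = target(accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)  # first mismatch: keep target's token
            break
    else:
        accepted.append(target(accepted))  # all accepted: bonus token
    return accepted
```

With a perfect draft, one call yields k+1 new tokens for a single "verification" sweep of the target; with a bad draft, it degrades gracefully to one new token per step.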

KV Cache

  • [x] Reuse KV Cache - #292, #620
  • [x] Attention Sinks (StreamingLLM, H2O) - #104
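The StreamingLLM-style attention sinks tracked above bound KV-cache growth by permanently keeping the first few "sink" tokens plus a sliding window of recent tokens, evicting everything in between. A sketch of the retention policy (function name and defaults are illustrative, not the TRT-LLM implementation):

```python
def streaming_kv_positions(seq_len, n_sinks=4, window=8):
    """Return the token positions retained in the KV cache under a
    StreamingLLM-style policy: the first `n_sinks` attention-sink tokens
    are always kept, plus the most recent `window` tokens. Positions in
    between are evicted, so cache size is capped at n_sinks + window."""
    if seq_len <= n_sinks + window:
        return list(range(seq_len))  # nothing to evict yet
    return list(range(n_sinks)) + list(range(seq_len - window, seq_len))
```

The key observation behind the method is that softmax attention dumps excess probability mass on the earliest tokens, so evicting them (as a plain sliding window would) destabilizes generation.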

Quantization

  • [ ] StarCoder INT8 SQ - #324
  • [x] Qwen INT4 - #345
  • [x] INT8 Weight only - #110
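Weight-only INT8 (as in #110) stores each weight row as INT8 values plus one floating-point scale, while activations stay in higher precision. A minimal symmetric per-row sketch in pure Python (real kernels do this fused on-GPU; this only illustrates the numerics):

```python
def quantize_int8(weights):
    """Symmetric per-row INT8 quantization: one FP scale per row,
    values rounded and clamped into [-127, 127]."""
    qrows, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid /0 on all-zero rows
        scales.append(scale)
        qrows.append([max(-127, min(127, round(w / scale))) for w in row])
    return qrows, scales

def dequantize(qrows, scales):
    """Recover approximate FP weights: q * scale, row by row."""
    return [[q * s for q in row] for row, s in zip(qrows, scales)]
```

Per-row (per-output-channel) scaling keeps the quantization error bounded by half a scale step per element, which is why weight-only INT8 usually costs little accuracy while halving weight memory versus FP16.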

Sampling

  • [ ] 🚀 Support frequency_penalty - #275
  • [ ] Logit Manipulators - #241
  • [x] Combine repetition & presence penalties - #274
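For context, the frequency/presence semantics requested in #275 follow the common OpenAI-style convention: the presence penalty subtracts a flat amount from the logit of any token that has already appeared, while the frequency penalty subtracts proportionally to how many times it has appeared. A sketch (hypothetical helper, not the TRT-LLM sampler):

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust next-token logits based on previously generated token ids.

    For each token id seen n >= 1 times in `generated`:
        logit -= frequency_penalty * n + presence_penalty
    Unseen tokens are left untouched.
    """
    counts = Counter(generated)
    out = list(logits)
    for tok, n in counts.items():
        out[tok] -= frequency_penalty * n + presence_penalty
    return out
```

This differs from a multiplicative repetition penalty, which scales logits by a factor; the two interact, which is what the "combine repetition & presence penalties" item (#274) addressed.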

Workflow

Front-ends

  • [ ] OpenAI compatible API - #334
  • [ ] Flag for end-of-stream - #240
  • [ ] Load from Buffer - #144
  • [x] Paged KV Cache Utilization Metric - #512
  • [x] Log Probabilities - #238
  • [x] Return only new tokens - #227

Integrations

  • [ ] 🚀 LlamaIndex
  • [ ] 🚀 LangChain
  • [ ] Mojo - #556

Usage / Installation

  • [x] pip install - #790

Platform Support

  • [ ] Jetson - #62, #488, #619
  • [ ] V100, T4 MHA - #320

ncomly-nvidia avatar Dec 11 '23 19:12 ncomly-nvidia

Please add CohereAI!!

CohereForAI/c4ai-command-r-plus

teis-e avatar Apr 04 '24 18:04 teis-e

Llama 3 would be great (both 8B and 70B): https://github.com/NVIDIA/TensorRT-LLM/issues/1470

Maybe quantized to 8 or even 4 bit.

EwoutH avatar Apr 22 '24 09:04 EwoutH

Currently, Llama 3 throws a bunch of errors when converting to TensorRT-LLM.

Any idea when Llama 3 will be supported?

StephennFernandes avatar Apr 22 '24 22:04 StephennFernandes

Phi-3-mini should be amazing! Such a small 3.8B model could run quantized on many GPUs, with as little as 4GB VRAM.

  • Paper: https://arxiv.org/abs/2404.14219
  • Model weights: https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3

EwoutH avatar Apr 23 '24 15:04 EwoutH

+1 for Phi-3

oscarbg avatar May 04 '24 14:05 oscarbg

+1 for Command R Plus!

CohereForAI/c4ai-command-r-plus

user-0a avatar May 18 '24 05:05 user-0a

Hello @ncomly-nvidia, I am a student interested in the project! Are there any good-first-issue feature requests under Features & Optimizations at the moment? 🤣

khan-yin avatar Jun 25 '24 16:06 khan-yin

+1 for OpenBMB/MiniCPM-V-2

chenpinganan avatar Jul 02 '24 11:07 chenpinganan

Any news on support for jetson platform? Thanks in advance.

FenardH avatar Aug 05 '24 07:08 FenardH

Requesting support for Meta's SeamlessM4T v2 model, similar to the existing Whisper support.

How is it going for Jetson AGX? It would be nice if everything were compatible before the Jetson Thor launch.

johnnynunez avatar Sep 25 '24 06:09 johnnynunez

LLaMa 3.2 multimodal vision models anytime soon?

ampdot-io avatar Sep 28 '24 04:09 ampdot-io