
Support for Zephyr 7B model

Open rishabh279 opened this issue 1 year ago • 4 comments

Hi Nvidia Team,

Please add support for the Zephyr 7B model.

https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha

rishabh279 avatar Oct 27 '23 06:10 rishabh279

Thanks for your suggestion. Let me add it to the list of requested models and we will keep you posted.

Juney

juney-nvidia avatar Oct 28 '23 12:10 juney-nvidia

I have heard that Zephyr's architecture is very similar to Llama's. Does TensorRT-LLM not currently work on Zephyr?

I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.

TheCodeWrangler avatar Nov 30 '23 21:11 TheCodeWrangler

Hi @TheCodeWrangler, we have not tested Zephyr, so we cannot comment. If you do, definitely let us know!

I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.

TensorRT-LLM is invariant to hyperparameters, but not to architectural changes. That is, changing the number of layers or heads is fine, but replacing GQA with SWA, or LayerNorm with RMSNorm, requires code changes. You can check the arguments of the models you're interested in to see which hyperparameters are exposed.
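To illustrate why a normalization swap is an architectural change rather than a hyperparameter: LayerNorm and RMSNorm compute different things, so one cannot be turned into the other by adjusting an exposed argument. A rough NumPy sketch (illustrative only, not TensorRT-LLM code):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # LayerNorm: subtract the mean, divide by the standard deviation,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm: no mean subtraction and no bias; divide by the
    # root-mean-square of the activations, then scale by gamma.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x, np.ones(4), np.zeros(4)))
print(rms_norm(x, np.ones(4)))
```

Because the two kernels differ (mean subtraction, bias term), supporting a model that uses RMSNorm in place of LayerNorm means adding or selecting a different op in the engine-building code, not just changing a config value.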

ncomly-nvidia avatar Dec 05 '23 00:12 ncomly-nvidia

I have heard that Zephyr's architecture is very similar to Llama's. Does TensorRT-LLM not currently work on Zephyr?

I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.

Have you tested it?

teis-e avatar Feb 08 '24 19:02 teis-e

I have tested a trained Zephyr 7B model extensively today, with no success at all. I used an ensemble repository and debugged the inputs and outputs; however, the generated TensorRT engine does not seem to respond when fed the input tokens from the preprocessor/tokenizer. I tested this with 2-way tensor parallelism + 2-way pipeline parallelism, 2-way tensor parallelism alone, and no parallelism at all, all running on a DGX.

The engine does actually respond when queried with the run.py script, which strikes me as odd, so I may have an issue with my actual model setup; further investigation is required.

KatarinaMah avatar Apr 11 '24 15:04 KatarinaMah

After another round of testing today, the same engine suddenly does seem to work, to the point where I can't reproduce the empty response anymore... So I guess there is at least some level of compatibility?

KatarinaMah avatar Apr 12 '24 08:04 KatarinaMah