
Support for Zephyr 7B model

Open rishabh279 opened this issue 1 year ago • 4 comments

Hi Nvidia Team,

Please add support for the Zephyr 7B model.

https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha

rishabh279 avatar Oct 27 '23 06:10 rishabh279

Thanks for your suggestion. Let me add it to the list of requested models and we will keep you posted.

Juney

juney-nvidia avatar Oct 28 '23 12:10 juney-nvidia

I have heard that Zephyr's architecture is very similar to Llama's. Does TensorRT-LLM not currently work on Zephyr?

I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.

TheCodeWrangler avatar Nov 30 '23 21:11 TheCodeWrangler

Hi @TheCodeWrangler, we have not tested Zephyr, so we cannot comment. If you do, definitely let us know!

I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.

TensorRT-LLM is invariant to hyperparameters, but not to architectural changes. That is, changing the number of layers or heads is fine, but replacing GQA with SWA, or LayerNorm with RMSNorm, requires code changes. You can check the arguments of the models you're interested in to see which hyperparameters are exposed.
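To illustrate why a normalization swap is an architectural change rather than a hyperparameter: LayerNorm and RMSNorm compute different things, so one cannot be turned into the other by adjusting an exposed argument. A rough NumPy sketch (illustrative only, not TensorRT-LLM code):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # LayerNorm: subtract the mean, divide by the standard deviation,
    # then apply a learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm: no mean subtraction and no bias; divide by the
    # root-mean-square of the activations, then scale by gamma.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x, np.ones(4), np.zeros(4)))
print(rms_norm(x, np.ones(4)))
```

Because the two kernels differ (mean subtraction, bias term), supporting a model that uses RMSNorm in place of LayerNorm means adding or selecting a different op in the engine-building code, not just changing a config value.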

ncomly-nvidia avatar Dec 05 '23 00:12 ncomly-nvidia

I have heard that Zephyr's architecture is very similar to Llama's. Does TensorRT-LLM not currently work on Zephyr?

I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.

Have you tested it?

teis-e avatar Feb 08 '24 19:02 teis-e

I have tested a trained Zephyr 7B model extensively today, with no success at all. I used an ensemble repository and debugged the inputs and outputs; however, the generated TensorRT engine does not seem to respond when fed the input tokens from the preprocessor/tokenizer. I tested this with 2-way tensor parallelism + 2-way pipeline parallelism, 2-way tensor parallelism alone, and no parallelism at all, all running on a DGX.

The engine does actually respond when queried with the run.py script, which strikes me as odd, so I may have an issue with my actual model setup; further investigation is required.

KatarinaMah avatar Apr 11 '24 15:04 KatarinaMah

After another round of testing today, the same engine suddenly does seem to work, to the point where I can't reproduce the empty response anymore... So I guess there is at least some level of compatibility?

KatarinaMah avatar Apr 12 '24 08:04 KatarinaMah