Support for Zephyr 7B model
Hi NVIDIA team,
Please add support for the Zephyr 7B model:
https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha
Thanks for your suggestion. Let me add it to the list of requested models, and we will keep you posted.
Juney
I have heard that the architecture of Zephyr is very similar to Llama's. Does TensorRT-LLM currently not work on Zephyr?
I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.
Hi @TheCodeWrangler, we have not tested Zephyr, so we cannot comment. If you do, definitely let us know!
I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.
TensorRT-LLM is invariant to hyper-parameters, but not to architectural changes. That is, changing the number of layers or heads is fine, but replacing grouped-query attention (GQA) with sliding-window attention (SWA), or LayerNorm with RMSNorm, requires code changes. You can check the arguments of the build scripts for the models you're interested in to see which hyper-parameters are exposed.
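As a concrete illustration (not an official workflow), one quick way to see whether Zephyr differs from Llama only in hyper-parameters is to diff their Hugging Face configs. The Llama checkpoint below is an assumption on my part and is gated, so substitute any Llama variant you can load:

```python
# Sketch: diff the Hugging Face configs of Zephyr and Llama to separate
# hyper-parameter differences (fine for TensorRT-LLM) from architectural
# ones (which need code changes). The Llama model ID is illustrative and
# gated; swap in any Llama checkpoint you have access to.
from transformers import AutoConfig

zephyr = AutoConfig.from_pretrained("HuggingFaceH4/zephyr-7b-alpha").to_dict()
llama = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf").to_dict()

for key in sorted(set(zephyr) | set(llama)):
    if zephyr.get(key) != llama.get(key):
        print(f"{key}: zephyr={zephyr.get(key)!r} llama={llama.get(key)!r}")
```

If fields like num_key_value_heads or sliding_window show up in the diff, those are exactly the GQA/SWA class of change mentioned above; differences in fields like num_hidden_layers or vocab_size are plain hyper-parameters.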
I have heard that the architecture of Zephyr is very similar to Llama's. Does TensorRT-LLM currently not work on Zephyr?
I am hoping to understand what makes a new architecture work or not work with TensorRT-LLM when compiling .engine files.
Have you tested it?
I have tested a trained Zephyr 7B model extensively today, with no success at all. I used an ensemble repository and debugged the inputs and outputs; however, the generated TensorRT engine doesn't seem to respond when fed the input tokens from the preprocessor/tokenizer. I tested this with 2-way tensor parallelism + 2-way pipeline parallelism, with 2-way tensor parallelism alone, and with no parallelism at all, all running on a DGX.
The engine does actually respond when queried via the run.py script, which strikes me as odd. I may have an issue with the actual model setup, so further investigation is required.
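One cheap sanity check for this kind of "works via run.py but not via the ensemble" symptom (a sketch under assumptions, not something from this thread) is to tokenize the same prompt locally with the Zephyr tokenizer and compare the token IDs against what the preprocessor actually sends to the engine; a mismatch there could explain the empty responses:

```python
# Sketch: reproduce the tokenization step locally so the IDs can be compared
# with what the preprocessor emits. The prompt is arbitrary; only the Zephyr
# model ID comes from this thread.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")
prompt = "Why does the engine return empty responses?"
ids = tok(prompt, add_special_tokens=True).input_ids
print("token ids:", ids)
print("round-trip:", tok.decode(ids))
```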
After another round of testing today, the same engine suddenly does seem to work, to the point where I can't replicate the empty response anymore... So I guess there is at least some level of compatibility?