Nathan Price
I have heard that the architecture of Zephyr is very similar to Llama. Does TensorRT-LLM not currently work on Zephyr? I am hoping to understand what makes a new arch....
I am experiencing similar issues. I am using Llama 3 8B with LoRA weights. I get significantly worse results when making calls concurrently than I do when running one at a...
I am actually hoping to understand how to perform warmup within the triton-inference-server framework with the LoRA weights https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup It would be nice for my client code to not...
I have been able to get warmup to load within triton-inference-server and initialize my weights, but due to the degraded performance of my model outputs I am suspicious I have...
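For context, what I am picturing is roughly a model_warmup stanza in the tensorrt_llm model's config.pbtxt, with the LoRA tensors read from raw binary files dropped into the model directory's warmup/ subdirectory. This is only a sketch of the idea; the tensor names, dims, and dtypes are placeholders for my setup, not a verified config:

```
model_warmup [
  {
    name: "lora_adapter_warmup"
    batch_size: 1
    inputs {
      key: "input_ids"
      value { data_type: TYPE_INT32 dims: [ 8 ] input_data_file: "input_ids" }
    }
    inputs {
      key: "lora_task_id"
      value { data_type: TYPE_UINT64 dims: [ 1 ] input_data_file: "lora_task_id" }
    }
    inputs {
      key: "lora_weights"
      value { data_type: TYPE_FP16 dims: [ 224, 589824 ] input_data_file: "lora_weights" }
    }
    inputs {
      key: "lora_config"
      value { data_type: TYPE_INT32 dims: [ 224, 3 ] input_data_file: "lora_config" }
    }
    # remaining required tensors (input_lengths, request_output_len, ...) omitted
  }
]
```

Each input_data_file is just the raw tensor bytes, placed under the model's warmup/ directory.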
I was able to resolve this... the degradation was caused by alpha scaling not being applied by the provided conversion scripts (now resolved in a PR). I was able to get...
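For anyone hitting the same thing: the scaling in question is the standard LoRA factor alpha / r applied to the B·A product. A minimal sketch of folding it into the weights by hand while the fix was pending (paths are hypothetical):

```python
import json
from safetensors.torch import load_file

# Hypothetical layout of a Hugging Face PEFT adapter checkpoint.
with open("adapter/adapter_config.json") as f:
    cfg = json.load(f)
scale = cfg["lora_alpha"] / cfg["r"]  # LoRA scaling factor alpha / r

weights = load_file("adapter/adapter_model.safetensors")
for name in weights:
    # Fold the scale into the lora_B ("out") half so that B @ A
    # already carries alpha / r when the runtime applies the delta.
    if "lora_B" in name:
        weights[name] = weights[name] * scale
```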
How can I pass bfloat16 adapter weights to the backend? How could I build the pbtxt message in "preprocessing", for example, to be of bfloat16 datatype? What updates do I...
I would like to manage loading the LoRA weights on the first call to that adapter in my preprocessing model.py. I am not sure how to package the weights as...
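To make the question concrete: my understanding is that numpy has no bfloat16, so the tensor would have to go through DLPack (e.g. a torch bfloat16 tensor) rather than pb_utils.Tensor(name, ndarray), with the corresponding output declared TYPE_BF16 in config.pbtxt. A rough sketch of what I mean inside the preprocessing model.py (names and shapes are placeholders):

```python
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder weights; in practice these come from the adapter.
            lora_weights = torch.zeros(1, 8, 1024, dtype=torch.bfloat16)
            # numpy cannot represent bfloat16, so hand the tensor over
            # via DLPack instead of a numpy array.
            out = pb_utils.Tensor.from_dlpack(
                "lora_weights", torch.utils.dlpack.to_dlpack(lora_weights)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```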
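What I have in mind is roughly the following: cache the converted adapter tensors in the preprocessing model and only read them from disk the first time a given lora_task_id shows up. The file names and directory layout here are hypothetical (whatever your conversion script produced), not a confirmed convention:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Converted adapters keyed by task id, loaded lazily on first use.
        self._adapters = {}

    def _load_adapter(self, task_id):
        if task_id not in self._adapters:
            base = f"/loras/{task_id}"  # hypothetical layout, one dir per adapter
            self._adapters[task_id] = (
                np.load(f"{base}/lora_weights.npy"),
                np.load(f"{base}/lora_config.npy"),
            )
        return self._adapters[task_id]

    def execute(self, requests):
        responses = []
        for request in requests:
            task_id = int(pb_utils.get_input_tensor_by_name(
                request, "lora_task_id").as_numpy().item())
            weights, config = self._load_adapter(task_id)
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("lora_weights", weights),
                pb_utils.Tensor("lora_config", config),
            ]))
        return responses
```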
Any plans to add this as a controllable feature? Any other alternative suggestions on how I can keep the internal queue from getting too large for a single instance?
Thank you for pointing me to this! Things that this helped clear up (and may help someone in the future): starting with .safetensors from Hugging Face, you need to convert them...
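For anyone starting from the same place, the first thing that helped me was dumping what the PEFT checkpoint actually contains before converting, since the rank, alpha, and target modules in adapter_config.json all feed into the conversion. A minimal sketch (paths hypothetical):

```python
import json
from safetensors.torch import load_file

adapter = load_file("adapter/adapter_model.safetensors")  # hypothetical path
for name, tensor in adapter.items():
    print(name, tuple(tensor.shape), tensor.dtype)

with open("adapter/adapter_config.json") as f:
    cfg = json.load(f)
print("r:", cfg["r"], "lora_alpha:", cfg["lora_alpha"],
      "target_modules:", cfg["target_modules"])
```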
I am also concerned that other parameters in "adapter_config.json" would not be used by TensorRT-LLM, "lora_dropout" for example.