Nathan Price
I have heard that the architecture of Zephyr is very similar to Llama. Does TensorRT-LLM not currently work on Zephyr? I am hoping to understand what makes a new arch....
I am experiencing similar issues. I am using Llama 3 8B with LoRA weights. I get significantly worse results when making calls concurrently than I do when running one at a...
I am actually hoping to understand how to perform warmup within the triton-inference-server framework with the LoRA weights https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#model-warmup It would be nice for my client code to not...
I have been able to get warmup to load within triton-inference-server and initialize my weights, but due to the degraded performance of my model outputs I am suspicious I have...
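For context, what I am picturing is roughly a model_warmup stanza in the tensorrt_llm model's config.pbtxt, with the LoRA tensors read from raw binary files dropped into the model directory's warmup/ subdirectory. This is only a sketch of the idea; the tensor names, dims, and dtypes are placeholders for my setup, not a verified config:

```
model_warmup [
  {
    name: "lora_adapter_warmup"
    batch_size: 1
    inputs {
      key: "input_ids"
      value { data_type: TYPE_INT32 dims: [ 8 ] input_data_file: "input_ids" }
    }
    inputs {
      key: "lora_task_id"
      value { data_type: TYPE_UINT64 dims: [ 1 ] input_data_file: "lora_task_id" }
    }
    inputs {
      key: "lora_weights"
      value { data_type: TYPE_FP16 dims: [ 224, 589824 ] input_data_file: "lora_weights" }
    }
    inputs {
      key: "lora_config"
      value { data_type: TYPE_INT32 dims: [ 224, 3 ] input_data_file: "lora_config" }
    }
    # remaining required tensors (input_lengths, request_output_len, ...) omitted
  }
]
```

Each input_data_file is just the raw tensor bytes, placed under the model's warmup/ directory.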
I was able to resolve this... the degradation was caused by alpha scaling not being applied by the provided conversion scripts (now resolved in a PR). I was able to get...
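For anyone hitting the same thing: the scaling in question is the standard LoRA factor alpha / r applied to the B·A product. A minimal sketch of folding it into the weights by hand while the fix was pending (paths are hypothetical):

```python
import json
from safetensors.torch import load_file

# Hypothetical layout of a Hugging Face PEFT adapter checkpoint.
with open("adapter/adapter_config.json") as f:
    cfg = json.load(f)
scale = cfg["lora_alpha"] / cfg["r"]  # LoRA scaling factor alpha / r

weights = load_file("adapter/adapter_model.safetensors")
for name in weights:
    # Fold the scale into the lora_B ("out") half so that B @ A
    # already carries alpha / r when the runtime applies the delta.
    if "lora_B" in name:
        weights[name] = weights[name] * scale
```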
How can I pass bfloat16 adapter weights to the backend? How could I build the pbtxt message in "preprocessing", for example, to be of bfloat16 datatype? What updates do I...
I would like to manage loading the LoRA weights on the first call to that adapter in my preprocessing model.py. I am not sure how to package the weights as...
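To make the question concrete: my understanding is that numpy has no bfloat16, so the tensor would have to go through DLPack (e.g. a torch bfloat16 tensor) rather than pb_utils.Tensor(name, ndarray), with the corresponding output declared TYPE_BF16 in config.pbtxt. A rough sketch of what I mean inside the preprocessing model.py (names and shapes are placeholders):

```python
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder weights; in practice these come from the adapter.
            lora_weights = torch.zeros(1, 8, 1024, dtype=torch.bfloat16)
            # numpy cannot represent bfloat16, so hand the tensor over
            # via DLPack instead of a numpy array.
            out = pb_utils.Tensor.from_dlpack(
                "lora_weights", torch.utils.dlpack.to_dlpack(lora_weights)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```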
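What I have in mind is roughly the following: cache the converted adapter tensors in the preprocessing model and only read them from disk the first time a given lora_task_id shows up. The file names and directory layout here are hypothetical (whatever your conversion script produced), not a confirmed convention:

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Converted adapters keyed by task id, loaded lazily on first use.
        self._adapters = {}

    def _load_adapter(self, task_id):
        if task_id not in self._adapters:
            base = f"/loras/{task_id}"  # hypothetical layout, one dir per adapter
            self._adapters[task_id] = (
                np.load(f"{base}/lora_weights.npy"),
                np.load(f"{base}/lora_config.npy"),
            )
        return self._adapters[task_id]

    def execute(self, requests):
        responses = []
        for request in requests:
            task_id = int(pb_utils.get_input_tensor_by_name(
                request, "lora_task_id").as_numpy().item())
            weights, config = self._load_adapter(task_id)
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("lora_weights", weights),
                pb_utils.Tensor("lora_config", config),
            ]))
        return responses
```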
Any plans to add this as a controllable feature? Any other alternative suggestions on how I can keep the internal queue from getting too large for a single instance?
Thank you for pointing me to this! Things that this helped clear up (and may help someone in the future): starting with .safetensors from Hugging Face, you need to convert them...
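For anyone starting from the same place, the first thing that helped me was dumping what the PEFT checkpoint actually contains before converting, since the rank, alpha, and target modules in adapter_config.json all feed into the conversion. A minimal sketch (paths hypothetical):

```python
import json
from safetensors.torch import load_file

adapter = load_file("adapter/adapter_model.safetensors")  # hypothetical path
for name, tensor in adapter.items():
    print(name, tuple(tensor.shape), tensor.dtype)

with open("adapter/adapter_config.json") as f:
    cfg = json.load(f)
print("r:", cfg["r"], "lora_alpha:", cfg["lora_alpha"],
      "target_modules:", cfg["target_modules"])
```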
I am also concerned that other parameters in "adapter_config.json" would not be used by TensorRT-LLM, "lora_dropout" for example.