Ryan McCormick
Hi @rvroge, thanks for contributing the PR! CC'ing a couple folks who may have some more context @oandreeva-nv @nv-kmcgill53
Hi @chenchunhui97,
> generate onnx for server (with torch version 2.1.2)

If your `bert` model is an ONNX model, then you should be specifying the `onnxruntime` backend in the...
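For reference, a minimal `config.pbtxt` for an ONNX model looks roughly like the sketch below; the tensor names, dtypes, and dims are placeholders and must match what the exported ONNX graph actually declares:

```
name: "bert"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "input_ids"     # placeholder -- must match the ONNX graph's input name
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"        # placeholder -- must match the ONNX graph's output name
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
```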
This seems like a reasonable and relatively simple request to me. @GuanLuo @nnshah1 what do you think?
Hi @chenchunhui97, this example is pretty old, but may work: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/triton/README.md
Testing PR: https://github.com/triton-inference-server/server/pull/7756
Hi @Vaishnvi, thanks for sharing such detailed info. Since this is an ONNX model, and the [ORT backend supports full config auto-complete](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#auto-generated-model-configuration), can you try to load the model without...
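As a quick sanity check (paths below are placeholders, and exact flag behavior depends on your server version), you can drop the `config.pbtxt` entirely and let the ORT backend derive the inputs/outputs:

```bash
# Minimal layout with no config.pbtxt:
#   models/
#   └── bert/
#       └── 1/
#           └── model.onnx
#
# Recent releases enable auto-complete by default; older releases may need
# --strict-model-config=false to be passed explicitly.
tritonserver --model-repository=/path/to/models --log-verbose=1
```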
Hi @ShuaiShao93, thanks for raising this issue. To clarify, are these errors causing the inference requests to fail? Or is it just logging errors without affecting inference? CC @krishung5 @indrajit96
Thanks for clarifying! I'm going to move this issue to the TRT-LLM Backend repo for further help.
Hi @jasonngap1, I'm transferring this issue to the TRT-LLM backend: https://github.com/triton-inference-server/tensorrtllm_backend for help. CC @schetlur-nv @pcastonguay
Hi @chorus-over-flanger @faileon, thanks for expressing interest in the `/embeddings` route! It's on our radar as another feature to add to the OpenAI frontend support (chat, completions, and models) added...
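For context, the existing routes are plain OpenAI-style HTTP endpoints, and an `/embeddings` route would presumably mirror the standard OpenAI request shape. Purely as an illustrative sketch (the port, model names, and the embeddings endpoint itself are assumptions here, not shipped behavior):

```bash
# Existing OpenAI-compatible route (chat completions); host/port and model
# name are placeholders.
curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'

# A future /v1/embeddings route would presumably follow the standard
# OpenAI embeddings request shape:
curl http://localhost:9000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "my-embedding-model", "input": "The quick brown fox"}'
```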