Yuchao Zhang
I think you could customize the logic in [postprocess](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/postprocessing/1/model.py) and [preprocess](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/preprocessing/1/model.py) to do the calculation.
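For reference, here's a rough sketch of where such a calculation could live in the postprocessing model's `execute`. The tensor names (`TOKENS_BATCH`, `OUTPUT`, `OUTPUT_TOKEN_COUNT`), the tokenizer path, and the extra output are assumptions for illustration; the real model.py reads these from the model config, and any new output would also have to be declared in `config.pbtxt`:

```python
# Sketch of a customized Triton Python backend postprocessor that adds a
# per-sequence calculation (here: generated token counts) next to the usual
# decoding step. Tensor names and the tokenizer are assumed, not taken from
# the stock config -- verify against your deployed model before reusing.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Assumed tokenizer; the stock model.py loads it from model parameters.
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Generated token IDs, shaped [batch, beam, seq_len] in the stock model.
            tokens = pb_utils.get_input_tensor_by_name(
                request, "TOKENS_BATCH").as_numpy()

            # Custom calculation: count generated tokens per sequence.
            token_count = np.array(
                [[seq.size for seq in beam] for beam in tokens],
                dtype=np.int32)

            # Decode token IDs back to text, as the stock postprocessor does.
            texts = np.array(
                [[self.tokenizer.decode(seq) for seq in beam] for beam in tokens],
                dtype=object)

            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT", texts),
                # Extra output carrying the custom calculation.
                pb_utils.Tensor("OUTPUT_TOKEN_COUNT", token_count),
            ]))
        return responses
```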
@nirga thanks for this update. Any ETA on publishing a new version so we can use it for auto-instrumentation with `opentelemetry-instrument`?
How does it work?
Sorry, I think I misunderstood the `n` in trtllm; I expected multiple beams to be returned. According to this thread, https://github.com/triton-inference-server/tensorrtllm_backend/issues/499, maybe I need to make multiple requests...
By the way, do you know what `choice.index` would look like when using `stream` along with `n>1`?
I did not find this option in https://platform.openai.com/docs/api-reference/chat/create
@visitsb It's fine to add `/v1/models`. But the [full OpenAI API](https://platform.openai.com/docs/api-reference/introduction) is a long list, with endpoints like `/v1/audio` and `/v1/embeddings`. What's the minimal subset that's needed?
The exposed API depends on the actual model hosted in the Triton backend. Since there's no embedding model available in trtllm, `/v1/embeddings` is not possible. For an embedding model, maybe you can...
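As a starting point, here's a minimal sketch of what a `/v1/models` route could look like. FastAPI is assumed here, and `list_triton_models()` is a hypothetical helper that would query the Triton model repository (e.g. via `tritonclient`'s model repository index); swap in however the frontend is actually wired to the server:

```python
# Minimal sketch of a /v1/models route for an OpenAI-compatible frontend.
# FastAPI is an assumption; list_triton_models() is a placeholder helper.
from fastapi import FastAPI

app = FastAPI()


def list_triton_models() -> list[str]:
    # Placeholder: replace with a real query against the Triton server,
    # e.g. tritonclient's get_model_repository_index().
    return ["ensemble"]


@app.get("/v1/models")
def get_models():
    # Mirror the shape of the OpenAI "list models" response.
    return {
        "object": "list",
        "data": [
            {"id": name, "object": "model", "owned_by": "triton"}
            for name in list_triton_models()
        ],
    }
```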
@0xMochan Happy new year. May I kindly ask whether the team is back in the office? Here's the latest OpenAI OpenAPI spec on `tool_calls`: https://github.com/openai/openai-openapi/blob/d2eaa350b5b619ad6355384279a9beb9d423d88b/openapi.yaml#L12733-L12737, and it accepts an empty list, as in https://github.com/openai/openai-openapi/blob/d2eaa350b5b619ad6355384279a9beb9d423d88b/openapi.yaml#L7781
I'm not sure whether the input tokens have exceeded the max token limit. You could also check the postprocessing part in Triton to debug the generated tokens if possible.
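For example, one quick way to see what the postprocessor actually receives is to log the raw generated token IDs inside its `execute` before they are decoded. The `TOKENS_BATCH` input name is an assumption taken from the stock postprocessing model; adjust it to your `config.pbtxt`:

```python
# Hypothetical debug logging inside postprocessing/1/model.py's execute():
# dump the raw generated token IDs so you can see whether the output is
# empty, truncated, or hitting the max-token limit.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            tokens = pb_utils.get_input_tensor_by_name(
                request, "TOKENS_BATCH").as_numpy()
            # Log shape and IDs to the Triton server log for inspection.
            pb_utils.Logger.log_info(
                f"generated tokens shape={tokens.shape}, ids={tokens.tolist()}")
            # ... the existing decoding / response-building logic that
            # appends to `responses` continues here ...
        return responses
```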