fastertransformer_backend
In the ensemble model example for [gpt](https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gpt), can I change the `fastertransformer` model to a `decoupled` model and enable streaming on the client side?
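For reference, a minimal sketch of what the client side could look like once the `fastertransformer` model is switched to decoupled mode and the request is sent over Triton's gRPC streaming API. The endpoint, model name (`ensemble`), input names (`INPUT_0`, `INPUT_1`), shapes, and output name (`OUTPUT_0`) are assumptions based on the gpt ensemble example and should be checked against your `config.pbtxt`:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed gRPC endpoint; adjust to your deployment.
client = grpcclient.InferenceServerClient("localhost:8001")

def callback(result, error):
    # With a decoupled model, each generated chunk arrives as a separate response.
    if error is not None:
        print("error:", error)
    else:
        print("partial result:", result.as_numpy("OUTPUT_0"))  # assumed output name

# Assumed ensemble inputs: INPUT_0 = prompt strings, INPUT_1 = requested output length.
prompts = np.array([["Hello, my name is"]], dtype=object)
output_len = np.array([[32]], dtype=np.uint32)

inputs = [
    grpcclient.InferInput("INPUT_0", prompts.shape, "BYTES"),
    grpcclient.InferInput("INPUT_1", output_len.shape, "UINT32"),
]
inputs[0].set_data_from_numpy(prompts)
inputs[1].set_data_from_numpy(output_len)

client.start_stream(callback=callback)
client.async_stream_infer(model_name="ensemble", inputs=inputs)
client.stop_stream()  # waits for the stream to drain before closing
```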
Really appreciate the awesome work by the team - have managed to get almost a 100x speedup so far with the `fastertransformer_backend` on Triton compared to plain PyTorch with a...
### Description

```shell
Dockerfile: faster_transformer (v1.2)
Model: GPT-J
```

### Reproduced Steps

The streaming example in issue_requests.py throws the following error when passing in a request:

```shell
Traceback (most recent call...
```
I think `ARG SM=80` is required if I am to build the FasterTransformer library, but what about this FasterTransformer backend?
It looks like that if `is_return_log_probs` is set to `False`, the decoupled model does not return anything.
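If that is the behaviour being observed, one way to confirm it is to send the flag explicitly with the request. A minimal sketch, assuming `is_return_log_probs` is exposed as an optional `TYPE_BOOL` input of shape [1] on the `fastertransformer` model (check the model's `config.pbtxt`):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Inputs already prepared for the request (token ids, lengths, ...) would go here.
inputs = []

# Assumption: is_return_log_probs is an optional TYPE_BOOL input of shape [1].
flag = np.array([[True]], dtype=bool)
log_probs_input = grpcclient.InferInput("is_return_log_probs", flag.shape, "BOOL")
log_probs_input.set_data_from_numpy(flag)
inputs.append(log_probs_input)
```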
Running `python3 tools/end_to_end_test_llama.py` fails with the error: `[400] HTTP end point doesn't support models with decoupled transaction policy`.
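That error is expected behaviour: Triton's HTTP/REST endpoint cannot serve models with a decoupled transaction policy, so a decoupled model has to be queried over the gRPC streaming API instead. A small sketch for checking whether the model the test script targets is actually decoupled (the model name `fastertransformer` is a placeholder):

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# "fastertransformer" is a placeholder; use the model name from the test script.
config = client.get_model_config("fastertransformer").config
print(config.model_transaction_policy.decoupled)  # True => gRPC streaming is required
```

If this prints `True`, the test needs to use `tritonclient.grpc` with `start_stream`/`async_stream_infer` (as in the streaming sketch above) rather than the HTTP client.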
This is a very basic change to the README but I think an important one if users are going to realize they can use 23.05.
In a production environment like ChatGPT, early termination of a conversation based on client-side commands can be a major requirement. I'm wondering whether a gRPC streaming request can be terminated...
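For what it's worth, the Python gRPC client lets you close the stream from the client side at any point; whether requests still running on the server are actually cancelled depends on the Triton and client versions, so the sketch below is an assumption to verify (in particular, the `cancel_requests` argument to `stop_stream` only exists in newer `tritonclient` releases):

```python
import threading
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
stop_event = threading.Event()  # set from the UI / user-side "stop" command

def callback(result, error):
    if error is not None:
        print(error)
    # Partial results stream in here; the user side can call stop_event.set() at any time.

client.start_stream(callback=callback)
# client.async_stream_infer(model_name="ensemble", inputs=inputs) would go here.

stop_event.wait(timeout=30)  # returns early if the user requested termination

# Tears down the stream from the client side. Whether requests still running on the
# server are cancelled depends on the server/client version; newer tritonclient
# releases expose stop_stream(cancel_requests=True) for that (assumption: verify
# against your installed version).
client.stop_stream()
```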
Hello. Thank you for your work and framework! My goal is to host N instances of GPTJ-6B on N graphics cards. I want to have N instances with one model...
The existing FT backend will throw an error for the LLaMA model.