
75 fastertransformer_backend issues

In the ensemble model example for [gpt](https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gpt), can I change the `fastertransformer` model to a `decoupled` model and enable streaming on the client side?
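
For reference, Triton enables decoupled (streaming) responses through the model transaction policy in `config.pbtxt`, and the FT backend's streaming support rides on that setting. A minimal sketch of the relevant excerpt, with the caveat that decoupled models can only be consumed through the gRPC streaming API:

```
# fastertransformer/config.pbtxt (excerpt)
# Decoupled mode lets the backend send one response per generated
# chunk; clients must then use the gRPC streaming API, not HTTP.
model_transaction_policy {
  decoupled: True
}
```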

Really appreciate the awesome work by the team - have managed to get almost a 100x speedup so far with the `fastertransformer_backend` on Triton compared to plain PyTorch with a...

Description: Dockerfile: faster_transformer (v1.2); Model: GPT-J. Reproduced steps: the streaming example in `issue_requests.py` throws the following error when passing in a request: `Traceback (most recent call...`

bug

I think `ARG SM=80` is required when building the FasterTransformer library itself, but is it also required for this FasterTransformer backend?
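
For context, `SM` selects the CUDA compute capability the FasterTransformer kernels are compiled for (80 = A100). A sketch of passing it through a Docker build, assuming the backend's Dockerfile declares such an ARG; the exact ARG names vary by release, so verify them in your checkout:

```shell
# Compile for compute capability 8.0 (A100). The --build-arg name is an
# assumption based on `ARG SM=80` above; verify it in docker/Dockerfile.
docker build --build-arg SM=80 -t triton_ft_backend -f docker/Dockerfile .
```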

Looks like if `is_return_log_probs` is set to `False`, then the decoupled model does not return anything.
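
As a possible workaround while this is open, the flag can be forced on from the client side. A minimal sketch using `tritonclient`; the tensor name and BOOL dtype follow the FT backend's GPT config, and the `[1, 1]` shape assumes batching is enabled:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Force log-prob computation so the decoupled model returns output.
# Tensor name/dtype follow the FT backend GPT config; adjust the
# shape for your model's batching configuration.
flag = grpcclient.InferInput("is_return_log_probs", [1, 1], "BOOL")
flag.set_data_from_numpy(np.array([[True]], dtype=bool))
# Append `flag` to the inputs passed to async_stream_infer().
```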

Running `python3 tools/end_to_end_test_llama.py` fails with the error: `[400] HTTP end point doesn't support models with decoupled transaction policy`
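
That error is expected behavior: Triton's HTTP/REST frontend cannot serve decoupled models, so the request has to go through the gRPC streaming API instead. A minimal sketch of the client side, where the model name and `inputs` list are illustrative placeholders:

```python
from functools import partial
import tritonclient.grpc as grpcclient

def callback(responses, result, error):
    # Called once per streamed response from the decoupled model.
    responses.append(error if error is not None else result)

responses = []
client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=partial(callback, responses))
# `inputs` is the usual list of grpcclient.InferInput tensors for the model.
client.async_stream_infer(model_name="fastertransformer", inputs=inputs)
client.stop_stream()
```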

This is a very basic change to the README but I think an important one if users are going to realize they can use 23.05.

In a production environment like ChatGPT, early termination of a conversation based on client-side user commands can be a major requirement. I'm wondering whether a gRPC streaming request can be terminated...
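
For what it's worth, the Python gRPC client can tear the stream down from its side; a sketch of the pattern, with the open question being whether the backend actually aborts the in-flight generation when the stream closes:

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=lambda result, error: None)  # illustrative callback
# ... async_stream_infer(...) issued here as in the earlier sketch ...

# On a user "stop" command, close the stream. This stops response
# delivery on the client; whether server-side generation is aborted
# is exactly the behavior being asked about.
client.stop_stream()
```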

Hello. Thank you for your work and framework! My goal is to host N instances of GPT-J 6B on N graphics cards. I want to have N instances with one model...
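
In plain Triton terms this maps to an `instance_group` count. A sketch of the `config.pbtxt` excerpt, with the caveat that the FT backend runs its instances as `KIND_CPU` and maps them onto visible GPUs itself when `tensor_para_size` is 1, so placement behavior should be verified for your version:

```
# config.pbtxt excerpt: N model instances, one per visible GPU,
# with no tensor parallelism (tensor_para_size = 1).
instance_group [
  {
    count: 4          # set to N, the number of GPUs
    kind: KIND_CPU    # the FT backend manages GPU placement itself
  }
]
parameters {
  key: "tensor_para_size"
  value: { string_value: "1" }
}
```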

The existing FT backend throws an error for LLaMA models.