Flex Wang
Saw the code here: https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/src/fastertransformer/layers/attention_layers/BaseAttentionLayer.h#L72 Is there any reason FlashAttention shouldn't be used for an encoder-only model?
Looks like if I set the model as decoupled, I can still query it in non-streaming mode. Is this expected behavior? What is the latency impact here?
I am running a T5 decoder model with FasterTransformer, and it seems that if I set beam_width > 1, the results that are streamed back are just garbage tokens. Is this expected?
``` Deadlock detected. Resetting KV cache and recomputing requests. Consider limiting number of concurrent requests or decreasing max lengths of prompts/generations. ``` I constantly see this issue when running the below on...
When I run the examples in [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/benchmarks/inference/mii/server.py) to start a server, it occupies all the GPU memory at the beginning. Is it possible to configure the max GPU memory that it...
My understanding is that we have to build a FastAPI wrapper: during the initialization phase we call `client = mii.client("mistralai/Mistral-7B-v0.1")`, and we implement a handler that calls `client.generate`.
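For reference, a minimal sketch of the wrapper I have in mind. Only `mii.client` and `client.generate` come from the MII docs; the endpoint name, request schema, and the way I stringify the response object are my own assumptions and may need adjusting for a given MII version.

```python
# Minimal FastAPI wrapper sketch (assumes an MII deployment for
# "mistralai/Mistral-7B-v0.1" has already been started, e.g. via mii.serve
# in a separate process).
import mii
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Connect once, at startup, to the already-running MII deployment.
client = mii.client("mistralai/Mistral-7B-v0.1")


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128


@app.post("/generate")
def generate(req: GenerateRequest):
    # client.generate takes a list of prompts and returns one response per prompt.
    # The exact response schema differs across MII versions, so the string
    # conversion below is just a placeholder.
    responses = client.generate([req.prompt], max_new_tokens=req.max_new_tokens)
    return {"text": str(responses[0])}
```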
In the ensemble model example for [gpt](https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gpt), can I change the `fastertransformer` model to a `decoupled` model and enable streaming on the client side?
Looks like if `is_return_log_probs` is set to `False`, then the decoupled model does not return anything.