Flex Wang
Saw the code here: https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/src/fastertransformer/layers/attention_layers/BaseAttentionLayer.h#L72 Is there any reason FlashAttention shouldn't be used for an encoder-only model?
Looks like if I set the model as decoupled, I can still query it in non-streaming mode. Is this expected behavior? What is the latency impact here?
I am running a T5 decoder model with FasterTransformer, and it seems that if I set beam_width > 1, the results that are streamed back are just garbage tokens. Is this expected?
``` Deadlock detected. Resetting KV cache and recomputing requests. Consider limiting number of concurrent requests or decreasing max lengths of prompts/generations. ``` I constantly see this issue when running the below on...
When I run the examples in [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/blob/master/benchmarks/inference/mii/server.py) to start a server, it occupies all the GPU memory at the beginning. Is it possible to configure the max GPU memory that it...
My understanding is that we have to build a FastAPI wrapper: during the initialization phase we call `client = mii.client("mistralai/Mistral-7B-v0.1")`, and we implement a handler that calls `client.generate`.
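For reference, a minimal sketch of the wrapper I have in mind. Only `mii.client` and `client.generate` come from the MII docs; the endpoint name, request schema, and the way I stringify the response object are my own assumptions and may need adjusting for a given MII version.

```python
# Minimal FastAPI wrapper sketch (assumes an MII deployment for
# "mistralai/Mistral-7B-v0.1" has already been started, e.g. via mii.serve
# in a separate process).
import mii
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Connect once, at startup, to the already-running MII deployment.
client = mii.client("mistralai/Mistral-7B-v0.1")


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128


@app.post("/generate")
def generate(req: GenerateRequest):
    # client.generate takes a list of prompts and returns one response per prompt.
    # The exact response schema differs across MII versions, so the string
    # conversion below is just a placeholder.
    responses = client.generate([req.prompt], max_new_tokens=req.max_new_tokens)
    return {"text": str(responses[0])}
```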
In the ensemble model example for [gpt](https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gpt), can I change the `fastertransformer` model to a `decoupled` model and enable streaming on the client side?
Looks like if `is_return_log_probs` is set to `False`, then the decoupled model does not return anything.