
Write a custom flash-attention function for the deberta model.

Open wolfassi123 opened this issue 5 months ago • 1 comment

Model description

I used michaelf34/infinity:0.0.55 to deploy mixed_bread_large reranker.

The container is up and I can query the model via Python requests, but inference is slow: 100 requests take 8 seconds, compared to 0.8 seconds for 100 requests with TEI serving BGE, even though BGE-large and mixedbread-large are the same size (335M parameters).

What is the best way to optimize the deployment and inference?
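One common throughput fix, independent of flash-attention support, is to batch all document pairs into a single rerank request instead of firing 100 separate HTTP calls, so the server can batch them on the GPU. The sketch below builds such a batched payload; the `/rerank`-style payload schema and the model name are assumptions for illustration — check your deployment's `/docs` page for the exact endpoint and fields.

```python
# Sketch: batch documents into one rerank payload instead of one
# request per document pair. The payload shape and the model name are
# assumptions -- verify against your infinity server's OpenAPI docs.

def build_rerank_payload(query, documents,
                         model="mixedbread-ai/mxbai-rerank-large-v1"):
    """Build a single payload that scores all documents in one call."""
    return {"model": model, "query": query, "documents": list(documents)}

def chunk(items, size):
    """Split a large document list into batches of at most `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    docs = [f"passage {i}" for i in range(100)]
    # One request carrying all 100 documents instead of 100 requests:
    payload = build_rerank_payload("what is flash attention?", docs)
    print(len(payload["documents"]))  # 100
    # Or a few medium-sized batches if the corpus is very large:
    print([len(b) for b in chunk(docs, 32)])  # [32, 32, 32, 4]
```

The server would then receive each batch as one POST (e.g. with `requests.post(url, json=payload)`), which usually closes most of the gap versus per-pair requests before any kernel-level optimization is needed.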

Open source status

  • [X] The model implementation is available on transformers
  • [X] The model weights are available on huggingface-hub
  • [x] I verified that the model is currently not running in the latest version: pip install infinity_emb[all] --upgrade

Provide useful links for the implementation

No response

wolfassi123 · Sep 12 '24