infinity
Write a custom flash-attention function for the DeBERTa model.
Model description
I used michaelf34/infinity:0.0.55 to deploy the mixed_bread_large reranker.
The container is up and I can query the model with Python requests, but it is slow: 100 requests take 8 s, compared to 0.8 s for 100 requests with TEI serving BGE, even though BGE-large and mixed_bread_large are the same size (335M parameters).
What is the best way to optimize the deployment and inference?
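One common cause of numbers like 8 s for 100 requests is sending the requests one at a time: servers like infinity batch concurrent in-flight requests dynamically, so sequential clients never fill a batch. Below is a minimal sketch of dispatching requests concurrently with a thread pool; the HTTP call is replaced by a stand-in function with a fixed sleep (the endpoint, port, and 80 ms latency are assumptions for illustration, not measurements of this deployment):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_rerank_request(i: int) -> int:
    # Stand-in for an HTTP POST to the server's rerank endpoint
    # (e.g. http://localhost:7997/rerank -- hypothetical address).
    # Sleeps to mimic ~80 ms of network + queue latency per call.
    time.sleep(0.08)
    return i


N = 100

# Sequential dispatch would cost roughly N * 0.08 s = ~8 s,
# matching the reported figure. Concurrent dispatch lets the
# server batch the in-flight requests together:
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(fake_rerank_request, range(N)))
concurrent_s = time.perf_counter() - start
print(f"{len(results)} requests in {concurrent_s:.2f} s")
```

With 32 workers the wall-clock time drops to a few batches' worth of latency instead of 100 serial round trips; in a real client the same pattern applies with `requests.post` or an async HTTP client.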
Open source status
- [X] The model implementation is available on transformers
- [X] The model weights are available on huggingface-hub
- [X] I verified that the model is currently not running in the latest version
```
pip install infinity_emb[all] --upgrade
```
Provide useful links for the implementation
No response
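As a starting point for an implementation: DeBERTa's disentangled attention adds relative-position bias terms, so a flash kernel is not a trivial drop-in there. For the plain content-to-content part, however, PyTorch's `torch.nn.functional.scaled_dot_product_attention` already dispatches to fused flash / memory-efficient backends where available. The sketch below only checks that the fused call matches a hand-written softmax(QKᵀ/√d)V; handling DeBERTa's relative-position terms is left out and would be the actual work of this issue:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 16, 32  # batch, heads, sequence length, head dim
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# Naive attention: softmax(Q K^T / sqrt(D)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(D)
naive = torch.softmax(scores, dim=-1) @ v

# Fused kernel: uses flash / memory-efficient backends when supported
fused = F.scaled_dot_product_attention(q, k, v)

assert torch.allclose(naive, fused, atol=1e-5)
```

A DeBERTa-specific version would additionally need to fold the content-to-position and position-to-content scores into the attention bias, which is where a custom kernel (or a bias-capable fused attention) comes in.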