api-inference-community
[Startup Plan] Unable to get CPU-optimized inference API
Hi community,
I have subscribed to a 7-day free trial of the Startup Plan and I wish to test the CPU-optimized Inference API on this model: https://huggingface.co/Matthieu/stsb-xlm-r-multilingual-custom
However, when using the code below:
import json
import requests

API_URL = "https://api-inference.huggingface.co/models/Matthieu/stsb-xlm-r-multilingual-custom"
headers = {"Authorization": "Bearer API_ORG_TOKEN"}

def query(payload):
    # POST the JSON payload and return the parsed body together with the
    # x-compute-type response header (e.g. cpu vs cpu+optimized).
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8")), response.headers.get("x-compute-type")
payload1 = {"inputs": "Navigateur Web : Ce logiciel permet d'accéder à des pages web depuis votre ordinateur. Il en existe plusieurs téléchargeables gratuitement comme Google Chrome ou Mozilla. Certains sont même déjà installés comme Safari sur Mac OS et Edge sur Microsoft.", "options": {"use_cache": False}}
sentence_embeddings1, x_compute_type1 = query(payload1)
print(sentence_embeddings1)
print(x_compute_type1)
I get the sentence embeddings, but the x-compute-type header of the response returns cpu and not cpu+optimized. Do I have to request something to get CPU-optimized inference?
Thanks!
Maybe of interest to @Narsil
Hi @Matthieu-Tinycoaching, this is linked to huggingface/api-inference-community#26
Community images do not implement:
- private models
- GPU inference
- Acceleration
So what you are seeing is normal and expected. If you don't mind, we should keep the discussion over there, as all three are correlated.
Hi @Narsil, thanks for the feedback.
However, I don't understand how I can test the accelerated CPU Inference API on my custom public model.
What is testable on the accelerated Inference API, and what should I expect to benefit from with the Startup Plan free trial?
Hi, you can test transformers-based models with all the API features, but not sentence-transformers at the moment.
Also, feature-extraction, even in transformers, does not have every optimization enabled by default.
feature-extraction extracts raw hidden states, so it might be more sensitive to quantization than other pipelines, and we don't know how sensitive end users are to that. It is available for every architecture in transformers, which might also lead to poorer speedups (or sometimes slowdowns) than expected on some architectures if we simply use the defaults.
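For instance, a rough way to gauge that sensitivity locally is to compare embeddings from the full-precision model against a dynamically quantized copy. A minimal sketch, assuming the model loads with plain AutoModel and using naive mean pooling as a stand-in for the actual sentence-transformers pooling head:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Rough local check of quantization sensitivity: compare raw-hidden-state
# embeddings from the fp32 model against a dynamically quantized int8 copy.
# Mean pooling is only a stand-in for the model's actual pooling head.
model_id = "Matthieu/stsb-xlm-r-multilingual-custom"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

def embed(m, text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = m(**inputs).last_hidden_state   # raw hidden states
    return hidden.mean(dim=1).squeeze(0)         # naive mean pooling

text = "Ce logiciel permet d'accéder à des pages web depuis votre ordinateur."
cosine = torch.nn.functional.cosine_similarity(embed(model, text), embed(quantized, text), dim=0)
print(f"cosine similarity, fp32 vs int8: {cosine.item():.4f}")
```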
But if you pin your model we would be able to run a few tests and optimize this pipeline so you can test performance.
Anticipating a bit, but feature-extraction and sentence embeddings are usually very fast, so maybe try to batch part of the inputs; it will reduce the HTTP + network overhead of the overall computation. (Simply send a list of strings within inputs instead of a single sentence.)
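For example, a batched payload could look like this (a minimal sketch, reusing the query helper from above and assuming the pipeline accepts a list of strings under inputs):

```python
# Batched request: amortize HTTP/network overhead by sending several
# sentences at once under "inputs", reusing the query helper from above.
payload_batch = {
    "inputs": [
        "Navigateur Web : Ce logiciel permet d'accéder à des pages web depuis votre ordinateur.",
        "Il en existe plusieurs téléchargeables gratuitement comme Google Chrome ou Mozilla.",
        "Certains sont même déjà installés comme Safari sur Mac OS et Edge sur Microsoft.",
    ],
    "options": {"use_cache": False},
}

embeddings_batch, x_compute_type = query(payload_batch)
print(len(embeddings_batch))   # expected: one embedding per input sentence
print(x_compute_type)
```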
Hi @Narsil.
> Anticipating a bit, but feature-extraction and sentence embeddings are usually very fast, so maybe try to batch part of the inputs; it will reduce the HTTP + network overhead of the overall computation. (Simply send a list of strings within inputs instead of a single sentence.)
Please correct me if I'm wrong, but there is no batch support at the moment (although it should be almost trivial to change; it was also requested by @kvit in https://github.com/UKPLab/sentence-transformers/pull/925#issuecomment-856112449).
Hi @Narsil
> You can test transformers-based models with all the API features, but not sentence-transformers at the moment.
Thank you for this clarification. Do you have an approximate timeline for when sentence-transformers will be available with all the API features?
I ran some load testing on my public model on the model hub. Since I can't access accelerated (CPU or GPU) inference for the moment, I am intrigued by which architecture handled my CPU load testing of my public custom model. Could you specify the physical characteristics/architecture that are used, and which pricing tier this corresponds to, given that I could test it even with the free plan? This would let me better compare my benchmark against different cloud service solutions.
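For context, the kind of load test I mean is roughly the following (a minimal sketch reusing the query helper from my first message; the request count, concurrency level, and sentence are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Fire a number of concurrent requests against the hosted model and record
# per-request latency; reuses the query helper from my first message.
N_REQUESTS = 50     # placeholder values
CONCURRENCY = 8
payload = {"inputs": "Ce logiciel permet d'accéder à des pages web depuis votre ordinateur.",
           "options": {"use_cache": False}}

def timed_query(_):
    start = time.perf_counter()
    query(payload)
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_query, range(N_REQUESTS)))

print(f"p50 latency: {latencies[len(latencies) // 2]:.3f}s")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)]:.3f}s")
```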
> But if you pin your model we would be able to run a few tests and optimize this pipeline so you can test performance.
I have pinned my custom model on both CPU and GPU devices. Thanks in advance for the optimization on your side, so that I can test performance before the end of my Startup Plan trial!
> Anticipating a bit, but feature-extraction and sentence embeddings are usually very fast, so maybe try to batch part of the inputs; it will reduce the HTTP + network overhead of the overall computation. (Simply send a list of strings within inputs instead of a single sentence.)
As highlighted by @osanseviero, is there no batch support at the moment? Is there any practical tutorial on how to easily batch parts of the inputs and retrieve the corresponding outputs when dealing with a real-time application where each input is a request from a different user?
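To make the question concrete, something like the hypothetical sketch below is what I have in mind: per-user requests are buffered briefly, sent as one batched call, and each embedding is routed back to its caller. It assumes batched inputs are supported and reuses the query helper from my first message.

```python
import threading
import time
from concurrent.futures import Future

# Hypothetical micro-batcher: each user request gets its own Future, and a
# background thread periodically flushes the buffer with a single batched API
# call, then routes each embedding back to the caller that asked for it.
class MicroBatcher:
    def __init__(self, max_batch=16, max_wait=0.05):
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.buffer = []          # pending (text, Future) pairs
        self.lock = threading.Lock()
        threading.Thread(target=self._loop, daemon=True).start()

    def embed(self, text):
        future = Future()
        with self.lock:
            self.buffer.append((text, future))
        return future             # the caller blocks on future.result()

    def _loop(self):
        while True:
            time.sleep(self.max_wait)
            with self.lock:
                batch = self.buffer[:self.max_batch]
                self.buffer = self.buffer[self.max_batch:]
            if not batch:
                continue
            texts = [text for text, _ in batch]
            # Assumes the API returns one embedding per input string.
            embeddings, _ = query({"inputs": texts, "options": {"use_cache": False}})
            for (_, future), embedding in zip(batch, embeddings):
                future.set_result(embedding)

# In each request handler (one per user), with a single shared batcher:
#   batcher = MicroBatcher()
#   embedding = batcher.embed("sentence from user A").result()
```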
Thanks for your time!