instructor-embedding
Improving inference time
I am using the Instructor Base model and applied quantization on top of it to improve the inference time. Even after quantization, however, the inference time is between 6 and 7 seconds, whereas my requirement is to get it under 1 second. Are there any other ways to improve the inference time of the model? (A sketch of this kind of quantization follows the server configuration below.)
Server configuration:
- Memory: 8 GB
- CPUs: 4 cores
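
The exact quantization setup isn't shown in the thread; for reference, here is a minimal sketch of one common CPU-oriented approach, dynamic int8 quantization of the underlying transformer's Linear layers using PyTorch's standard API (the model name and instruction text are illustrative):

import torch
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-base')

# Replace the Linear layers of the underlying transformer with dynamically
# quantized int8 versions; weights are stored as int8 and activations are
# quantized on the fly at inference time
model[0].auto_model = torch.quantization.quantize_dynamic(
    model[0].auto_model, {torch.nn.Linear}, dtype=torch.qint8
)

embeddings = model.encode([['Represent the sentence:', 'example text']])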
Hello, I'm also looking for this kind of speed improvement. Did you find any good methods in the end?
You can use something like this:
torch_dtype = torch.float16  # e.g. half precision; pick the dtype you want
model.client[0].auto_model = model.client[0].auto_model.to(torch_dtype)
However, you'll need to import the following (GradScaler is only relevant for training; for inference, torch and autocast are what matter):

import torch
from torch.cuda.amp import autocast, GradScaler
Apply it after instantiating the model. Unfortunately, I was unable to find a way to use the usual torch_dtype-style approach (e.g. passing the dtype when loading the model).
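
For example, a minimal end-to-end sketch (assuming the raw INSTRUCTOR model from this repo and a CUDA GPU; the model.client[0] indexing above suggests a wrapper such as LangChain's HuggingFaceInstructEmbeddings, whose .client attribute holds the underlying INSTRUCTOR model, whereas with the raw model the first module is model[0]):

import torch
from InstructorEmbedding import INSTRUCTOR

# Load the model (fp32 by default), then convert the underlying
# transformer to half precision after instantiation
model = INSTRUCTOR('hkunlp/instructor-base', device='cuda')
model[0].auto_model = model[0].auto_model.to(torch.float16)

# Encode as usual: each input is an [instruction, sentence] pair
embeddings = model.encode([['Represent the sentence:',
                            'Improving the inference time of embedding models.']])

Note that half precision mainly pays off on GPU; on a CPU-only server like the one described above, int8 dynamic quantization or a smaller model is usually the more effective route.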