
Improving inference time

Open • alokpadhi opened this issue on Feb 15, 2024 • 2 comments

I am using the Instructor Base model and applied quantization to it to improve inference time. Even after quantization, inference still takes 6-7 seconds, whereas my requirement is under 1 second. Are there any other ways to improve the model's inference time? (A sketch of the quantization setup follows the server configuration below.)

Server configuration:

  • Memory: 8 GB
  • CPUs: 4 cores
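
For reference, a quantization setup along these lines (a minimal sketch, not necessarily the exact code I ran; it assumes dynamic INT8 quantization via torch.quantization.quantize_dynamic and the hkunlp/instructor-base checkpoint):

import torch
from InstructorEmbedding import INSTRUCTOR

# Load on CPU; dynamic quantization targets CPU inference
model = INSTRUCTOR('hkunlp/instructor-base', device='cpu')

# Swap Linear layers for dynamically quantized INT8 equivalents
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

emb = model.encode([['Represent the sentence:', 'A quick latency test']])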

alokpadhi • Feb 15 '24

Hello, I'm also looking for this kind of speedup. Did you end up finding any good methods to share?

EricPaul03 • May 20 '24

You can use something like this, where torch_dtype is a standard torch dtype such as torch.float16:

torch_dtype = torch.float16  # half precision; meaningful speedups generally require a GPU
model.client[0].auto_model = model.client[0].auto_model.to(torch_dtype)

If you also want mixed-precision inference, you can import autocast as well (GradScaler is only needed for training, not inference):

import torch
from torch.cuda.amp import autocast, GradScaler

You can apply this after instantiating the model. Unfortunately, I was unable to find a way to pass the usual torch_dtype argument at load time, so casting after instantiation is the workaround.
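
Put together, a half-precision setup might look like the following (a minimal sketch, not a tested recipe: it assumes the hkunlp/instructor-base checkpoint, a CUDA GPU, and that the first module exposes the underlying transformer as auto_model, as in sentence-transformers; the model.client[0] form above applies to wrappers such as LangChain's, which hold the INSTRUCTOR model in a .client attribute):

import torch
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-base', device='cuda')

# Cast the underlying transformer weights to half precision
model[0].auto_model = model[0].auto_model.to(torch.float16)

# Run inference under autocast so intermediate ops also use fp16
with torch.autocast(device_type='cuda', dtype=torch.float16):
    emb = model.encode([['Represent the question:', 'How do I speed up inference?']])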

BBC-Esq • Aug 12 '24