
inference time cpu vs gpu

Open • ganga7445 opened this issue on Nov 06 '23 • 3 comments

I have used gte-tiny embeddings for my custom NER model and need to speed up the inference time. Below are the stats for different batch sizes.

| Batch size | Average inference time (s), GPU | Average inference time (s), CPU |
|-----------:|--------------------------------:|--------------------------------:|
| 16 | 0.14945 | 1.23388 |
| 32 | 0.28 | 3.24456 |
| 64 | 0.51582 | 6.57234 |
| 128 | 1.10669 | 13.73319 |
| 256 | 2.24729 | 28.236 |
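
A minimal sketch of how per-batch-size averages like these might be measured (the model name and sentences below are placeholders, not the actual gte-tiny setup):

import time

import torch
from span_marker import SpanMarkerModel

# Placeholder model; swap in the custom gte-tiny SpanMarker model
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
if torch.cuda.is_available():
    model.cuda()  # comment this out to measure the CPU column instead

sentences = ["Leonardo da Vinci painted the Mona Lisa."] * 256

for batch_size in (16, 32, 64, 128, 256):
    # Warm-up call so one-time setup costs don't skew the average
    model.predict(sentences[:batch_size], batch_size=batch_size)
    runs = 5
    start = time.time()
    for _ in range(runs):
        model.predict(sentences[:batch_size], batch_size=batch_size)
    print(f"batch_size={batch_size}: {(time.time() - start) / runs:.5f}s per batch")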

Is there any specific method to speed this up? @tomaarsen

ganga7445 commented on Nov 06 '23 14:11

You may experience improved speed if you use SpanMarkerModel.from_pretrained(..., torch_dtype=torch.float16) or torch.bfloat16. See e.g.:


import time
import torch
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super", torch_dtype=torch.bfloat16, device_map="cuda")
# model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super", device_map="cuda")

text = [
    "Leonardo da Vinci recently published a scientific paper on combatting Mitocromulent disease. Leonardo da Vinci painted the most famous painting in existence: the Mona Lisa.",
    "Leonardo da Vinci scored a critical goal towards the end of the second half. Leonardo da Vinci controversially veto'd a bill regarding public health care last friday. Leonardo da Vinci was promoted to Sergeant after his outstanding work in the war."
]
BS = 64
N = 500
# Warm-up call so one-time setup costs don't end up in the timing
model.predict(text * 50, batch_size=BS)
# Timed run over N * 2 sentences in total
start_t = time.time()
model.predict(text * N, batch_size=BS)
print(f"{time.time() - start_t:8f}s for {N * 2} samples with batch_size={BS} and torch_dtype={model.dtype}.")

This gave me:

20.745640s for 1000 samples with batch_size=64 and torch_dtype=torch.float16.
16.534876s for 1000 samples with batch_size=64 and torch_dtype=torch.bfloat16.

and

39.655506s for 1000 samples with batch_size=64 and torch_dtype=torch.float32.

Note that float16 is not available on CPU though! Not sure about bfloat16.
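
A minimal sketch of guarding the dtype choice on device availability (same placeholder model as in the snippet above; on CPU it simply keeps the default float32 weights):

import torch
from span_marker import SpanMarkerModel

model_name = "tomaarsen/span-marker-roberta-large-fewnerd-fine-super"  # placeholder model
if torch.cuda.is_available():
    # Reduced precision only on CUDA; float16 is not supported for CPU inference
    model = SpanMarkerModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="cuda")
else:
    # CPU fallback: default float32 weights
    model = SpanMarkerModel.from_pretrained(model_name)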

If you have a Linux (or Mac?) device, then you can also use load_in_8bit=True and load_in_4bit=True by installing bitsandbytes, but I don't know if that improves inference speed - this is also only for CUDA.
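
That route would look roughly like the following sketch (untested here; it needs bitsandbytes and accelerate installed plus a CUDA device, and it may not be faster):

from span_marker import SpanMarkerModel

# Sketch only: 8-bit (or 4-bit) loading via bitsandbytes; whether this actually
# speeds up inference has not been verified.
model = SpanMarkerModel.from_pretrained(
    "tomaarsen/span-marker-roberta-large-fewnerd-fine-super",  # placeholder model
    load_in_8bit=True,  # or load_in_4bit=True
    device_map="cuda",
)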

Beyond that, the steps to increase inference speed become pretty challenging. Hope this helps a bit.

Also, you can already process about 8 sentences per second on CPU and about 110 sentences per second on GPU; is that not sufficiently fast?
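
Those figures follow from the batch-size-256 row of the table above, treating the reported times as seconds per batch:

print(256 / 28.236)   # ≈ 9 sentences/second on CPU
print(256 / 2.24729)  # ≈ 114 sentences/second on GPU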

  • Tom Aarsen

tomaarsen commented on Nov 06 '23 14:11

Thank you @tomaarsen, using torch.float16 worked for me. It would be excellent if the operation could be completed in less than one second with a batch size of 256.

| Batch size | Previous average inference time (s) | New inference time with float16 (s) |
|-----------:|------------------------------------:|------------------------------------:|
| 16 | 0.14945 | 0.09211015701 |
| 32 | 0.28 | 0.1645913124 |
| 64 | 0.51582 | 0.2973537445 |
| 128 | 1.10669 | 0.6381671429 |
| 256 | 2.24729 | 1.238643169 |
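
Treating those times as seconds per batch, the batch-size-256 row works out to roughly:

print(2.24729 / 1.238643169)  # ≈ 1.8x speedup from float16
print(256 / 1.238643169)      # ≈ 207 sentences/second, short of the >256/s a sub-second batch would need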

ganga7445 commented on Nov 09 '23 07:11

@polodealvarado started working on ONNX support here: https://github.com/tomaarsen/SpanMarkerNER/issues/26#issuecomment-1802366931. If we can make it work, perhaps we can improve the speed even further; until then, it will be hard to get even faster results. Less than a second for a batch size of 256 would mean more than 256 sentences per second, which is already quite efficient.

  • Tom Aarsen

tomaarsen commented on Nov 09 '23 07:11