Optimum model not using all available CPU threads
System Info
- python 3.8.13
- Ubuntu 22.04
- optimum 1.3.0
- torch 1.11.0 (cpu version)
- transformers 4.21.0
- datasets 2.4.0
- onnxruntime 1.12.0
Who can help?
@philschmid
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Run the script below to make predictions on the IMDB dataset with a pipeline. Toggle between using an optimum model and a transformers model by switching which model=... line is commented.
The transformers pipeline utilizes all CPU threads:

[screenshot: CPU utilization with the transformers model]

The optimum model does not:

[screenshot: CPU utilization with the optimum model]
import datasets
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification

model_arch = "google/bigbird-roberta-base"
max_length = 4096
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}

# Toggle between optimum and transformers model here ----------------------------------------
# model = AutoModelForSequenceClassification.from_pretrained(model_arch)
model = ORTModelForSequenceClassification.from_pretrained(model_arch, from_transformers=True)
# -------------------------------------------------------------------------------------------
model.config.id2label = id2label

raw_data = datasets.load_dataset("imdb", split="test").shuffle(seed=42).select(range(500))
print(raw_data)

tokenizer = AutoTokenizer.from_pretrained(model_arch, truncation=True, padding="max_length", max_length=max_length)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

print("Running inference...")
model_outputs = pipe(raw_data["text"])
Expected behavior
The expected behavior would be for the optimum model to behave like the transformers model and use all available CPU threads.
Hello @jessecambon
We have a PR already open, and almost done, that will let you provide the session options:
- #271 should let you define the values you want
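In the meantime, you can build the InferenceSession yourself with explicit SessionOptions and wrap it, as the scripts later in this thread do. A minimal sketch (the model path is a placeholder):

```python
import onnxruntime

options = onnxruntime.SessionOptions()
# Threads used to parallelize work within a single operator
options.intra_op_num_threads = 8  # e.g. your logical core count
session = onnxruntime.InferenceSession("/path/to/model.onnx", sess_options=options)
```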
In my experience, by default onnxruntime uses only physical cores, while PyTorch may use hyperthreading. For example, on my laptop with 10 physical cores (but 2 threads per core), torch.get_num_threads() gives 14 by default.
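To see what each runtime defaults to on your machine, you can compare the logical CPU count with PyTorch's thread-pool size; a quick check using only standard APIs:

```python
import os
import torch

print("logical CPUs:", os.cpu_count())                      # includes hyperthreads
print("torch intra-op threads:", torch.get_num_threads())   # PyTorch's default pool size
```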
With ONNX Runtime:

[screenshot: CPU utilization with ONNX Runtime]

While with PyTorch:

[screenshot: CPU utilization with PyTorch]
Run this (admittedly ugly) code while monitoring with mpstat -P ALL 1:
import time

import onnxruntime
import torch
from tqdm import tqdm
from optimum.onnxruntime.modeling_ort import ORTModelForSequenceClassification
from transformers import AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
optimum_model_path = "/path/to/model.onnx"

# ONNX Runtime session with default options
options = onnxruntime.SessionOptions()
ort_session = onnxruntime.InferenceSession(optimum_model_path, sess_options=options)
ort_model_eval = ORTModelForSequenceClassification(ort_session)

# PyTorch model for comparison
transformers_model_eval = AutoModelForSequenceClassification.from_pretrained(model_name)

# dummy batch: 8 sequences of length 128
inputs = {}
inputs["input_ids"] = torch.randint(high=1000, size=(8, 128))
inputs["attention_mask"] = torch.ones(8, 128, dtype=torch.int64)

print("Running ONNX Runtime.")
for i in tqdm(range(10)):  # warmup
    ort_model_eval(**inputs)
start = time.time()
for i in tqdm(range(20)):
    ort_model_eval(**inputs)
print("Time using ONNX Runtime:", time.time() - start)

time.sleep(4)

print("Running PyTorch.")
for i in tqdm(range(10)):  # warmup
    transformers_model_eval(**inputs)
start = time.time()
for i in tqdm(range(20)):
    transformers_model_eval(**inputs)
print("Time using PyTorch:", time.time() - start)
Thanks for the code @fxmarty, I modified it a bit to compare a standard ONNX model to a transformers model and to set the intra_op_num_threads ONNX session property. Setting it looks like it makes ONNX Runtime use all CPU threads.
This is what the CPU utilization looked like before I set intra_op_num_threads (the ONNX model is the first peak, transformers is the second):

[screenshot: CPU utilization before setting intra_op_num_threads]

After setting intra_op_num_threads=torch.get_num_threads():

[screenshot: CPU utilization after setting intra_op_num_threads]
import time

import onnxruntime
import torch
from tqdm import tqdm
from optimum.onnxruntime.modeling_ort import ORTModelForSequenceClassification
from transformers import AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
# where to save the ONNX model
onnx_dir = "onnx"
optimum_model_path = f"{onnx_dir}/model.onnx"

num_threads = torch.get_num_threads()
print(f"num_threads: {num_threads}")
num_iterations = 40  # how many times to process the dummy data
# ------------------------------------------------------------------------------------------

# dummy data
inputs = {}
inputs["input_ids"] = torch.randint(high=1000, size=(8, 128))
inputs["attention_mask"] = torch.ones(8, 128, dtype=torch.int64)

# convert to an ONNX model and save it
optimum_orig = ORTModelForSequenceClassification.from_pretrained(model_name, from_transformers=True)
optimum_orig.save_pretrained(onnx_dir)

# onnxruntime model with an explicit thread count
options = onnxruntime.SessionOptions()
options.intra_op_num_threads = num_threads
ort_session = onnxruntime.InferenceSession(optimum_model_path, sess_options=options)
ort_model_eval = ORTModelForSequenceClassification(ort_session)

# transformers model (pytorch)
transformers_model_eval = AutoModelForSequenceClassification.from_pretrained(model_name)

print("Preparing to run...")
time.sleep(4)

print("Running ONNX Runtime...")
start = time.time()
for i in tqdm(range(num_iterations)):
    ort_model_eval(**inputs)
print("Time using ONNX Runtime:", time.time() - start)

time.sleep(4)

print("Running Transformers (pytorch)...")
start = time.time()
for i in tqdm(range(num_iterations)):
    transformers_model_eval(**inputs)
print("Time using Transformers (pytorch):", time.time() - start)
https://github.com/huggingface/optimum/pull/271 has been merged; it lets you fix the number of cores/threads used.
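For reference, a hedged sketch of what using the merged feature might look like; the exact keyword argument (assumed here to be session_options) may differ, so check #271 and the optimum docs for the final API:

```python
import onnxruntime
import torch
from optimum.onnxruntime import ORTModelForSequenceClassification

options = onnxruntime.SessionOptions()
options.intra_op_num_threads = torch.get_num_threads()

# `session_options` is an assumption based on the PR description
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True,
    session_options=options,
)
```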