
Intel OpenVINO backend

Open dkurt opened this issue 3 years ago • 30 comments

Proposed changes:

resolves https://github.com/RasaHQ/rasa/issues/9849

This PR introduces an Intel OpenVINO backend for optimizing HuggingFace networks. The pip package openvino provides a runtime for CPU, iGPU, VPU, and other hardware targets.

Enabled models:

  • bert-base-chinese
  • bert-base-uncased
  • distilbert-base-uncased
  • gpt2
  • openai-gpt
  • rasa_LaBSE
  • roberta-base

Disabled for now: "xlnet"
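
For context, the openvino pip package exposes the Inference Engine Python API. A minimal usage sketch (independent of this PR, assuming an already converted IR file model.xml with hypothetical input names input_ids/attention_mask):

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network("model.xml")      # the matching model.bin is picked up automatically
exec_net = ie.load_network(net, "CPU")  # or "GPU", "MYRIAD", ...

inputs = {
    "input_ids": np.zeros((1, 9), dtype=np.int32),
    "attention_mask": np.ones((1, 9), dtype=np.int32),
}
outputs = exec_net.infer(inputs)        # dict: output name -> np.ndarray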

Status (please check what you already did):

  • [x] added some tests for the functionality
  • [x] updated the documentation
  • [x] updated the changelog (please check changelog for instructions)
  • [x] reformat files using black (please check Readme for instructions)

Enable DoD

  • [ ] ML engineer review
  • [ ] talk to infra about CI changes

dkurt avatar Oct 08 '21 06:10 dkurt

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Oct 08 '21 06:10 CLAassistant

@alopez, hi! Are there examples I can run for efficiency analysis? I'd like to measure the throughput of the proposed backend.

dkurt avatar Oct 08 '21 13:10 dkurt

Intel OpenVINO backend for HuggingFace networks optimization

Hi @dkurt. Can you help me better understand the use case for this? When and why do you imagine Rasa Open Source users will use this? How will it help them to build or maintain their conversational interfaces?

TyDunn avatar Oct 18 '21 07:10 TyDunn

Hi, @TyDunn!

OpenVINO optimizes inference time (latency and throughput), so users could use this feature to improve the efficiency of their pipelines.

To confirm the speed-up, I wanted to ask for efficiency benchmarks I could use.

dkurt avatar Oct 18 '21 08:10 dkurt

@dkurt To confirm what speed up? Can you describe what sort of efficiency benchmarks you are looking for and what you would like to do with them?

TyDunn avatar Oct 19 '21 09:10 TyDunn

@TyDunn, the latency of a deep-learning-based pipeline (milliseconds per request, or anything similar). Any kind of example/demo that uses Hugging Face models inside Rasa would help. I wanted to compare TensorFlow-based inference against the OpenVINO engine.

dkurt avatar Oct 19 '21 12:10 dkurt

Hi @dkurt. I am assuming you already have a bot to try this out with, but if not, feel free to use the bot that comes with the command rasa init or the bot in the examples/moodbot folder of the rasa repo.

Next, you can use this config:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    constrain_similarities: true
    epochs: 100
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

This should use the BERT model inside the LanguageModelFeaturizer component.

dakshvar22 avatar Oct 19 '21 12:10 dakshvar22

@dkurt Thanks for answering all of my questions. I am trying to understand the context of what you are doing here. What steps would users need to take to use OpenVINO to optimize the latency of a deep-learning-based pipeline?

TyDunn avatar Oct 19 '21 12:10 TyDunn

@dakshvar22, @TyDunn, here is what I've tried:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    use_openvino: true
    openvino_max_length: 9
  - name: DIETClassifier
    constrain_similarities: true
    epochs: 100
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

Here use_openvino enables the OpenVINO backend and openvino_max_length is a parameter that initializes the model for a fixed input length.

Then, after rasa init and rasa train I do

rasa test --stories data/stories.yml

OpenVINO backend: 5s (13.54it/s)

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 68/68 [00:05<00:00, 13.54it/s]

TensorFlow backend: 7s (8.79it/s)

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 68/68 [00:07<00:00,  8.79it/s]

Is there a heavier pipeline I can use for benchmarking?

dkurt avatar Oct 29 '21 13:10 dkurt

Hi! Just wanted to ask if you have some comments here.

dkurt avatar Nov 09 '21 12:11 dkurt

@dkurt Thanks for running that quick experiment.

Question: When you say TensorFlow backend, what config do you use? I am particularly interested in the max length of the input in that case. I can see that you used 9 when use_openvino is True, but I can't see the value when it's False (assuming that's when the TF backend kicks in).

dakshvar22 avatar Nov 10 '21 12:11 dakshvar22

@dakshvar22,

Sure, sorry for the incomplete description. "TensorFlow backend" means the default backend for lm_featurizer models from HuggingFace (which is TensorFlow). So the config is just:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
  - name: DIETClassifier
    constrain_similarities: true
    epochs: 100
  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

The options use_openvino: true and openvino_max_length: 9 are optional. use_openvino enables OpenVINO when set to true.

openvino_max_length: 9 is a config option specific to OpenVINO. There is currently a limitation that requires a long internal reinitialization if the model processes variable-length inputs. This parameter is similar to the tokenizer's max_length + pad_to_max_length: we pad the input to a fixed length and then just cut the output. I've named this value with the openvino_ prefix to indicate that it's OpenVINO-only.
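
To make the idea concrete, here is a minimal sketch of the pad-then-cut behaviour (a hypothetical helper, not the PR's exact code):

import numpy as np

def pad_to_fixed_length(token_ids: np.ndarray, max_length: int) -> np.ndarray:
    """Pad (or truncate) a (batch, seq_len) int array to (batch, max_length) with zeros."""
    batch, seq_len = token_ids.shape
    padded = np.zeros((batch, max_length), dtype=token_ids.dtype)
    keep = min(seq_len, max_length)
    padded[:, :keep] = token_ids[:, :keep]
    return padded

# After inference on the padded input, only the first seq_len positions of the
# output are kept, e.g. sequence_output = raw_output[:, :seq_len, :]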

dkurt avatar Nov 10 '21 12:11 dkurt

Right, that's what I thought. Can you set that parameter to the maximum sequence length you would get on the dataset from rasa init? That will ensure that both backends are processing sequences of the same length, which removes any timing differences caused by differing sequence lengths.

dakshvar22 avatar Nov 10 '21 13:11 dakshvar22

@dkurt Let's try running these benchmarks on a larger dataset. I'd suggest cloning one of our public bots: Carbon bot. Once you have cloned it, you can change the config.yml file to the config you have been using above.

For benchmarking, what really matters is the time taken by LanguageModelFeaturizer inside its train and process methods. It would be great if you could set up a way to track these times inside the respective functions (simple calls to time.time() at the start and end of each function would do?).

To record the time in train, you can run rasa train nlu and see the times reported by your custom setup.

To record the time in process, what you really want to do is measure how much time it takes to process one message. We'll have to write a bit of code for this and do some things manually.

I think something like this would work:

from rasa.shared.nlu.training_data.loading import load_data
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer

data = load_data(<path to nlu.yml of carbon bot>)
featurizer = LanguageModelFeaturizer({"model_name": <>, "model_weights": <>, "cache_dir": <>})

for example in data.training_examples:
    featurizer.process(example)

We should record the average amount of time it takes for featurizer.process(example) to run.
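
For example, a minimal way to record that average (a sketch reusing data and featurizer from the snippet above):

import time

times = []
for example in data.training_examples:
    start = time.perf_counter()
    featurizer.process(example)
    times.append(time.perf_counter() - start)

print(f"average process() time: {sum(times) / len(times):.4f} s over {len(times)} messages")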

(I know it would be great to have all of this automated, and we are currently in the process of setting up that infrastructure, but I don't want you to be blocked by it, so we would appreciate it if you could try this out manually for now :) )

dakshvar22 avatar Nov 10 '21 14:11 dakshvar22

Hi, @dakshvar22. Thanks for the help!

I used the following script to measure process efficiency:

Script
import time
import logging
import numpy as np
from rasa.shared.nlu.training_data.loading import load_data
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from rasa.engine.graph import (
    ExecutionContext,
    GraphSchema,
)
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer

logging.basicConfig()
logging.getLogger("rasa").setLevel(logging.INFO)

data = load_data('data/nlu.yml')

def create_whitespace_tokenizer(config=None):
    return WhitespaceTokenizer(
        {**WhitespaceTokenizer.get_default_config(), **(config if config else {}),}
    )

tk = create_whitespace_tokenizer()
tk.process_training_data(data)

def run(use_openvino):
    featurizer = LanguageModelFeaturizer({"model_name": "bert",
                                          "model_weights": "rasa/LaBSE",
                                          "cache_dir": None,
                                          "use_openvino": use_openvino,
                                          "openvino_max_length": 77,
                                          "alias": ""},
                                         execution_context=ExecutionContext(GraphSchema({}), "1"))

    total_start = time.time()
    times = []
    for example in data.training_examples:
        start = time.time()
        featurizer.process([example])
        times.append(time.time() - start)

    print(f"Use OpenVINO={use_openvino}")
    print(f"Total time: {time.time() - total_start} seconds ({len(times)} messages)")
    print(f"Median for one message: {np.median(times)} seconds")
    print()

run(use_openvino=False)
run(use_openvino=True)

Results:

Use OpenVINO=False
Total time: 230.84485697746277 seconds (2517 messages)
Median for one message: 0.08835482597351074 seconds

Use OpenVINO=True
Total time: 122.20457315444946 seconds (2517 messages)
Median for one message: 0.04765892028808594 seconds

CPU: Intel(R) Core(TM) i7-6700K

OpenVINO relies on the manually specified openvino_max_length value to pad the input tokens with zeros instead of performing an internal model reinitialization (I believe a feature is coming in future OpenVINO releases that will make this parameter unnecessary).

dkurt avatar Nov 11 '21 07:11 dkurt

@dkurt Those are some great results! Thanks for carrying out that experiment.

For the sake of completeness, I would ask you to carry out one more experiment: we recently upgraded the TensorFlow version in our dependencies, and that brings a slowdown in the process method of LanguageModelFeaturizer. There are more details on the impact of upgrading TensorFlow in this blog post.

To completely rule out its effect on the numbers we are seeing above, can you please create a new PR, this time targeting the 2.7.x branch of Rasa OSS, port the changes required for OpenVINO there, and run the same benchmark? You can leave this PR open. If we decide to merge it, we'll still merge it into main, but it would be good to have numbers with a version of Rasa OSS that is not affected by the performance degradation.

Thanks again for your patience on this contribution! 🙏

dakshvar22 avatar Nov 11 '21 08:11 dakshvar22

@dakshvar22, indeed there is a difference in the TensorFlow numbers. I rebased the changes onto the 2.7.x branch: https://github.com/dkurt/rasa/tree/2.7.x_openvino

Adapted script
import time
import logging
import numpy as np
from rasa.shared.nlu.training_data.loading import load_data
from rasa.nlu.featurizers.dense_featurizer.lm_featurizer import LanguageModelFeaturizer
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer

logging.basicConfig()
logging.getLogger("rasa").setLevel(logging.INFO)

data = load_data('data/nlu.yml')

def create_whitespace_tokenizer(config=None):
    return WhitespaceTokenizer(
        {**WhitespaceTokenizer.defaults, **(config if config else {}),}
    )

tk = create_whitespace_tokenizer()
for example in data.training_examples:
    tk.process(example)


def run(use_openvino):
    featurizer = LanguageModelFeaturizer({"model_name": "bert",
                                          "model_weights": "rasa/LaBSE",
                                          "cache_dir": None,
                                          "use_openvino": use_openvino,
                                          "openvino_max_length": 77,
                                          "alias": ""})

    total_start = time.time()
    times = []
    for example in data.training_examples:
        start = time.time()
        featurizer.process(example)
        times.append(time.time() - start)

    print(f"Use OpenVINO={use_openvino}")
    print(f"Total time: {time.time() - total_start} seconds ({len(times)} messages)")
    print(f"Median for one message: {np.median(times)} seconds")
    print()

run(use_openvino=False)
run(use_openvino=True)

Results:

Use OpenVINO=False
Total time: 196.52477836608887 seconds (2517 messages)
Median for one message: 0.07500743865966797 seconds

Use OpenVINO=True
Total time: 123.32836174964905 seconds (2517 messages)
Median for one message: 0.04769301414489746 seconds

TensorFlow is 2.3.4

dkurt avatar Nov 11 '21 09:11 dkurt

Thanks @dkurt ! Looking at the description of your PR, it seems like we need to maintain openvino "compatible" versions of model weights:

  1. What does it take to convert any model weights of a specific architecture (like BERT, etc.) to OpenVINO format?
  2. Do you all plan to actively support the models hosted at https://huggingface.co/dkurt/openvino?

dakshvar22 avatar Nov 17 '21 16:11 dakshvar22

@dakshvar22, the models can be generated at runtime by the following method (already part of the PR):

Conversion code
    def _load_model(self, input_ids: np.ndarray, attention_mask: np.ndarray) -> None:
        # Serialize a Keras model
        @tf.function(
            input_signature=[
                {
                    "input_ids": tf.TensorSpec(
                        (None, None), tf.int32, name="input_ids"
                    ),
                    "attention_mask": tf.TensorSpec(
                        (None, None), tf.int32, name="attention_mask"
                    ),
                }
            ]
        )
        def serving(inputs: List[tf.TensorSpec]) -> tf.TensorSpec:
            output = self.model.call(inputs)
            return output[0]

        self.model.save("keras_model", signatures=serving)

        # Convert to OpenVINO IR
        proc = subprocess.Popen(
            [
                sys.executable,
                "-m",
                "mo",
                "--saved_model_dir=keras_model",
                "--model_name",
                self.model_name,
                "--input",
                "input_ids,attention_mask",
                "--input_shape",
                "{},{}".format(input_ids.shape, attention_mask.shape),
                "--disable_nhwc_to_nchw",
                "--data_type=FP16",
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        proc.communicate()

        # Load model into memory
        self.net = self.ie.read_network(self.model_name + ".xml")

However, it's a time-consuming procedure, so I uploaded the models to the hub. Maybe it's better to modify the proposal to convert the model once for each user and store it in the cache?
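
As an illustration, a minimal sketch of the convert-once-and-cache idea (hypothetical helper and paths, not the PR's exact code):

import os
import subprocess
import sys

def get_or_convert_model(saved_model_dir: str, cache_dir: str, model_name: str) -> str:
    """Return the path to a cached OpenVINO IR, running Model Optimizer only once."""
    xml_path = os.path.join(cache_dir, model_name + ".xml")
    if not os.path.exists(xml_path):
        os.makedirs(cache_dir, exist_ok=True)
        subprocess.run(
            [
                sys.executable, "-m", "mo",
                "--saved_model_dir", saved_model_dir,
                "--model_name", model_name,
                "--output_dir", cache_dir,
                "--data_type=FP16",
            ],
            check=True,
        )
    return xml_path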

Actually, OpenVINO maintains compatibility across model format versions. Also, the hub supports model versioning, so we can re-upload the models for new releases.

dkurt avatar Nov 18 '21 08:11 dkurt

@dakshvar22, I made the following changes:

  • Models are now converted at runtime, with no external reference to a custom hub
  • Once converted, OpenVINO models are saved in the common cache
  • CI tests are skipped for all models except distilbert; the others consume too much memory and the Actions runner crashes. They work fine locally.

:pray: many thanks for your reviews and patience!

dkurt avatar Nov 20 '21 11:11 dkurt

@dkurt Summarizing all the findings here:

  1. This PR attempts to speed up the inference time inside LanguageModelFeaturizer on CPU environments.
  2. Internal benchmarking suggests a speedup of 1.37x when using the openvino backend for inference on CPU.
  3. All models need to be converted to the openvino format before they can be used with the openvino backend. This can be done offline (hosting the converted models on the HF model hub) or on the fly as a one-time step when the model is loaded.
  4. Users must specify a maximum input sequence length when using the openvino backend.

Does this sum up the important points you would like us to know about the contribution?

dakshvar22 avatar Nov 22 '21 13:11 dakshvar22

@dakshvar22, yes, completely! Do you think I should put this summary into changelog/9826.feature.md or docs/docs/openvino-backend.mdx?

dkurt avatar Nov 22 '21 16:11 dkurt

@dkurt Thanks for confirming! I'd hold off making any more changes to the PR. I've communicated the intended effects of this PR to product management internally and we'll be back soon with the next steps. Thanks for your patience 🙏

dakshvar22 avatar Nov 23 '21 10:11 dakshvar22

Hi! Do you think it might reduce the maintenance effort and help with the integration decision if we wrap the OpenVINO-related code into a separate package? For example, similar to Hugging Face's Optimum extension, there is https://github.com/dkurt/optimum-openvino (work in progress).

dkurt avatar Dec 27 '21 15:12 dkurt

Hi @dkurt, thanks for the PR and sorry for making you wait so long. We're currently trying to decide whether we should keep this in the codebase or make it into a separate package. From what I understand, if we go with a separate package you'll be mainly responsible for maintaining it. If this is correct and if you're willing to do it, then what I'd like to know is: 1. how do you imagine supporting it, i.e. what kind of maintenance do you intend to do? and 2. will you need any assistance from us?

jupyterjazz avatar Jan 25 '22 10:01 jupyterjazz

Hi, @jupyterjazz! No worries :) I have opened a PR which demonstrates how we can move the OpenVINO-related code into a separate package: https://github.com/dkurt/rasa/pull/5.

  1. how do you imagine supporting it, i.e. what kind of maintenance do you intend to do?

The package will be regularly tested against the same version of the Transformers library that Rasa uses. For now it's under my account, but if that is more comfortable for you, we can initiate a transfer to the https://github.com/openvinotoolkit/ org.

  2. will you need any assistance from us?

Thanks! Just a review. I will keep you posted on how the integration goes and can help with the user experience.

dkurt avatar Jan 26 '22 15:01 dkurt

@dakshvar22, @jupyterjazz, good news! I reproduced the same experiment as https://github.com/RasaHQ/rasa/pull/9826#issuecomment-966051532 but with the upcoming OpenVINO release (2022.1). It introduces a dynamic-shapes feature that allows processing variable input sizes, so the openvino_max_length parameter can be avoided.

Thus, the performance is:

OpenVINO 2022.1 openvino_max_length 77
Total time: 140.69154858589172 seconds (2517 messages)
Median for one message: 0.05508852005004883 seconds

OpenVINO 2022.1 dynamic shapes (no openvino_max_length)
Total time: 69.31450295448303 seconds (2517 messages)
Median for one message: 0.027573347091674805 seconds
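
For reference, a minimal sketch of using dynamic input shapes with the OpenVINO 2022.1 runtime API (assuming an already converted IR model.xml with input_ids/attention_mask inputs; not the PR's exact code):

import numpy as np
from openvino.runtime import Core, PartialShape

core = Core()
model = core.read_model("model.xml")
# Mark batch and sequence dimensions as dynamic so inputs of any length work
model.reshape({"input_ids": PartialShape([-1, -1]),
               "attention_mask": PartialShape([-1, -1])})
compiled = core.compile_model(model, "CPU")

request = compiled.create_infer_request()
ids = np.random.randint(0, 1000, (1, 12), dtype=np.int32)
mask = np.ones_like(ids)
result = request.infer({"input_ids": ids, "attention_mask": mask})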

dkurt avatar Feb 17 '22 07:02 dkurt

Hi! I have updated this PR to use a separate package for the OpenVINO-related logic. I'm going to maintain and develop it. I've also initiated a transfer procedure to https://github.com/openvinotoolkit/.

dkurt avatar Feb 18 '22 14:02 dkurt

Hi! Glad to report that we have transferred the package to the OpenVINO org (here). I have updated this pull request accordingly.

dkurt avatar Mar 04 '22 11:03 dkurt

@jupyterjazz, @dakshvar22, I have updated the PR to use the latest OpenVINO version, 2022.1, and removed the max_length option.

dkurt avatar Mar 23 '22 10:03 dkurt