
[onert] Run Batch request in parallel manner via direct call to trix-engine

Open chunseoklee opened this issue 2 years ago • 22 comments

This is for tracking "Milestone1 : Run Batch request in parallel manner via direct call to trix-engine(~Tizen M2, Aug 30th)" from https://github.com/Samsung/ONE/projects/8

User scenario

  • The user will use the nnfw api to initialize an nnfw_session with a 1-batch model (e.g. mobilenet's tvn binary) as usual
  • onert core will do tvn BATCH execution if the tvn is a 1-batch model and the user's input shape is multi-batch (i.e. the user's batch is a multiple of the model's batch); see the sketch after this list
    • Q. how to determine multi-batch input w.r.t. nnfw api ?
      • A. nnfw_set_input_tensorinfo(session, input_index, ti);
  • Note that the CPU backend supports batched inference (though each kernel in the model should support batched inference)
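
For illustration, here is a minimal sketch of the "multiple of the model's batch" check mentioned above. The function name and the assumption that the batch is dimension 0 are mine, not the actual onert code:

  #include <cstdint>
  #include <vector>

  // Hypothetical helper: how many 1-batch runs does the user's request map to?
  // Returns 0 if the requested shape is not a batch-multiple of the model's shape.
  // Assumes the batch is dimension 0 and all other dimensions must match exactly.
  int32_t batch_multiple(const std::vector<int32_t> &model_dims,
                         const std::vector<int32_t> &user_dims)
  {
    if (model_dims.size() != user_dims.size() || model_dims.empty())
      return 0;
    for (size_t i = 1; i < model_dims.size(); ++i)
      if (model_dims[i] != user_dims[i])
        return 0;
    if (model_dims[0] <= 0 || user_dims[0] % model_dims[0] != 0)
      return 0;
    return user_dims[0] / model_dims[0]; // e.g. [4,244,244,3] vs [1,244,244,3] -> 4
  }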

Todo

  • [ ] Let's analyze trix-engine's parallel execution capability
  • [ ] Implement in TRIX backend

Ref. How does batch execution work when running a tflite model in onert?

  • Suppose that the model's input shape is [1,244,244,3]
  • Then, the user will request an inference with an input shape of [4,244,244,3] as follows:
  nnfw_session *session = nullptr;
  nnfw_create_session(&session);

  // Loading nnpackage
  nnfw_load_model_from_file(session, path_to_nnpkg);

  // input shape is modified (dims is a plain array, so assign element-wise)
  nnfw_tensorinfo ti;
  ti.dtype = NNFW_TYPE_TENSOR_FLOAT32; // assuming a float model input
  ti.rank = 4;
  ti.dims[0] = 4; ti.dims[1] = 244; ti.dims[2] = 244; ti.dims[3] = 3;
  nnfw_set_input_tensorinfo(session, input_index, &ti); // the API takes a pointer

  // compile model
  nnfw_prepare(session);

  // Prepare input. Here we just allocate dummy input arrays.
  std::vector<float> input;
  nnfw_input_tensorinfo(session, 0, &ti); // get first input's info
  uint32_t input_elements = num_elems(&ti);
  input.resize(input_elements);
  // TODO: Please add initialization for your input.
  nnfw_set_input(session, 0, ti.dtype, input.data(), sizeof(float) * input_elements);

  // Prepare output
  std::vector<float> output;
  nnfw_output_tensorinfo(session, 0, &ti); // get first output's info
  uint32_t output_elements = num_elems(&ti);
  output.resize(output_elements);
  nnfw_set_output(session, 0, ti.dtype, output.data(), sizeof(float) * output_elements);

  // Do inference
  nnfw_run(session);
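
The snippet above calls num_elems, which is not part of the nnfw C API itself; a small helper along these lines is assumed:

  #include <cstdint>
  #include <nnfw.h> // for nnfw_tensorinfo

  // Helper assumed by the snippet above: total number of elements described by a tensorinfo.
  static uint64_t num_elems(const nnfw_tensorinfo *ti)
  {
    uint64_t n = 1;
    for (int32_t i = 0; i < ti->rank; ++i)
      n *= ti->dims[i];
    return n;
  }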

chunseoklee avatar May 30 '22 08:05 chunseoklee

@ragmani FYI,

How to check dynamic shape inference with nnpackage_run:

adb shell /data/local/Product/out/bin/nnpackage_run /data/local/tmp/model_nnpkg --shape_run="[0,[1,64],3,[1],2,[1,64],1,[1,64,882],4,[1,64]]" --output_sizes="[2,256]"  -r 1 -l /data/local/tmp/input.h5 -d /data/local/tmp/output.h5

chunseoklee avatar Aug 01 '22 07:08 chunseoklee

I have summarized my thoughts on this work. If you have questions or anything to discuss about it, please contact me or leave a comment.

TODO

  • [ ] Add tests

    • [ ] Add a test script that compares the results of running nnpackage_run(tvn) batch-size times against the results of running nnpackage_run(tvn) once with the full batch in a parallel manner (implemented by this work)
      • Purpose: Verification of performance and functionality
      • Verification method: Comparison of all outputs, comparison of performance
    • [ ] Add tests that compare outputs of the simulator and nnpackage_run(tvn). This requires installing the simulator and using the Jenkins pipeline.
      • Purpose: Verification of outputs
      • Verification method: PEIR (Peak error to interval ratio); see the sketch after this list
    • [ ] Add tests that compare outputs of nnpackage_run(circle) and nnpackage_run(tvn). This requires a circle model, and it requires that the cpu backend supports batch requests (needed in case running the full batch at once, without the parallel manner, and running batch-size times do not always produce the same results).
      • Purpose: Verification of outputs
      • Verification method: PEIR (Peak error to interval ratio)
  • [ ] Support for trix backend TBD

  • [ ] Support for cpu backend (this task may not be needed) TBD
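
For reference, a minimal sketch of the PEIR metric mentioned above, using the common reading "peak absolute error divided by the value interval of the reference output" (the exact definition used by the existing test infrastructure should be double-checked):

  #include <algorithm>
  #include <cassert>
  #include <cmath>
  #include <vector>

  // PEIR = max_i |actual[i] - expected[i]| / (max(expected) - min(expected))
  double peir(const std::vector<float> &expected, const std::vector<float> &actual)
  {
    assert(expected.size() == actual.size() && !expected.empty());
    double peak_err = 0.0;
    for (size_t i = 0; i < expected.size(); ++i)
      peak_err = std::max(peak_err, std::fabs(double(actual[i]) - double(expected[i])));
    const auto mm = std::minmax_element(expected.begin(), expected.end());
    const double interval = double(*mm.second) - double(*mm.first);
    return interval > 0.0 ? peak_err / interval : peak_err;
  }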

Kinds of implementation method

In case of supporting trix backend only

This does not support partitioned models, because the batch size applies to a whole model while the trix backend covers only a part of the whole model.

  1. Creating an Execution with the model's batch size (1), then executing all batches within that Execution in a parallel manner (this is possible because the trix backend does not deal with internal tensors in a model); a sketch of this per-batch dispatch follows below. Pros: Simple implementation. Cons: Partitioned model not supported.
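
A minimal sketch of what "executing all batches within the Execution in a parallel manner" could look like at the dispatch level. Every name here is hypothetical; the per-batch run would ultimately be a request to trix-engine through the trix backend:

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <thread>
  #include <vector>

  // Hypothetical: runs the 1-batch model once on a single input/output slice.
  using RunSingleBatch = std::function<void(const void *in, void *out)>;

  // Dispatch `batches` independent 1-batch runs in parallel. Assumes one input and
  // one output, both contiguous and batch-major, with per-batch sizes in bytes.
  void run_batches_in_parallel(const void *input, void *output, int32_t batches,
                               size_t in_bytes, size_t out_bytes,
                               const RunSingleBatch &run_one)
  {
    std::vector<std::thread> workers;
    for (int32_t b = 0; b < batches; ++b)
    {
      const char *in = static_cast<const char *>(input) + b * in_bytes;
      char *out = static_cast<char *>(output) + b * out_bytes;
      workers.emplace_back([&run_one, in, out] { run_one(in, out); });
    }
    for (auto &w : workers)
      w.join();
  }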

In case of supporting cpu backend as well

These can support partitioned models.

  1. Creating an Execution with the batch_size obtained in the prepare step, then executing all batches within that Execution in a parallel manner. Pros: Performance (for the first execution), memory usage (in case the dynamic shape inferer does not support memory optimization). Cons: This does not support a scenario where the batch size changes for each execution.
  2. Creating an Execution with the model's batch_size (always 1?), cloning the Execution as many times as the batch size, and executing all Executions, each assigned to a batch, in a parallel manner. Pros: This can support a scenario where the batch size changes per execution. Cons: Performance (whenever the batch size changes); I'm not sure if there is another problem.
  3. Creating an Execution with the model's batch_size (always 1?), reallocating memories of internal tensors within the Execution, and executing the Execution in a parallel manner per batch. Pros: This can support a scenario where the batch size changes per execution. Cons: Performance (whenever the batch size changes), complicated implementation.
  4. ...

Questions or Discussion subjects

These are questions I'm curious about or matters to be discussed.

  • Is there any reason to support cpu backend as well? My answer is "to support partitioned models as well".
  • Is it okay to assume that the batch size of a whole model (tvn, circle) is always 1 even if we also support the cpu backend?
  • Are we sure that the results of running the full batch at once and the results of running batch-size times are always the same?
  • Should we support changing batch size for each execution?
  • How can the intention of users to change batch size be distinguished? By adding a specific api? By adding a parameter of nnfw_set_input_tensorinfo? ...

ragmani avatar Aug 02 '22 10:08 ragmani

Is there any reason to support cpu backend as well?

For now, not yet.

IMHO, you can proceed under the assumption of "trix backend only" as a first step.

chunseoklee avatar Aug 09 '22 02:08 chunseoklee

  • How can the intention of users to change batch size be distinguished? By adding a specific api? By adding a parameter of nnfw_set_input_tensorinfo?

onert also needs to distinguish which of a model's inputs has the batch. To distinguish the batch input and the user's intention to execute a model in a parallel manner, I think we need to add an nnfw api. I suggest the approach below:

 typedef struct nnfw_tensorinfo
 {
   /** The data type */
   NNFW_TYPE dtype;
   /** The number of dimensions (rank) */
   int32_t rank;
   /**
    * The dimension of tensor.
    * Maximum rank is 6 (NNFW_MAX_RANK).
    */
   int32_t dims[NNFW_MAX_RANK];
+  bool has_parallel_batches[NNFW_MAX_RANK];
 } nnfw_tensorinfo;

I'm not sure if this is the best way, but this way allows the user to specify which input will be executed in parallel.
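
With such a field, usage could look like the snippet below (purely illustrative: has_parallel_batches is only a proposal, and session/input_index are assumed to exist as in the earlier example):

  nnfw_tensorinfo ti;
  ti.dtype = NNFW_TYPE_TENSOR_FLOAT32;
  ti.rank = 4;
  ti.dims[0] = 4; ti.dims[1] = 224; ti.dims[2] = 224; ti.dims[3] = 3;
  // Mark dimension 0 as the batch dimension to be executed in parallel.
  ti.has_parallel_batches[0] = true;
  for (int32_t i = 1; i < ti.rank; ++i)
    ti.has_parallel_batches[i] = false;
  nnfw_set_input_tensorinfo(session, input_index, &ti);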

ragmani avatar Aug 17 '22 04:08 ragmani

@ragmani Q. has_parallel_batches of [1,0,0,0] means that 4 is the batch dimension in input shape [4,224,224,3]?

chunseoklee avatar Aug 17 '22 04:08 chunseoklee

Q. has_parallel_batches of [1,0,0,0] means that 4 is the batch dimension in input shape [4,224,224,3]?

Yes, has_parallel_batches of [1,0,0,0] means that the 1st dimension is the batch. This member is simple but has some implications:

  1. It marks the position of the batch among the dimensions of an input.
  2. Inputs for which one of these dimensions is true will be executed in parallel.
  3. Models that have one or more such inputs will be executed in parallel.

If we can be sure that the batch position is always the 1st dimension, this member can be a single variable instead of an array.

ragmani avatar Aug 17 '22 05:08 ragmani

Thinking about it again, just the change below is enough.

typedef struct nnfw_tensorinfo
 {
   /** The data type */
   NNFW_TYPE dtype;
   /** The number of dimensions (rank) */
   int32_t rank;
   /**
    * The dimension of tensor.
    * Maximum rank is 6 (NNFW_MAX_RANK).
    */
   int32_t dims[NNFW_MAX_RANK];
+  int32_t parallel_batch_dim;
 } nnfw_tensorinfo;

And I heard from @glistening that there is a way to use config; we could use that way instead of adding an api (a minimal sketch follows below). Anyway, this suggestion does not need to be decided right now, since for now we can implement this task under the assumption that the parallel batch position is the batch of the mv model.
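
For the config-based alternative, a minimal sketch (the key name PARALLEL_BATCH_DIM is made up here, and onert's actual config mechanism may differ):

  #include <cstdint>
  #include <cstdlib>
  #include <string>

  // Hypothetical: read the parallel batch dimension from an environment-style config,
  // returning -1 ("no parallel batch") when the key is unset.
  int32_t parallel_batch_dim_from_config()
  {
    const char *v = std::getenv("PARALLEL_BATCH_DIM"); // made-up key
    return v ? static_cast<int32_t>(std::stol(v)) : -1;
  }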

ragmani avatar Aug 18 '22 09:08 ragmani

I have a plan to change the member _lowered_graph of exec::ExecutorBase to a reference or pointer for this task. I think it would be better to use shared_ptr to allow sharing ownership between executors. https://github.com/Samsung/ONE/blob/686670dd73a0a0cf56d84a9e25e918a570328918/runtime/onert/core/src/exec/ExecutorBase.h#L87-L90 But I'm not sure whether it will affect other tasks such as #9610, or whether there are other issues I'm not aware of. @Samsung/one_onert please share any opinion on this plan.

ragmani avatar Aug 23 '22 10:08 ragmani

@ragmani I think it is okay, but @hseok-oh may have something that I am missing.

glistening avatar Aug 24 '22 00:08 glistening

I have a plan to change the member _lowered_graph of exec::ExecutorBase to a reference or pointer for this task

IMO, one lowered graph should create one executor because the lowered graph includes the compile result. Could you explain why we need this change? Do you want to create multiple executors from one lowered graph? Or do you want to maintain a reference to another executor's lowered graph on an executor?

hseok-oh avatar Aug 24 '22 01:08 hseok-oh

IMO, one lowered graph should create one executor because the lowered graph includes the compile result. Could you explain why we need this change?

I have a plan to clone the one-batch executor into multiple executors that share a compiled lowered graph at execution time of onert. I thought it would be a waste of memory if the lowered graph were also cloned.

Do you want to create multiple executors from one lowered graph?

Yes, I want to create multiple executors from the compiled lowered graph of the executor that is to be cloned into multiple executors.

Or do you want to maintain a reference to another executor's lowered graph on an executor?

I plan to use shared_ptr, and the ownership of the lowered graph is not going to be shared among non-cloned executors.

ragmani avatar Aug 24 '22 04:08 ragmani

I don't know your big picture, but maybe you want to execute multiple executors at once. But I have a concern about whether lowered graph access and executors sharing a lowered graph are thread-safe. Anyway, for code implementation safety, please use a const shared_ptr (std::shared_ptr<const LoweredGraph>) on ExecutorBase if you want to use shared_ptr. And delete LoweredGraph's copy constructor for both const and non-const.
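
Concretely, that suggestion amounts to something like the sketch below (class and member names as used in this thread, not the actual headers):

  #include <memory>

  // Sketch only: forbid accidental copies of the compile result, as suggested.
  class LoweredGraph
  {
  public:
    LoweredGraph() = default;
    LoweredGraph(const LoweredGraph &) = delete;
    LoweredGraph &operator=(const LoweredGraph &) = delete;
  };

  // Sketch only: executors cloned from the same compile result share it read-only.
  class ExecutorBase
  {
  protected:
    std::shared_ptr<const LoweredGraph> _lowered_graph;
  };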

hseok-oh avatar Aug 24 '22 06:08 hseok-oh

I don't know your big picture, but maybe you want to execute multiple executors at once. But I have a concern about whether lowered graph access and executors sharing a lowered graph are thread-safe. Anyway, for code implementation safety, please use a const shared_ptr (std::shared_ptr<const LoweredGraph>) on ExecutorBase if you want to use shared_ptr

OK, I will do so.

And delete LoweredGraph's copy constructor for both const and non-const.

What is copy constructor for non-const?

ragmani avatar Aug 24 '22 06:08 ragmani

both const and non-const

Please ignore this comment.

hseok-oh avatar Aug 24 '22 06:08 hseok-oh

And delete LoweredGraph's copy constructor

OK I will.

ragmani avatar Aug 24 '22 10:08 ragmani

As I understand, @ragmani is going to run on multiple trix cores in parallel and there is only trix-backend. (https://github.com/Samsung/ONE/issues/9207#issuecomment-1208826198).

  1. Under this condition, there is no state on the cpu. If the input, output, and temporary segments for each npu core are provided separately, there may be no shared resources.

  2. It would be good if we can guarantee the lowered_graph is shared only under this condition (trix only backend and batch request).

glistening avatar Aug 25 '22 00:08 glistening

It would be good if we can guarantee the lowered_graph is shared only under this condition (trix only backend and batch request).

_lowered_graph of exec::ExecutorBase is already a completed result and does not change after executors based on exec::ExecutorBase are created. So I think there is no problem if _lowered_graph of exec::ExecutorBase is shared among executors, regardless of which backend is used.

ragmani avatar Aug 25 '22 05:08 ragmani

A little thought about "Q. how to determine multi-batch input w.r.t. nnfw api ?" and https://github.com/Samsung/ONE/pull/9583. I'd like to recognize the user's intention for batch execution w/o an additional feature like https://github.com/Samsung/ONE/pull/9583.

Thus, I wonder whether just invoking nnfw_set_input_tensorinfo(session, input_index, ti); with a larger input tensor size implies the user's batch intention. IMHO, it can be done by checking the input shapes of the Bulk op in the Static or Dynamic ShapeInferer. A pro of this approach is that there is no change from the user's point of view. But this approach assumes that Bulk's internal model only accepts static input shapes, which cannot be changed. I am not sure that this is a reasonable assumption and scenario.

chunseoklee avatar Aug 25 '22 07:08 chunseoklee

_lowered_graph of exec::ExecutorBase is already a completed result and does not change after executors based on exec::ExecutorBase are created. So I think there is no problem if _lowered_graph of exec::ExecutorBase is shared among executors, regardless of which backend is used.

Ah, in https://github.com/Samsung/ONE/issues/9207#issuecomment-1226630652, I was confused between ExecutorBase and LoweredGraph. I mistakenly thought you wanted to reuse ExecutorBase. : )

glistening avatar Aug 25 '22 08:08 glistening

IMHO, it can be done by checking the input shapes of the Bulk op in the Static or Dynamic ShapeInferer. A pro of this approach is that there is no change from the user's point of view. But this approach assumes that Bulk's internal model only accepts static input shapes, which cannot be changed. I am not sure that this is a reasonable assumption and scenario.

I also thought about this approach. And I considered the approach below for this issue:

  • Check that the input size is acceptable during shape inference
    • To handle only the simple case, allow only 1 input with rank 4 and 1 output with rank 4
  • Get the batch size from the input shape during Bulk op inference on the trix backend
  • Request to the trix system software batch-size times

I don't know whether we can implement this approach in our runtime core and backend.

hseok-oh avatar Aug 25 '22 08:08 hseok-oh

@chunseoklee

A little thought about "Q. how to determine multi-batch input w.r.t. nnfw api ?" and https://github.com/Samsung/ONE/pull/9583. I'd like to recognize the user's intention for batch execution w/o an additional feature like https://github.com/Samsung/ONE/pull/9583.

In the short term, I'd like to do that too. However, in the long run, I think a new feature (api or anything else) that recognizes the user's intention should be added. As I implicitly mentioned in https://github.com/Samsung/ONE/issues/9207#issuecomment-1217493357, there are some ambiguities in determining the intention for batch execution.

  1. Which of the user inputs contain batches?
  2. Which dimension is the batch among the dimensions of a batch input?
  3. If multiple inputs are affected by the batch and their sizes change, can onert be sure that those inputs will always be multiplied by the batch size? When we consider multi-batch, No. 3 is an additional ambiguity.

IMHO, it can be done by checking the input shapes of the Bulk op in the Static or Dynamic ShapeInferer. A pro of this approach is that there is no change from the user's point of view. But this approach assumes that Bulk's internal model only accepts static input shapes, which cannot be changed. I am not sure that this is a reasonable assumption and scenario.

Currently, the backends of onert handle everything inside an Executor. And onert core creates executors for each subgraph, but the batch is assigned in units of a model. From a long-term perspective, without this conceptual change, onert's backends cannot be sure which dimension of the used executor's inputs is the batch. So I'm trying to deal with batch execution in onert core.

ragmani avatar Aug 25 '22 15:08 ragmani

Just FYI. Initially, I tried to implement this task by cloning existing executors. However, I realized that this approach is difficult and complicated, so now I am trying to create executors for each batch by recompiling the existing LoweredGraph of the executors again in Execution.

ragmani avatar Aug 25 '22 16:08 ragmani