[onert] Run Batch request in parallel manner via direct call to trix-engine
This is for tracking "Milestone 1: Run Batch request in parallel manner via direct call to trix-engine (~Tizen M2, Aug 30th)" from https://github.com/Samsung/ONE/projects/8
User scenario
- The user uses the nnfw API to initialize an nnfw_session with a 1-batch model (e.g. mobilenet's tvn binary) as usual
- onert core performs batched tvn execution if the tvn is a 1-batch model and the user's input shape is multi-batch (i.e. the user's batch dimension is a multiple of the model's batch)
  - Q. How to determine a multi-batch input w.r.t. the nnfw API?
  - A. `nnfw_set_input_tensorinfo(session, input_index, ti);`
- Note that the CPU backend supports batched inference (though each kernel in the model should support batched inference)
Todo
- [ ] Let's analyze trix-engine's parallel execution capability
- [ ] Implement in TRIX backend
Ref. How does batch execution work when running a tflite model in onert?
- Suppose that the model's input shape is [1,244,244,3]
- Then, the user requests an inference with an input shape of [4,244,244,3] as follows:
```cpp
nnfw_session *session = nullptr;
nnfw_create_session(&session);

// Loading nnpackage
nnfw_load_model_from_file(session, path_to_nnpkg);

// Input shape is modified to a multi-batch shape
nnfw_tensorinfo ti;
ti.rank = 4;
ti.dims[0] = 4;
ti.dims[1] = 244;
ti.dims[2] = 244;
ti.dims[3] = 3;
nnfw_set_input_tensorinfo(session, input_index, &ti);

// Compile model
nnfw_prepare(session);

// Prepare input. Here we just allocate dummy input arrays.
std::vector<float> input;
nnfw_input_tensorinfo(session, 0, &ti); // get first input's info
uint32_t input_elements = num_elems(&ti);
input.resize(input_elements);
// TODO: Please add initialization for your input.
nnfw_set_input(session, 0, ti.dtype, input.data(), sizeof(float) * input_elements);

// Prepare output
std::vector<float> output;
nnfw_output_tensorinfo(session, 0, &ti); // get first output's info
uint32_t output_elements = num_elems(&ti);
output.resize(output_elements);
nnfw_set_output(session, 0, ti.dtype, output.data(), sizeof(float) * output_elements);

// Do inference
nnfw_run(session);
```
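The `num_elems` helper used above is not part of the nnfw C API; a minimal sketch of what it could look like is:

```cpp
// Sketch of a num_elems helper (not part of the nnfw C API):
// multiplies all dimensions of a tensorinfo to get the element count.
static uint32_t num_elems(const nnfw_tensorinfo *ti)
{
  uint32_t n = 1;
  for (int32_t i = 0; i < ti->rank; ++i)
    n *= ti->dims[i];
  return n;
}
```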
@ragmani FYI, how to check dynamic shape inference with nnpackage_run:

```sh
adb shell /data/local/Product/out/bin/nnpackage_run /data/local/tmp/model_nnpkg --shape_run="[0,[1,64],3,[1],2,[1,64],1,[1,64,882],4,[1,64]]" --output_sizes="[2,256]" -r 1 -l /data/local/tmp/input.h5 -d /data/local/tmp/output.h5
```
I summarized my thoughts on this work. If you have questions or anything to discuss about it, please contact me or leave a comment.
TODO

- [ ] Add tests
  - [ ] Add a test script that compares the results of running `nnpackage_run(tvn)` `batch size` times with the results of running `nnpackage_run(tvn)` with `batch size` at once in a parallel manner (implemented by this work)
    - Purpose: Verification of performance and functions
    - Verification method: Comparison of all outputs, comparison of performance
  - [ ] Add tests that compare outputs of the simulator and `nnpackage_run(tvn)`. This requires installing the simulator and using a jenkins pipeline.
    - Purpose: Verification of outputs
    - Verification method: PEIR (Peak error to interval ratio)
  - [ ] Add tests that compare outputs of `nnpackage_run(circle)` and `nnpackage_run(tvn)`. This requires a circle model, and it requires that the `cpu backend` supports batch requests. (Needed if results of running with `batch size` at once (without the parallel manner) and results of running `batch size` times are not always the same.)
    - Purpose: Verification of outputs
    - Verification method: PEIR (Peak error to interval ratio)
- [ ] Support for `trix backend` - TBD
- [ ] Support for `cpu backend` (this task may not be needed) - TBD
Kinds of implementation methods

In case of supporting `trix backend` only

This does not support a partitioned model, because `batch size` is determined for the whole model while the `trix backend` covers only a part of the whole model.

- Creating an `Execution` with the model's `batch size(1)` and executing all batches within that `Execution` in a parallel manner (this is possible because the `trix backend` does not deal with internal tensors of a model)
  - Pros: Simple implementation
  - Cons: Partitioned model not supported

In case of supporting `cpu backend` as well

These can support a partitioned model. (A rough sketch of the cloning variant is given after this list.)

- Creating an `Execution` with the batch_size obtained in the prepare step and executing all batches within that `Execution` in a parallel manner
  - Pros: Performance (in case of the first execution), memory usage (in case the dynamic inferer does not support memory optimization)
  - Cons: Does not support a scenario where `batch size` changes for each execution
- Creating an `Execution` with the model's batch_size (always 1?), cloning the `Execution` as many times as `batch size`, and executing all `Execution`s, each assigned to one batch, in a parallel manner
  - Pros: Supports a scenario where `batch size` changes per execution
  - Cons: Performance (whenever `batch size` changes); I'm not sure if there is another problem
- Creating an `Execution` with the model's batch_size (always 1?), reallocating memories of internal tensors within the `Execution`, and executing that `Execution` in a parallel manner by batch
  - Pros: Supports a scenario where `batch size` changes per execution
  - Cons: Performance (whenever `batch size` changes), complicated implementation
- ...
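A minimal sketch of the cloning variant above, using a hypothetical `Execution` interface with a `clone()` method rather than onert's actual classes, just to illustrate running one single-batch execution per batch slice in parallel:

```cpp
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical single-batch execution interface for this sketch; onert's real Execution API differs.
struct Execution
{
  virtual ~Execution() = default;
  virtual void setInput(const float *data, size_t bytes) = 0;
  virtual void setOutput(float *data, size_t bytes) = 0;
  virtual void run() = 0;
  virtual std::unique_ptr<Execution> clone() const = 0;
};

// Run a multi-batch request by cloning a 1-batch Execution once per batch
// and executing the clones in parallel, one thread per batch.
void runBatched(const Execution &proto, const float *in, float *out, size_t batch,
                size_t in_elems_per_batch, size_t out_elems_per_batch)
{
  std::vector<std::unique_ptr<Execution>> execs;
  for (size_t b = 0; b < batch; ++b)
    execs.push_back(proto.clone());

  std::vector<std::thread> workers;
  for (size_t b = 0; b < batch; ++b)
  {
    workers.emplace_back([&, b] {
      execs[b]->setInput(in + b * in_elems_per_batch, in_elems_per_batch * sizeof(float));
      execs[b]->setOutput(out + b * out_elems_per_batch, out_elems_per_batch * sizeof(float));
      execs[b]->run();
    });
  }
  for (auto &w : workers)
    w.join();
}
```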
Questions or Discussion subjects

These are questions I'm curious about or matters to be discussed.
- Is there any reason to support `cpu backend` as well? My answer is "to support partitioned models as well".
- Is it okay to assume that the `batch size` of a whole model (tvn, circle) is always 1 even if we also support `cpu backend`?
- Are we sure that results of running with `batch size` at once and results of running `batch size` times are always the same?
- Should we support changing `batch size` for each execution?
- How can the user's intention to change `batch size` be distinguished? By adding a specific API? By adding a parameter to `nnfw_set_input_tensorinfo`? ...
> Is there any reason to support cpu backend as well?

For now, not yet.
IMHO, you can proceed assuming "trix backend only" as the first step.
> How can the intention of users to change batch size be distinguished? By adding a specific api? By adding a parameter of nnfw_set_input_tensorinfo?

`onert` also has to distinguish which of a model's inputs has a batch. To distinguish the batch input and the user's intention to execute a model in a parallel manner, I think we need to add an nnfw API. I suggest the way below:
```diff
 typedef struct nnfw_tensorinfo
 {
   /** The data type */
   NNFW_TYPE dtype;
   /** The number of dimensions (rank) */
   int32_t rank;
   /**
    * The dimension of tensor.
    * Maximum rank is 6 (NNFW_MAX_RANK).
    */
   int32_t dims[NNFW_MAX_RANK];
+  bool has_parallel_batches[NNFW_MAX_RANK];
 } nnfw_tensorinfo;
```
I'm not sure if this is the best way, but this way allows the user to specify which input will be executed in parallel.
@ragmani
Q. `has_parallel_batches` of [1,0,0,0] means that 4 is the batch dimension in input shape [4,224,224,3]?

> Q. has_parallel_batches of [1,0,0,0] means that 4 is the batch dimension in input shape [4,224,224,3]?

Yes, `has_parallel_batches` of [1,0,0,0] means that the 1st dimension is the batch.
This member is simple but has some implications:
- The position of the batch among the dimensions of an input.
- Inputs for which one of these dimensions is `true` will be executed in parallel.
- Models that have one or more such inputs will be executed in parallel.

If we can be sure that the position of the batch is always the 1st dimension, this member can be a single variable instead of an array.
Thinking about it again, just the following is enough.
```diff
 typedef struct nnfw_tensorinfo
 {
   /** The data type */
   NNFW_TYPE dtype;
   /** The number of dimensions (rank) */
   int32_t rank;
   /**
    * The dimension of tensor.
    * Maximum rank is 6 (NNFW_MAX_RANK).
    */
   int32_t dims[NNFW_MAX_RANK];
+  int32_t parallel_batch_dim;
 } nnfw_tensorinfo;
```
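A hypothetical usage of the proposed field (illustration only; `parallel_batch_dim` is not part of the current nnfw API):

```cpp
// Hypothetical usage of the proposed parallel_batch_dim field (not in the current nnfw API).
nnfw_tensorinfo ti;
ti.dtype = NNFW_TYPE_TENSOR_FLOAT32;
ti.rank = 4;
ti.dims[0] = 4;   // 4 batches to be executed in parallel
ti.dims[1] = 224;
ti.dims[2] = 224;
ti.dims[3] = 3;
ti.parallel_batch_dim = 0; // the 1st dimension is the batch dimension
nnfw_set_input_tensorinfo(session, 0, &ti);
```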
And I heard from @glistening that there is a way to use a `config`. We could use that way instead of adding an API.
Anyway, this suggestion does not have to be decided right now, since for now we can implement this task under the assumption that the parallel-batch position is the batch of the mv model.
I have a plan to change the member `_lowered_graph` of `exec::ExecutorBase` to a reference or pointer for this task. I think it would be better to use `shared_ptr` to allow sharing ownership between executors.
https://github.com/Samsung/ONE/blob/686670dd73a0a0cf56d84a9e25e918a570328918/runtime/onert/core/src/exec/ExecutorBase.h#L87-L90
But I'm not sure whether it will affect other tasks such as #9610 or whether there are other issues I'm not aware of.
@Samsung/one_onert please give any opinion you have on this plan.
@ragmani I think it is okay, but @hseok-oh may have something that I am missing.
> I have a plan to change the member `_lowered_graph` of `exec::ExecutorBase` to a reference or pointer for this task

IMO, a `lowered graph` should create one `executor` because the `lowered graph` includes the compile result. Could you explain why we need this change? Do you want to create multiple `executor`s from one `lowered graph`? Or do you want to maintain a reference to another `executor`'s `lowered graph` on an `executor`?
> IMO, a lowered graph should create one executor because the lowered graph includes the compile result. Could you explain why we need this change?

I have a plan to clone a one-batch executor into multiple executors that share a compiled lowered graph at execution time of `onert`. I thought it would be a waste of memory if the lowered graph were also cloned.

> Do you want to create multiple executors from one lowered graph?

Yes, I want to create multiple executors from the compiled lowered graph of the executor that is to be cloned.

> Or do you want to maintain a reference to another executor's lowered graph on an executor?

I plan to use `shared_ptr`, and ownership of the lowered graph is not going to be shared among non-cloned executors.
I don't know your big picture, but maybe you want to execute multiple executors at once. My concern is whether `lowered graph` access and executors sharing a `lowered graph` are thread-safe. Anyway, for code implementation safety, please use a const `shared_ptr` (`std::shared_ptr<const LoweredGraph>`) on `ExecutorBase` if you want to use `shared_ptr`. And delete `LoweredGraph`'s copy constructor for both `const` and non-`const`.
> I don't know your big picture, but maybe you want to execute multiple executors at once. My concern is whether lowered graph access and executors sharing a lowered graph are thread-safe. Anyway, for code implementation safety, please use a const shared_ptr (std::shared_ptr<const LoweredGraph>) on ExecutorBase if you want to use shared_ptr

OK, I will do so.

> And delete LoweredGraph's copy constructor for both const and non-const.

What is a copy constructor for non-const?

> both const and non-const

Please ignore this comment.

> And delete LoweredGraph's copy constructor

OK, I will.
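A minimal sketch of what the agreed change could look like (simplified, not onert's actual declarations):

```cpp
#include <memory>

class LoweredGraph
{
public:
  LoweredGraph() = default;
  // Copying is deleted so a compiled LoweredGraph can only be shared, never duplicated.
  LoweredGraph(const LoweredGraph &) = delete;
  LoweredGraph &operator=(const LoweredGraph &) = delete;
  // ... compile results ...
};

class ExecutorBase
{
public:
  explicit ExecutorBase(std::shared_ptr<const LoweredGraph> lowered_graph)
    : _lowered_graph(std::move(lowered_graph))
  {
  }

protected:
  // Shared, read-only view of the compile result; cloned executors share ownership.
  std::shared_ptr<const LoweredGraph> _lowered_graph;
};
```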
As I understand it, @ragmani is going to run on multiple trix cores in parallel and there is only the trix backend (https://github.com/Samsung/ONE/issues/9207#issuecomment-1208826198).
- Under this condition, there is no state in the cpu. If the input, output and temporary segments for each npu core are provided separately, there may be no shared resource.
- It would be good if we can guarantee that the lowered_graph is shared only under this condition (trix-only backend and batch request).
> It would be good if we can guarantee that the lowered_graph is shared only under this condition (trix-only backend and batch request).

`lowered_graph` of `exec::ExecutorBase` is already a completed result and no longer changes once executors based on `exec::ExecutorBase` have been created. So I think there is no problem if `_lowered_graph` of `exec::ExecutorBase` is shared among executors regardless of which backend is used.
A little thought about `Q. how to determine multi-batch input w.r.t. nnfw api ?` and https://github.com/Samsung/ONE/pull/9583.
I'd like to recognize the user's intention for batch execution without an additional feature like https://github.com/Samsung/ONE/pull/9583.
Thus, I wonder whether just invoking `nnfw_set_input_tensorinfo(session, input_index, ti);` with a larger input tensor size can imply the user's batch intention.
IMHO, it can be done by checking the input shapes of the Bulk op in the Static or Dynamic ShapeInferer.
The pro of this approach is that there is no change from the user's point of view.
But this approach assumes that Bulk's internal model only accepts static input shapes, which cannot be changed. I am not sure that this is a reasonable assumption and scenario.
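A rough sketch of that detection rule (not onert's actual ShapeInferer code; the `Shape` alias and the helper are hypothetical): the batch count is the ratio between the user-provided first dimension and the Bulk op's first dimension, and the remaining dimensions must match.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using Shape = std::vector<int32_t>; // hypothetical shape type for this sketch

// Returns the batch count if user_shape is a batched version of the 1-batch model_shape
// (same rank, same non-batch dims, dim 0 a multiple of the model's dim 0); otherwise nullopt.
std::optional<int32_t> inferParallelBatch(const Shape &model_shape, const Shape &user_shape)
{
  if (model_shape.empty() || model_shape.size() != user_shape.size())
    return std::nullopt;
  if (model_shape[0] <= 0 || user_shape[0] % model_shape[0] != 0)
    return std::nullopt;
  for (size_t i = 1; i < model_shape.size(); ++i)
    if (model_shape[i] != user_shape[i])
      return std::nullopt;
  return user_shape[0] / model_shape[0];
}
```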
> lowered_graph of exec::ExecutorBase is already a completed result and no longer changes once executors based on exec::ExecutorBase have been created. So I think there is no problem if _lowered_graph of exec::ExecutorBase is shared among executors regardless of which backend is used.

Ah, in https://github.com/Samsung/ONE/issues/9207#issuecomment-1226630652 I was confused between `ExecutorBase` and `LoweredGraph`. I misunderstood that you want to reuse `ExecutorBase`. : )
> IMHO, it can be done by checking the input shapes of the Bulk op in the Static or Dynamic ShapeInferer. The pro of this approach is that there is no change from the user's point of view. But this approach assumes that Bulk's internal model only accepts static input shapes, which cannot be changed. I am not sure that this is a reasonable assumption and scenario.

I also thought about this approach. And I thought of the approach below for this issue:
- Check whether the input size is acceptable during shape inference
  - To handle only the simple case, allow only 1 input with rank 4 and 1 output with rank 4
- Get the batch size from the input shape in the trix backend's `Bulk` op inference
- Request the trix system software `batch size` times

I don't know whether we can implement this approach in our runtime core and backend.
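A rough sketch of that idea at the backend level, with a hypothetical `runSingleBatch` function standing in for the actual request to the trix system software (the real trix-engine API is not shown here):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one inference request to the trix system software
// on the 1-batch tvn model; the real trix-engine call is intentionally not shown.
void runSingleBatch(const void *input, void *output);

// Bulk kernel sketch: split a multi-batch request into `batch` single-batch
// requests and send each one to the trix system software. These requests could
// also be issued in parallel across npu cores, as the milestone targets.
void runBulkBatched(const uint8_t *input, uint8_t *output, size_t batch,
                    size_t in_bytes_per_batch, size_t out_bytes_per_batch)
{
  for (size_t b = 0; b < batch; ++b)
  {
    runSingleBatch(input + b * in_bytes_per_batch, output + b * out_bytes_per_batch);
  }
}
```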
@chunseoklee

> A little thought about Q. how to determine multi-batch input w.r.t. nnfw api ? and https://github.com/Samsung/ONE/pull/9583. I'd like to recognize the user's intention for batch execution without an additional feature like https://github.com/Samsung/ONE/pull/9583.

In the short term, I'd like to do that too. However, in the long run, I think a new feature (an API or anything else) that recognizes the user's intention should be added. As I implicitly mentioned in https://github.com/Samsung/ONE/issues/9207#issuecomment-1217493357, there are some ambiguities in determining the intention for batch execution:
1. Which of the user's inputs contain batches?
2. Which dimension is the batch among the dimensions of a batch input?
3. If multiple inputs are affected by the batch and their sizes change, can `onert` be sure that those inputs will always be multiplied by the batch size?

Thinking about multi-batch at this point, No. 3 is an additional ambiguity.
> IMHO, it can be done by checking the input shapes of the Bulk op in the Static or Dynamic ShapeInferer. The pro of this approach is that there is no change from the user's point of view. But this approach assumes that Bulk's internal model only accepts static input shapes, which cannot be changed. I am not sure that this is a reasonable assumption and scenario.

Currently, the backends of `onert` handle everything inside an `Executor`. And the `onert` core creates executors for each `subgraph`, but a batch is assigned in units of a model. From a long-term perspective, without changing this concept, an `onert` backend cannot be sure which dimension of the used executor's inputs is the batch. So I'm trying to deal with batch execution in the `onert` core.
Just FYI.
Initially, I tried to implement this task by cloning existing executors. However, I realized that that way is difficult and complicated, so I am now trying to create executors for each batch by recompiling the executors' existing `LoweredGraph` again in `Execution`.