
batch inference in partial model graph

Open · nihui opened this issue 4 months ago • 3 comments

During model inference, some operators/subgraphs require multiple inferences using batches.

Here, a GRU is used to represent the batched portion.

Often, there are 'across-batch-axis' reshape operations before and after the subgraph.

Input     in0         0 1 in0               \
.....                                        batch=1
XXX       xxx         1 1 xxx a             /
Reshape   reshape_a   1 1 a batch_a        --> the magic that creates the batch
GRU       batch_gru0  1 1 batch_a t0        \
GRU       batch_gru1  1 1 t0 t1              batch=N
GRU       batch_gru2  1 1 t1 batch_b        /
Reshape   reshape_b   1 1 batch_b b        --> the magic that restores batch=1
YYY       yyy         1 1 b yyy             \
.....                                        batch=1
Softmax   out0        1 1 z out0            /
The corresponding manual workflow in user code (ex is an extractor created from the loaded ncnn::Net; split_batch and merge_batch are user helpers, not ncnn APIs):

ex.input("in0", in0);

// partial inference until a
ncnn::Mat a;
ex.extract("a", a);

// split a into batches
std::vector<ncnn::Mat> a_chunks = split_batch(a);

// partial subgraph batch inference
// note: use a fresh extractor per chunk, since an extractor
// caches extracted blobs and would otherwise return stale results
std::vector<ncnn::Mat> b_chunks(a_chunks.size());
for (size_t i = 0; i < a_chunks.size(); i++)
{
    ncnn::Extractor ex2 = net.create_extractor();
    ex2.input("batch_a", a_chunks[i]);
    ex2.extract("batch_b", b_chunks[i]);
}

// merge outputs into b
ncnn::Mat b = merge_batch(b_chunks);

// partial inference from b
ex.input("b", b);

ncnn::Mat out0;
ex.extract("out0", out0);
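The split_batch and merge_batch helpers above are not ncnn APIs. A minimal sketch of what they might do, written over plain std::vector<float> buffers instead of ncnn::Mat, assuming the batch axis is the outermost, contiguous axis of a row-major tensor:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: slice a flat row-major buffer into `batch`
// contiguous chunks along the outermost axis.
std::vector<std::vector<float> > split_batch(const std::vector<float>& a, size_t batch)
{
    assert(batch > 0 && a.size() % batch == 0);
    const size_t chunk_size = a.size() / batch;
    std::vector<std::vector<float> > chunks(batch);
    for (size_t i = 0; i < batch; i++)
        chunks[i].assign(a.begin() + i * chunk_size, a.begin() + (i + 1) * chunk_size);
    return chunks;
}

// Hypothetical helper: concatenate the chunks back in order.
std::vector<float> merge_batch(const std::vector<std::vector<float> >& chunks)
{
    std::vector<float> b;
    for (size_t i = 0; i < chunks.size(); i++)
        b.insert(b.end(), chunks[i].begin(), chunks[i].end());
    return b;
}
```

With ncnn::Mat, the same slicing could be done with Mat::channel_range or by constructing Mats over sub-regions, but the index arithmetic is the same.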

nihui, Oct 10 '25 09:10

ncnn supports partial graph inference by extracting intermediate blobs and resuming from those points, and batch inference by running multiple extractors in parallel, each processing one batch independently. However, for workflows like yours (splitting an intermediate tensor, running batch inference on a subgraph such as a GRU, and merging the outputs), there is no built-in high-level API; this is typically handled manually in user code with ncnn::Mat tensor operations and the Extractor API. The reshape operations before and after such subgraphs are usually managed at model conversion time (e.g., via pnnx passes), not at runtime, but you can use Mat's reshape and utility functions for manual batch manipulation as needed (see the FAQ example and the Mat implementation).

Recent updates added robust support for ONNX RNNs (including GRU) and improved handling of reshape operations with batch indices, so exporting complex models with batch/reshape logic from PyTorch and converting with pnnx is well supported (see the GRU support and reshape handling PRs). Your manual workflow (split, subgraph inference, merge) is plausible and aligns with ncnn's modular design, but it is not documented as an official pattern. If you need more efficient or alternative approaches, consider automating reshape insertion at conversion time or leveraging parallel extractors for batch processing.
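The "parallel extractors" suggestion can be sketched with one worker per chunk; in real code each worker would call net.create_extractor() so extractors do not share cached blobs. The per-chunk subgraph here is a stand-in scale-by-2 so the sketch compiles without ncnn, and all names are illustrative:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// One worker thread per batch chunk. Each worker writes to its own slot of
// b_chunks, so no synchronization beyond join() is needed.
std::vector<std::vector<float> > run_chunks_parallel(const std::vector<std::vector<float> >& a_chunks)
{
    std::vector<std::vector<float> > b_chunks(a_chunks.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < a_chunks.size(); i++)
    {
        workers.emplace_back([&, i]() {
            // placeholder for:
            //   ncnn::Extractor ex = net.create_extractor();
            //   ex.input("batch_a", a_chunks[i]);
            //   ex.extract("batch_b", b_chunks[i]);
            std::vector<float> out(a_chunks[i]);
            for (size_t j = 0; j < out.size(); j++)
                out[j] *= 2.f; // stand-in for the GRU subgraph
            b_chunks[i] = out;
        });
    }
    for (size_t i = 0; i < workers.size(); i++)
        workers[i].join();
    return b_chunks;
}
```

Note that real ncnn extractors may already use an internal thread pool per layer, so thread-level chunk parallelism should be benchmarked against sequential per-chunk inference.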




dosubot[bot], Oct 10 '25 09:10

FROM Gemini 2.5 Pro:

This issue discusses how to perform batch inference on only a specific part (a subgraph) of a larger model graph within the ncnn framework.

The core idea is that during model inference, some parts of the network (like a GRU, as used in the example) need to be run multiple times with batches of data, while the rest of the model runs only once.

Here's a breakdown of the process described:

  1. Create Batch: Before the subgraph that requires batching, a Reshape operation is used to transform the data from a single item (batch=1) into a batch of N items (batch=N).
  2. Partial Batch Inference: The subgraph (e.g., the GRU layers) is then executed on this batch of data.
  3. Restore Batch: After the subgraph completes, another Reshape operation merges the batched output back into a single data stream (batch=1) for the rest of the model to process.
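Why the Reshape steps are "magic" but cheap: for a contiguous row-major buffer, reinterpreting shape (N*T, C) as (N, T, C) moves no data, since element (n, t, c) of the batched view sits at the same flat offset as row n*T + t of the unbatched view. A tiny index-arithmetic illustration (plain C++, not ncnn code; names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Flat offset of element (row, c) in the unbatched (N*T, C) view.
size_t idx_2d(size_t row, size_t c, size_t C)
{
    return row * C + c;
}

// Flat offset of element (n, t, c) in the batched (N, T, C) view.
size_t idx_3d(size_t n, size_t t, size_t c, size_t T, size_t C)
{
    return (n * T + t) * C + c;
}
```

Both formulas address the same byte, which is why the batch-creating and batch-restoring reshapes need no copies when the data is contiguous.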

The C++ code snippet illustrates a manual way to achieve this:

  1. Run the inference up to the point just before the batched subgraph (ex.extract("a", a)).
  2. Manually split the intermediate result a into several chunks (a_chunks).
  3. Loop through each chunk, feeding it into the subgraph and extracting the result (ex.input("batch_a", ...) and ex.extract("batch_b", ...)).
  4. Merge the results from all the chunks back into a single tensor (b).
  5. Input this merged tensor b back into the model to continue the inference for the remaining layers.
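The five steps can be strung together end to end. This sketch substitutes placeholder stages for the real ncnn layers (identity pre/post graphs, a +1 subgraph), so every stage name is illustrative rather than an ncnn API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for the three graph sections of the issue's example.
std::vector<float> pre(const std::vector<float>& in0)      // graph up to blob "a"
{
    return in0;
}
std::vector<float> subgraph(const std::vector<float>& chunk) // the batched GRU part
{
    std::vector<float> out(chunk);
    for (size_t j = 0; j < out.size(); j++)
        out[j] += 1.f;
    return out;
}
std::vector<float> post(const std::vector<float>& b)       // graph from "b" to "out0"
{
    return b;
}

std::vector<float> run(const std::vector<float>& in0, size_t batch)
{
    std::vector<float> a = pre(in0);                 // 1. run up to "a"
    const size_t chunk_size = a.size() / batch;
    std::vector<float> b;
    for (size_t i = 0; i < batch; i++)               // 2-3. split and run each chunk
    {
        std::vector<float> chunk(a.begin() + i * chunk_size,
                                 a.begin() + (i + 1) * chunk_size);
        std::vector<float> out = subgraph(chunk);
        b.insert(b.end(), out.begin(), out.end());   // 4. merge into "b"
    }
    return post(b);                                  // 5. continue to "out0"
}
```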

In essence, the issue raises the need for a more streamlined way to handle models where only certain sections require batch processing, and it proposes a manual, code-driven solution for how it could be done.

nihui, Oct 10 '25 09:10

好麻烦.jpg ("what a hassle")

nihui, Oct 10 '25 10:10