
Batch inference

Open monatis opened this issue 2 years ago • 6 comments

I am implementing batch inference support in monatis/clip.cpp#30. This is still WIP but almost there. I removed the ne[3] == 1 asserts and generalized Conv2D to an N-batched src1. I also included #224 in this PR for batched matrix multiplication. I believe the only op left to make batch-aware is normalization --I hope to solve it very soon.

Hopefully it can serve as a playground for testing batch inference.

monatis avatar Jun 30 '23 20:06 monatis

I'm surprised by ggml_norm. It works over the feature dimension (ne00) as it should, but it gives a different result for the second of two identical samples in a batch. For example, the sum of the normed tensor for a single sample is 0.000058, but the sum is 0.000089 when I add the same sample twice in a batch, whereas it should have been 0.000058 * 2 = 0.000116. Similarly, the output vector for the single-sample batch and output vector 0 of the two-sample batch are exactly the same, but output vector 1 is different even though the input sample is the same. The only place where they seem to diverge is ggml_norm, but I couldn't see why.

monatis avatar Jul 02 '23 10:07 monatis

For example, the sum of the normed tensor of a single sample is 0.000058, but the sum is 0.000089 when I add the same sample twice in a batch, while it should have been 0.000058 * 2 = 0.000116

Could be round-off errors. Are you accumulating the sum into a double?

ggerganov avatar Jul 02 '23 18:07 ggerganov

Are you accumulating the sum into a double?

Yes, it's just for debugging, so I accumulate the sum into a double. And the actual issue is that the output vectors for identical images batched together are different; the sum is just a proxy for finding where they start to diverge. The first place where the sum over the two-sample batch is not twice the sum over the single-sample batch is right after the first ggml_norm op. Before that, the modified conv_2d_sk_p0 with batch support, the concat op with ggml_acc, and adding positional embeddings with ggml_add all produce the expected values. But I can see no reason why ggml_norm would cause this. I will revisit the other ops prior to ggml_norm with a fresh mind if we don't find a better clue.

monatis avatar Jul 02 '23 18:07 monatis

Yes, it turned out that I calculated a wrong value for the offset added to the wdata pointer in conv_2d_sk_p0. So even though input 0 produced a correct output, input 1 produced a wrong one. And a note to future me: the sum of a tensor is not always reliable for debugging :)

Now I can confirm the batch inference works correctly. I'll solve conflicts and then mark this ready for review.

monatis avatar Jul 03 '23 15:07 monatis

This is ready for review now. Working usage is in monatis/clip.cpp#30

As a future work, I'll implement batch inference for text encoding in clip.cpp, which will require some extra work for attention masking to handle inputs with varying lengths, and probably generalization of ggml_get_rows to get position embeddings, so I'll leave it to another PR.

CC @ggerganov

monatis avatar Jul 03 '23 21:07 monatis

Great!

Will come back to reviewing this soon. I need to merge some refactoring PRs first and sync things with llama.cpp. I also need to check whether the proposed change for mul mat broadcast in https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1606377522 is the correct way to broadcast, and apply the changes if it is.

ggerganov avatar Jul 04 '23 18:07 ggerganov

Fixed the conflicts after syncing with llama.cpp yesterday.

monatis avatar Jul 06 '23 11:07 monatis

@monatis do you have any benchmarks comparing performance with and without batch inference, by any chance?

okpatil4u avatar Jul 11 '23 09:07 okpatil4u

@monatis

I'll probably take a look tomorrow and merge if everything is good

ggerganov avatar Jul 11 '23 18:07 ggerganov

@okpatil4u I updated the benchmark utility to make use of the proposed batch inference in monatis/clip.cpp#30 and got a ~1.4x speedup in per-image inference time compared to the main branch (from 190 ms down to 136 ms per image, averaged over 10k images). And I can see numerous other places where we can improve it further soon. This is exactly what this PR is for --we'll have a playground where we can test and benchmark ideas for batch inference and hopefully improve performance further.

@ggerganov

I'll probably take a look tomorrow and merge if everything is good

Sounds great. We have conflicts again, I'll fix them tomorrow.

monatis avatar Jul 12 '23 00:07 monatis

@monatis thanks. This is promising. Was the batch size chosen automatically, or does the user have to define it?

okpatil4u avatar Jul 12 '23 07:07 okpatil4u

@okpatil4u currently the signature of the batch encoding function is bool clip_image_batch_encode(const clip_ctx *ctx, int n_threads, const std::vector<clip_image_f32> &imgs, float *vec), and the batch size is essentially imgs.size(), i.e., it processes all the images in imgs as a single batch. We could improve it further by implementing the function so it can process N mini-batches of size B from an N * B-sized std::vector of images to reuse the graph, but I'll leave that experiment to a future PR.

monatis avatar Jul 12 '23 12:07 monatis

@ggerganov conflicts fixed and tests passed. Working usage is in the benchmark binary: https://github.com/monatis/clip.cpp/blob/8c09113e154b8f3a589a47d5780a19e4546c227a/tests/benchmark.cpp#L122-L126

monatis avatar Jul 12 '23 12:07 monatis