Magic number in example
https://github.com/ggerganov/ggml/blob/bb8d8cff851b2de6fde4904be492d39458837e1a/examples/simple/simple-ctx.cpp#L29
Can this magic number 1024 be explained, or perhaps replaced with some calculation? Does it depend on the size of the output? (I notice that if I increase the size of the input tensors, this example stops working.)
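For context, the sizing logic at that line looks roughly like this (paraphrased from memory, not an exact copy of the example; variable names may differ):

```cpp
// paraphrased sketch of the context sizing in simple-ctx.cpp
size_t ctx_size = 0;
ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // data of tensor a
ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // data of tensor b
ctx_size += 2 * ggml_tensor_overhead();                      // tensor metadata
ctx_size += ggml_graph_overhead();                           // compute graph
ctx_size += 1024;                                            // the magic number in question
```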
@FSSRepo First off, thanks for contributing this example. I just want to include you on this issue to discuss it. Do you recall why you picked 1024 for this overhead? Can we calculate it instead?
That number is a small amount of extra space for the data, since some operations require padding; it is needed when performing calculations with the context directly (without ggml-alloc, which adds that small overhead internally). As for calculating it, it is mostly a matter of trial and error: try removing it and see what happens.
I was gdb'ing last night, and I saw that when building the graph, memory for the output tensor is allocated from the context's memory pool; it happened somewhere under ggml_mul_mat(). The sizing logic doesn't account for that, correct?
If the inputs are 4096x2 and 2x4096, the output is 4096x4096, and ctx_size would not have enough space unless we account for the output tensor's size. (This case highlights how the output can be far larger than the sum of the two inputs.)
Also, do we even need to reserve space for the two inputs? They are already allocated in the example, aren't they?
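A hypothetical adjustment (my sketch, not code from the repo) would be to add terms for the result when sizing the context:

```cpp
// hypothetical: also reserve space for the result of ggml_mul_mat(A, B),
// which in this example has rows_A * rows_B elements
ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result data
ctx_size += ggml_tensor_overhead();                          // result metadata
```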
You're right, that 1024 should be the size of the output tensor data. Honestly, I'm not sure how to calculate it correctly before creating the context. @slaren Any idea on how to calculate the compute buffer size before creating the compute graph with the legacy API?
The maximum memory buffer in the gpt-2 example is 256 MB:
https://github.com/ggerganov/ggml/blob/98875cdb7e9ceeb726d1c196d2fecb3cbb59b93a/examples/gpt-2/main-ctx.cpp#L409-L429
You would have to pad the size of the tensor to the alignment value. My recommendation is to use ggml-alloc for compute buffers and ggml_backend_alloc_ctx_tensors for static tensor buffers, and let it do it for you.
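A minimal sketch of that recommendation, assuming the ggml-backend / ggml-alloc APIs as they exist in recent ggml (exact names and signatures may vary between versions):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// the context holds only tensor/graph metadata; no_alloc keeps data out of its pool
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * 16 + ggml_graph_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
struct ggml_context * ctx = ggml_init(params);

// ... create tensors a and b, and build the graph gf with ggml_mul_mat ...

// static tensors go into a backend buffer; sizes are padded to the buffer
// alignment internally, so no manual "+ 1024" is needed
ggml_backend_t backend = ggml_backend_cpu_init();
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

// ggml-alloc measures the compute buffer from the graph and allocates it
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(galloc, gf);
```

I believe this is essentially what the backend variant of the simple example already does.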
Tangentially, I also wanted to profile the matrix multiplication. I put a loop (1000 iterations) and timers around this line: https://github.com/ggerganov/ggml/blob/bb8d8cff851b2de6fde4904be492d39458837e1a/examples/simple/simple-ctx.cpp#L66
Again, I see the context running out of memory. How could this example be modified to run iteratively?
ggml_graph_compute_with_ctx uses the context buffer to allocate a work buffer. Calling it repeatedly allocates a new work buffer on every iteration, until the context runs out of memory. It is also not a good way to test the performance of an operation, since it includes other overheads such as starting the threads. test-backend-ops has an option to test the performance of individual ops.
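For the iterative profiling question specifically, here is a sketch that avoids the repeated allocation, assuming the legacy ggml_graph_plan / ggml_graph_compute API (the plan function's signature varies between ggml versions):

```cpp
#include <cstdint>
#include <vector>

// plan the graph once; gf is the graph built earlier in the example
struct ggml_cplan plan = ggml_graph_plan(gf, /*n_threads =*/ 4);

// allocate the work buffer once, outside the loop, instead of from the context
std::vector<uint8_t> work(plan.work_size);
plan.work_data = work.data();

for (int i = 0; i < 1000; ++i) {
    ggml_graph_compute(gf, &plan); // reuses the same work buffer each iteration
}
```

Note that thread startup still happens inside each ggml_graph_compute call, so per the point above, test-backend-ops remains the better tool for op-level benchmarks.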