Magic number in example
https://github.com/ggerganov/ggml/blob/bb8d8cff851b2de6fde4904be492d39458837e1a/examples/simple/simple-ctx.cpp#L29
Can this magic number 1024 be explained, or perhaps replaced with some calculation? Does it depend on the size of the output? (I notice that if I increase the size of the input tensors, this example stops working.)
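For context, the sizing logic at that line looks roughly like this (paraphrased from memory, not an exact copy of the example; variable names may differ):

```cpp
// paraphrased sketch of the context sizing in simple-ctx.cpp
size_t ctx_size = 0;
ctx_size += rows_A * cols_A * ggml_type_size(GGML_TYPE_F32); // data of tensor a
ctx_size += rows_B * cols_B * ggml_type_size(GGML_TYPE_F32); // data of tensor b
ctx_size += 2 * ggml_tensor_overhead();                      // tensor metadata
ctx_size += ggml_graph_overhead();                           // compute graph
ctx_size += 1024;                                            // the magic number in question
```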
@FSSRepo First off, thanks for contributing this example. I just want to include you on this issue to discuss it. Do you recall why you picked 1024 for this overhead? Can we calculate it instead?
That number is a small amount of extra space for the data, since some operations require padding; it is needed when performing calculations with the context directly (without ggml-alloc, which adds that small overhead internally). As for calculating it, it is mostly a matter of trial and error: try removing it and see what happens.
I was gdb'ing last night, and I saw that when building the graph, memory for the output tensor is allocated from the context's memory pool; it happened somewhere under ggml_mul_mat(). The sizing logic doesn't account for that, correct?
If the inputs are 4096x2 and 2x4096, the output is 4096x4096, and ctx_size would not have enough space unless we account for the output tensor's size. (This case highlights how the output can be far larger than the sum of the two inputs.)
Also, do we even need to reserve space for the two inputs? They are already allocated in the example, aren't they?
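A hypothetical adjustment (my sketch, not code from the repo) would be to add terms for the result when sizing the context:

```cpp
// hypothetical: also reserve space for the result of ggml_mul_mat(A, B),
// which in this example has rows_A * rows_B elements
ctx_size += rows_A * rows_B * ggml_type_size(GGML_TYPE_F32); // result data
ctx_size += ggml_tensor_overhead();                          // result metadata
```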
You're right, that 1024 should be the size of the output tensor data. Honestly, I'm not sure how to calculate it correctly before creating the context. @slaren Any idea on how to calculate the compute buffer size before creating the compute graph with the legacy API?
The maximum memory buffer in the gpt-2 example is 256 MB:
https://github.com/ggerganov/ggml/blob/98875cdb7e9ceeb726d1c196d2fecb3cbb59b93a/examples/gpt-2/main-ctx.cpp#L409-L429
You would have to pad the size of the tensor to the alignment value. My recommendation is to use ggml-alloc for compute buffers and ggml_backend_alloc_ctx_tensors for static tensor buffers, and let it do it for you.
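A minimal sketch of that recommendation, assuming the ggml-backend / ggml-alloc APIs as they exist in recent ggml (exact names and signatures may vary between versions):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// the context holds only tensor/graph metadata; no_alloc keeps data out of its pool
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * 16 + ggml_graph_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
struct ggml_context * ctx = ggml_init(params);

// ... create tensors a and b, and build the graph gf with ggml_mul_mat ...

// static tensors go into a backend buffer; sizes are padded to the buffer
// alignment internally, so no manual "+ 1024" is needed
ggml_backend_t backend = ggml_backend_cpu_init();
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

// ggml-alloc measures the compute buffer from the graph and allocates it
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(galloc, gf);
```

I believe this is essentially what the backend variant of the simple example already does.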
Tangentially, I also wanted to profile the matrix multiplication. I put a loop (1000 iterations) and timers around this line: https://github.com/ggerganov/ggml/blob/bb8d8cff851b2de6fde4904be492d39458837e1a/examples/simple/simple-ctx.cpp#L66
Again, I see the context running out of memory. How could this example be modified to run iteratively?
ggml_graph_compute_with_ctx uses the context buffer to allocate a work buffer. Calling it repeatedly allocates a new work buffer on every iteration, until the context runs out of memory. It is also not a good way to test the performance of an operation, since it includes other overheads such as starting the threads. test-backend-ops has an option to test the performance of individual ops.
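For the iterative profiling question specifically, here is a sketch that avoids the repeated allocation, assuming the legacy ggml_graph_plan / ggml_graph_compute API (the plan function's signature varies between ggml versions):

```cpp
#include <cstdint>
#include <vector>

// plan the graph once; gf is the graph built earlier in the example
struct ggml_cplan plan = ggml_graph_plan(gf, /*n_threads =*/ 4);

// allocate the work buffer once, outside the loop, instead of from the context
std::vector<uint8_t> work(plan.work_size);
plan.work_data = work.data();

for (int i = 0; i < 1000; ++i) {
    ggml_graph_compute(gf, &plan); // reuses the same work buffer each iteration
}
```

Note that thread startup still happens inside each ggml_graph_compute call, so per the point above, test-backend-ops remains the better tool for op-level benchmarks.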