ggml_graph_compute: deprecate using ggml_context, try to resolve issue #287
Try to resolve https://github.com/ggerganov/ggml/issues/287
Intro
The design is a bit different from the suggested one: the buffer type is named as a generalized one, ggml_cgraph_context:
```c
struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    bool   planned; // true means ready to compute graph nodes
};
```
I'll explain planned later; let's focus on the APIs first.
- The first parameter ctx of ggml_graph_compute() is deprecated (passing NULL is OK).
- Removed wsize and work from ggml_cgraph. This is unlikely to break external users, because there is no reason to use them directly.
- To avoid breaking external users, we cannot simply change the signature of ggml_graph_compute(), so ggml_graph_compute_v2() had to be added. The name looks weird, but that is the reality :/ Now ggml_graph_compute() is a thin wrapper around ggml_graph_compute_v2().
- Extracted code into a new function ggml_graph_compute_plan(), which sets node->n_tasks and calculates work_size.
Usage
```c
struct ggml_cgraph_context ctx = {0};

// case 1: explicit plan; the caller allocates the work buffer
ggml_graph_compute_plan(&ctx, cgraph);
if (ctx.work_size > 0) {
    ctx.work_data = malloc(ctx.work_size);
}
ggml_graph_compute_v2(&ctx, cgraph);

// case 2: un-planned (zero-initialized) context; compute plans internally
ggml_graph_compute_v2(&ctx, cgraph);

// case 3: NULL context; compute manages planning and the buffer itself
ggml_graph_compute_v2(NULL, cgraph);

// case 4: legacy wrapper, gctx is a struct ggml_context *
ggml_graph_compute(gctx, cgraph);

// case 5: legacy wrapper with a NULL context
ggml_graph_compute(NULL, cgraph);
```
Why did I add the field planned?
Because:
- The ctx is allowed to be NULL, empty, or initialized by ggml_graph_compute_plan().
- One of the corner cases: work_size and work_data can keep their default values even after the plan has been run, so we cannot determine whether ggml_graph_compute_plan() was called simply by checking for default values. And ggml_graph_compute_plan() MUST be called, because it also sets node->n_tasks; work_size depends on n_tasks.

The planned flag makes the plan-compute sequence stateful, which is not good enough. Any ideas? The sketch below makes the corner case concrete.
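A minimal illustration (assuming the struct and APIs above):

```cpp
struct ggml_cgraph_context ctx = {0};  // work_size == 0, work_data == NULL
ggml_graph_compute_plan(&ctx, cgraph); // a small graph can legitimately leave
                                       // work_size == 0 and work_data == NULL
// compute now sees the same field values as an un-planned context, so it
// cannot tell "planned, no work buffer needed" apart from "never planned";
// ctx.planned disambiguates, and node->n_tasks has been set either way
```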
> ggml_graph_compute_plan() MUST be called because it also sets node->n_tasks. The work_size depends on n_tasks.
I think that n_tasks should be removed from ggml_tensor. For now, the easiest way to address this may be to move it to the new ggml_cgraph_context, but eventually I think this is something that ggml_graph_compute should be able to determine automatically when executing a node.
Good job!
> I think that n_tasks should be removed from ggml_tensor.
Of course; I think n_tasks should belong to the compute facility, so ideally it would migrate somewhere else.
So: create another data structure, for example tensor_compute_config, with a 1:1 mapping to graph nodes; place those inside ggml_cgraph_context; and while computing nodes, set params.nth to the n_tasks from the sibling i-th config object. Is that right?
> Of course; I think n_tasks should belong to the compute facility, so ideally it would migrate somewhere else.
Yes precisely! That's what I was thinking as well.
I am not sure that we need a new struct for this; is there any other data from the tensor that would belong there? I am thinking that adding a simple int n_tasks[GGML_MAX_NODES] to ggml_cgraph_context with a 1:1 mapping to the graph nodes should do it. It is not great, but for now it will solve the problem, and eventually I think that we can remove it and have ggml_graph_compute determine this automatically instead.
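Presumably something like this sketch (only the array placement is new relative to the struct above):

```cpp
#include <cstddef>

#define GGML_MAX_NODES 4096 // from ggml.h; value illustrative

// per-node task counts move out of ggml_tensor and into the compute
// context, mapped 1:1 to cgraph->nodes
struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    bool   planned;
    int    n_tasks[GGML_MAX_NODES]; // n_tasks[i] applies to cgraph->nodes[i]
};
```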
> is there any other data from the tensor that would belong there?

No, nothing else.
> adding a simple int n_tasks[GGML_MAX_NODES]
Great! The n_tasks array is good enough.
> and eventually I think that we can remove it and have ggml_graph_compute determine this automatically instead.
That will be a long story. I once tried adding another enum value to ggml_task_type named GGML_TASK_CONFIG: ggml_graph_compute() ran the planning before computing by calling ggml_graph_forward() for every node, similar to the planning idea in this PR. At that time I preferred this kind of design because the computing code knows exactly what it can provide (parallelism?) and what it requires (wsize?). That is easy to maintain: just glance at the actual code, with fewer chances to misconfigure.
I think ggml_graph_compute() may not need to hard-code everything in the long run; it would be better to build some kind of new protocol/contract between the scheduler and the actual computing code. To achieve this, we could construct a slightly abstract data structure on top of {node, params}, with methods such as config() and compute(). With this design, we could create and register several compute vendors (with an id/name, for example) for every kind of node, which would provide great flexibility for evaluating experimental algorithms and benchmarking. With a compute vendor id, we would hopefully be able to define a compute plan offline (or, if that is unnecessary, at least specify certain vendors); this is similar to the dumped cgraph. A rough sketch of such a vendor contract is below.
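A hedged sketch of what such a contract might look like; all names here are hypothetical, nothing like this exists in ggml at this point:

```cpp
#include <cstddef>

struct ggml_tensor;         // opaque, from ggml.h
struct ggml_compute_params; // opaque, from ggml.h

// hypothetical compute-vendor contract
struct ggml_compute_vendor {
    const char * name; // vendor id, e.g. "cpu-default", "cpu-experimental"

    // planning: report parallelism and required work size for this node
    void (*config)(const ggml_tensor * node, int * n_tasks, size_t * wsize);

    // execution: compute the node with the given params
    void (*compute)(ggml_compute_params * params, ggml_tensor * node);
};

// the scheduler would look up a vendor per node, roughly:
//   const ggml_compute_vendor * v = vendor_find(node_op, "cpu-default");
//   v->config(node, &n_tasks, &wsize); // during planning
//   v->compute(params, node);          // during compute
```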
Just some random thoughts :D
That's very similar to what I have been thinking. I am working on a CUDA implementation that can execute ggml_cgraphs directly, and what it needs to do that is very similar to this: a method to "prepare" or "plan" the graph, and another to execute the "prepared" graph. Eventually, we will want a common interface to all compute backends so that users can use any of them interchangeably, without having to add specific code for each one.
While doing this, I think that we should also remove n_threads from ggml_cgraph, since this parameter is only relevant to the CPU compute backend and not applicable to the other compute backends.
> Eventually, we will want a common interface to all compute backends so that users can use any of them interchangeably, without having to add specific code for each one.
Yes, I have seen the changes over the last two months; they keep getting better. Both the CUDA and CL implementations impose potential decoupling requirements.
> While doing this, I think that we should also remove n_threads from ggml_cgraph, since this parameter is only relevant to the CPU compute backend and not applicable to the other compute backends.
Sure, similar to n_tasks in the tensor as well, but I have no idea where to put it at compute time. Hello, ggml_cgraph_context?
At present I'm not sure, but it looks like n_threads is still required for cross-cgraph computing, especially when a node runner is allowed to hijack the computing flow or fall back to the default vendor (the CUDA way). In a cross-cgraph context, should n_threads or a thread_ctx be owned by something like a compute session?
I think it's OK to leave n_threads there for a while; GPU code can simply ignore it, unless the GPU implementation has its own complete cgraph that is not shared with the CPU, but that is yet another long story.
What I was thinking is that n_threads could be a parameter to ggml_graph_compute_plan, and it would also be stored in ggml_cgraph_context for use by ggml_graph_compute (sketched below).
For now, the CUDA runner can still operate in the same way, unaffected by this, but in the long run this method of hijacking the compute flow should be removed. Mixed CPU/GPU execution can be achieved in other ways, such as by splitting the computation into multiple graphs, each of them executed by a different backend.
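A sketch of that shape, using the names from this PR at the time; the exact layout and signature are assumptions:

```cpp
#include <cstddef>

struct ggml_cgraph; // opaque, from ggml.h

// n_threads becomes an input to planning and is carried in the context,
// where ggml_graph_compute reads it
struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    int    n_threads; // set by the plan, consumed by compute
};

void ggml_graph_compute_plan(ggml_cgraph_context * ctx,
                             ggml_cgraph * cgraph,
                             int n_threads);
```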
I like the idea. However, I am not in favor of the implementation:
- We should not rely on consumer behavior for correctness ("don't touch the plan"). We should either not expose cgraph_compute_context at all, or maintain the state elsewhere.
- A common API pattern for an externally provided buffer is: pass NULL as buf together with a buffer size; the API returns the size it expects without doing the real computation. If a valid pointer is passed but the size is less than expected, the function fails in the same way (see the sketch after this list).
- plan is not a clear way to indicate that this function needs to be called before the compute API. Having compute_v2 take the result of the plan would make the API explicit. Also, maybe create_compute_context is a better API name; plan is too general.
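A self-contained illustration of that buffer-size pattern; all names here are hypothetical, not part of ggml:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// the callee reports the size it needs when given no buffer (or one that is
// too small), and only computes when a large enough buffer is supplied
static bool compute_with_buffer(void * buf, size_t * size) {
    const size_t required = 1024; // stand-in for the computed work_size
    if (buf == nullptr || *size < required) {
        *size = required; // tell the caller how much is needed
        return false;
    }
    // ... real computation using buf would go here ...
    return true;
}

int main() {
    size_t size = 0;
    compute_with_buffer(nullptr, &size);              // first call: query size
    std::vector<char> buf(size);
    bool ok = compute_with_buffer(buf.data(), &size); // second call: compute
    std::printf("computed: %s (buffer of %zu bytes)\n", ok ? "yes" : "no", size);
}
```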
> I am not in favor of the implementation
Yes, those are all good points. A general context is used for future extensibility, but it should be seen as over-design. If you read the previous comments, you would reach the conclusion that a general context tends to be "extended" into a catch-all basket.
So, if we can't tell what we want beyond the current requirements, a specialized API is the best choice. Anyway, I don't think plan is a bad idea; the actual problem is that we haven't decoupled calculating wsize from setting n_tasks. Perhaps we should reconsider this design after splitting out that part of the code.
> I think that n_tasks should be removed from ggml_tensor
Agree
Good points by @howard0su
Overall, I think this is on the right track and we should finalize the PR and merge it.
There is no need to worry about backwards compatibility (i.e. the "_v2" stuff is not necessary). Just assume that we have only just started developing the API and that nobody has been using it or expecting it to be compatible. We will start to worry about this kind of thing after the v1.0 release, sometime in the future.
> There is no need to worry about backwards compatibility

This is great. I'll rewrite this PR and show you later.
The PR description was updated. You may want to have a look at it. Apart from main and perplexity, I also tested:
- tests/test-grad0
- tests/test-opt
- examples/benchmark
- examples/baby-llama
- examples/train-text-from-scratch (force-stopped at Example 2, opt iter 29)
No crash, all of them output reasonable text.
So this PR is ready for review again.
Looks good, I only have a few minor nits:
- In llama.cpp, to avoid allocations in every eval, the work buffer memory could be stored as a std::vector in llama_context. Just resize it to the work_size, and if it is already big enough, it won't reallocate memory (see the sketch after this list).
- More generally, in all the C++ code, I would suggest always using std::vector instead of malloc/free.
- I am not sure about ggml_graph_compute_sugar.
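A minimal sketch of the suggested buffer reuse; the holder struct and member names are stand-ins, not the actual llama_context:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// hypothetical holder standing in for llama_context
struct eval_state {
    std::vector<uint8_t> work_buffer;
};

// called on every eval: resize() never shrinks capacity, so once the buffer
// has grown to the largest work_size seen, later evals reuse the allocation
static void * get_work_data(eval_state & st, size_t work_size) {
    st.work_buffer.resize(work_size);
    return work_size > 0 ? st.work_buffer.data() : nullptr;
}

int main() {
    eval_state st;
    void * a = get_work_data(st, 1024); // allocates
    void * b = get_work_data(st, 512);  // reuses the existing allocation
    (void)a; (void)b;
}
```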
Thanks @slaren
> - In llama.cpp, to avoid allocations in every eval, the work buffer memory could be stored as a std::vector in llama_context. Just resize it to the work_size, and if it is already big enough, it won't reallocate memory.
> - More generally, in all the C++ code, I would suggest always using std::vector instead of malloc/free.
Makes sense; this is also one of my concerns. I'll try the std::vector way.
> I am not sure about ggml_graph_compute_sugar
The sugar is used by baby-llama, which calls ggml_graph_compute again and again. I suggest we defer it to the next PR.
https://github.com/ggerganov/llama.cpp/pull/1999/commits/bf63002af94bc8fe39bca1a44e1344fad420f96b is the latest commit, per @slaren's suggestion.
TODO: the sugar used by baby-llama (it calls ggml_graph_compute again and again) is deferred to the next PR.
Rebased onto master.
rebased
I've refactored the changes:
- added ggml_graph_compute_with_ctx(), similar to the old way of graph computation
- renamed struct ggml_graph_compute_plan to struct ggml_cplan
- renamed ggml_graph_compute_make_plan() to ggml_graph_plan()
- factored the common "plan + compute" code into a helper function ggml_graph_compute_helper() (sketched below)
- enabled the test-grad0 test
- fixed the Metal build
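For reference, the factored helper presumably looks something like this sketch, assuming the renamed API above (struct ggml_cplan, ggml_graph_plan, ggml_graph_compute); details may differ from the actual ggml_graph_compute_helper():

```cpp
#include <cstdint>
#include <vector>
#include "ggml.h"

// plan, size the work buffer, attach it, then compute
static void graph_compute_helper(std::vector<uint8_t> & buf,
                                 struct ggml_cgraph * graph, int n_threads) {
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
    if (plan.work_size > 0) {
        buf.resize(plan.work_size);
        plan.work_data = buf.data();
    }
    ggml_graph_compute(graph, &plan);
}
```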
Overall the change is good.
A positive side-effect is that the user can now control the number of tasks for each op. This could also be used when creating custom operators, if we expose struct ggml_compute_params to the custom function signatures. A sketch of this is below.
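A hedged sketch of that control with the renamed API; the node index i and the single-threaded choice are illustrative, and, as noted further down in the thread, increasing n_tasks after planning risks overflowing the work buffer, so forcing fewer tasks is the safe direction:

```cpp
// after planning, override the task count of one node before computing
struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
plan.n_tasks[i] = 1; // e.g. run custom-op node i single-threaded
ggml_graph_compute(graph, &plan);
```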
@mqy @slaren Please take a look
> renamed ggml_graph_compute_make_plan() to ggml_graph_plan()
I would suggest ggml_cplan_make(): both short, as intended, and consistent with the struct naming.
I think this looks good.
> A positive side-effect is that the user can now control the number of tasks for each op. This could also be used when creating custom operators, if we expose struct ggml_compute_params to the custom function signatures.
Eventually we should support custom ops on the GPU backends too. It will be slow since it will require copying data back and forth from the device, but that's still a lot better than not supporting them at all. So whenever we do this, I think we should do it in a backend-agnostic way, rather than tying it to the CPU backend.
> I've refactored the changes:
Looks good. Verified main and test-grad0.
> A positive side-effect is that the user can now control the number of tasks for each op.
- Increasing n_tasks[i] after planning may cause a work buffer overflow.
- For ops that can only run with n_tasks == 1, running them with n_tasks > 1 is undefined behavior.

Anyway, better than nothing.
> I would suggest ggml_cplan_make(): both short, as intended, and consistent with the struct naming.
Good suggestion.
I like grouping names with common prefixes, and currently ggml_graph_plan() fits neatly with the rest (ggml_graph_compute(), ggml_graph_reset()), so I prefer not to isolate it under a new prefix just yet. If we start doing more with plans, like ggml_cplan_optimize(), ggml_cplan_print(), etc., we can rename it.
Btw, this reminds me that ggml_graph_xxx probably should have been ggml_cgraph_xxx. Shall we change it?
> Increasing n_tasks[i] after planning may cause a work buffer overflow.
Ah, good point. So the developer has to be careful and modify n_tasks only for their own custom operators, or else be familiar with the inner workings of the ggml ops.