ggml_graph_compute: deprecate using ggml_context, try to resolve issue #287
Try to resolve https://github.com/ggerganov/ggml/issues/287
Intro
The design is a bit different from the suggested one: the buffer type is named as a generalized one, ggml_cgraph_context:
```c
struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    bool   planned; // true means ready to compute graph nodes
};
```
I'll explain planned later; let's focus on the APIs first.
- The first parameter ctx of ggml_graph_compute() is deprecated (passing NULL is OK).
- Removed wsize and work from ggml_cgraph. This is unlikely to break external users, because there is no reason to use them directly.
- To avoid breaking external users, we cannot simply change the signature of ggml_graph_compute(), so ggml_graph_compute_v2() had to be added. The name looks weird, but that is the reality :/ Now ggml_graph_compute() is a thin wrapper around ggml_graph_compute_v2().
- Extracted code into a new function ggml_graph_compute_plan(), which sets node->n_tasks and calculates work_size.
Usage
```c
struct ggml_cgraph_context ctx = {0};

// case 1: explicit plan; the caller allocates the work buffer
ggml_graph_compute_plan(&ctx, cgraph);
if (ctx.work_size > 0) {
    ctx.work_data = malloc(ctx.work_size);
}
ggml_graph_compute_v2(&ctx, cgraph);

// case 2: un-planned (zero-initialized) context; compute plans internally
ggml_graph_compute_v2(&ctx, cgraph);

// case 3: NULL context; compute manages planning and the buffer itself
ggml_graph_compute_v2(NULL, cgraph);

// case 4: legacy wrapper, gctx is a struct ggml_context *
ggml_graph_compute(gctx, cgraph);

// case 5: legacy wrapper with a NULL context
ggml_graph_compute(NULL, cgraph);
```
Why did I add the field planned?
Because:
- The ctx is allowed to be NULL, empty, or initialized by ggml_graph_compute_plan().
- One of the corner cases: work_size and work_data can keep their default values even after the plan has been run, so we cannot determine whether ggml_graph_compute_plan() was called simply by checking for default values. And ggml_graph_compute_plan() MUST be called, because it also sets node->n_tasks; work_size depends on n_tasks.

The planned flag makes the plan-compute sequence stateful, which is not good enough. Any ideas? The sketch below makes the corner case concrete.
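A minimal illustration (assuming the struct and APIs above):

```cpp
struct ggml_cgraph_context ctx = {0};  // work_size == 0, work_data == NULL
ggml_graph_compute_plan(&ctx, cgraph); // a small graph can legitimately leave
                                       // work_size == 0 and work_data == NULL
// compute now sees the same field values as an un-planned context, so it
// cannot tell "planned, no work buffer needed" apart from "never planned";
// ctx.planned disambiguates, and node->n_tasks has been set either way
```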
> ggml_graph_compute_plan() MUST be called because it also sets node->n_tasks. The work_size depends on n_tasks.
I think that n_tasks should be removed from ggml_tensor. For now, the easiest way to address this may be to move it to the new ggml_cgraph_context, but eventually I think this is something that ggml_graph_compute should be able to determine automatically when executing a node.
Good job!
> I think that n_tasks should be removed from ggml_tensor.
Of course; I think n_tasks should belong to the compute facility, so ideally it would migrate somewhere else.
So: create another data structure, for example tensor_compute_config, with a 1:1 mapping to graph nodes; place those inside ggml_cgraph_context; and while computing nodes, set params.nth to the n_tasks from the sibling i-th config object. Is that right?
> Of course; I think n_tasks should belong to the compute facility, so ideally it would migrate somewhere else.
Yes precisely! That's what I was thinking as well.
I am not sure that we need a new struct for this; is there any other data from the tensor that would belong there? I am thinking that adding a simple int n_tasks[GGML_MAX_NODES] to ggml_cgraph_context with a 1:1 mapping to the graph nodes should do it. It is not great, but for now it will solve the problem, and eventually I think that we can remove it and have ggml_graph_compute determine this automatically instead.
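Presumably something like this sketch (only the array placement is new relative to the struct above):

```cpp
#include <cstddef>

#define GGML_MAX_NODES 4096 // from ggml.h; value illustrative

// per-node task counts move out of ggml_tensor and into the compute
// context, mapped 1:1 to cgraph->nodes
struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    bool   planned;
    int    n_tasks[GGML_MAX_NODES]; // n_tasks[i] applies to cgraph->nodes[i]
};
```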
> is there any other data from the tensor that would belong there?

No, nothing else.
> adding a simple int n_tasks[GGML_MAX_NODES]
Great! The n_tasks array is good enough.
> and eventually I think that we can remove it and have ggml_graph_compute determine this automatically instead.
That will be a long story. I once tried adding another enum value to ggml_task_type named GGML_TASK_CONFIG: ggml_graph_compute() ran the planning before computing by calling ggml_graph_forward() for every node, similar to the planning idea in this PR. At that time I preferred this kind of design because the computing code knows exactly what it can provide (parallelism?) and what it requires (wsize?). That is easy to maintain: just glance at the actual code, with fewer chances to misconfigure.
I think ggml_graph_compute() may not need to hard-code everything in the long run; it would be better to build some kind of new protocol/contract between the scheduler and the actual computing code. To achieve this, we could construct a slightly abstract data structure on top of {node, params}, with methods such as config() and compute(). With this design, we could create and register several compute vendors (with an id/name, for example) for every kind of node, which would provide great flexibility for evaluating experimental algorithms and benchmarking. With a compute vendor id, we would hopefully be able to define a compute plan offline (or, if that is unnecessary, at least specify certain vendors); this is similar to the dumped cgraph. A rough sketch of such a vendor contract is below.
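A hedged sketch of what such a contract might look like; all names here are hypothetical, nothing like this exists in ggml at this point:

```cpp
#include <cstddef>

struct ggml_tensor;         // opaque, from ggml.h
struct ggml_compute_params; // opaque, from ggml.h

// hypothetical compute-vendor contract
struct ggml_compute_vendor {
    const char * name; // vendor id, e.g. "cpu-default", "cpu-experimental"

    // planning: report parallelism and required work size for this node
    void (*config)(const ggml_tensor * node, int * n_tasks, size_t * wsize);

    // execution: compute the node with the given params
    void (*compute)(ggml_compute_params * params, ggml_tensor * node);
};

// the scheduler would look up a vendor per node, roughly:
//   const ggml_compute_vendor * v = vendor_find(node_op, "cpu-default");
//   v->config(node, &n_tasks, &wsize); // during planning
//   v->compute(params, node);          // during compute
```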
Just some random thoughts :D
That's very similar to what I have been thinking. I am working on a CUDA implementation that can execute ggml_cgraphs directly, and what it needs to do that is very similar to this: a method to "prepare" or "plan" the graph, and another to execute the "prepared" graph. Eventually, we will want a common interface to all compute backends so that users can use any of them interchangeably, without having to add specific code for each one.
While doing this, I think that we should also remove n_threads from ggml_cgraph, since this parameter is only relevant to the CPU compute backend and not applicable to the other compute backends.
> Eventually, we will want a common interface to all compute backends so that users can use any of them interchangeably, without having to add specific code for each one.
Yes, I have seen the changes over the last two months; they keep getting better. Both the CUDA and CL implementations impose potential decoupling requirements.
> While doing this, I think that we should also remove n_threads from ggml_cgraph, since this parameter is only relevant to the CPU compute backend and not applicable to the other compute backends.
Sure, similar to n_tasks in the tensor as well, but I have no idea where to put it at compute time. Hello, ggml_cgraph_context?
At present I'm not sure, but it looks like n_threads is still required for cross-cgraph computing, especially when a node runner is allowed to hijack the computing flow or fall back to the default vendor (the CUDA way). In a cross-cgraph context, should n_threads or a thread_ctx be owned by something like a compute session?
I think it's OK to leave n_threads there for a while; GPU code can simply ignore it, unless the GPU implementation has its own complete cgraph that is not shared with the CPU, but that is yet another long story.
What I was thinking is that n_threads could be a parameter to ggml_graph_compute_plan, and it would also be stored in ggml_cgraph_context for use by ggml_graph_compute (sketched below).
For now, the CUDA runner can still operate in the same way, unaffected by this, but in the long run this method of hijacking the compute flow should be removed. Mixed CPU/GPU execution can be achieved in other ways, such as by splitting the computation into multiple graphs, each of them executed by a different backend.
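A sketch of that shape, using the names from this PR at the time; the exact layout and signature are assumptions:

```cpp
#include <cstddef>

struct ggml_cgraph; // opaque, from ggml.h

// n_threads becomes an input to planning and is carried in the context,
// where ggml_graph_compute reads it
struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    int    n_threads; // set by the plan, consumed by compute
};

void ggml_graph_compute_plan(ggml_cgraph_context * ctx,
                             ggml_cgraph * cgraph,
                             int n_threads);
```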
I like the idea. However, I am not in favor of the implementation:
- We should not rely on consumer behavior for correctness ("don't touch the plan"). We should either not expose cgraph_compute_context at all, or maintain the state elsewhere.
- A common API pattern for an externally provided buffer is: pass NULL as buf together with a buffer size; the API returns the size it expects without doing the real computation. If a valid pointer is passed but the size is less than expected, the function fails in the same way (see the sketch after this list).
- plan is not a clear way to indicate that this function needs to be called before the compute API. Having compute_v2 take the result of the plan would make the API explicit. Also, maybe create_compute_context is a better API name; plan is too general.
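A self-contained illustration of that buffer-size pattern; all names here are hypothetical, not part of ggml:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// the callee reports the size it needs when given no buffer (or one that is
// too small), and only computes when a large enough buffer is supplied
static bool compute_with_buffer(void * buf, size_t * size) {
    const size_t required = 1024; // stand-in for the computed work_size
    if (buf == nullptr || *size < required) {
        *size = required; // tell the caller how much is needed
        return false;
    }
    // ... real computation using buf would go here ...
    return true;
}

int main() {
    size_t size = 0;
    compute_with_buffer(nullptr, &size);              // first call: query size
    std::vector<char> buf(size);
    bool ok = compute_with_buffer(buf.data(), &size); // second call: compute
    std::printf("computed: %s (buffer of %zu bytes)\n", ok ? "yes" : "no", size);
}
```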
> I am not in favor of the implementation
Yes, those are all good points. A general context is used for future extensibility, but it should be seen as over-design. If you read the previous comments, you would reach the conclusion that a general context tends to be "extended" into a catch-all basket.
So, if we can't tell what we want beyond the current requirements, a specialized API is the best choice. Anyway, I don't think plan is a bad idea; the actual problem is that we haven't decoupled calculating wsize from setting n_tasks. Perhaps we should reconsider this design after splitting out that part of the code.
> I think that n_tasks should be removed from ggml_tensor
Agree
Good points by @howard0su
Overall, I think this is on the right track and we should finalize the PR and merge it.
There is no need to worry about backwards compatibility (i.e. the "_v2" stuff is not necessary). Just assume that we have only just started developing the API and that nobody has been using it or expecting it to be compatible. We will start to worry about this kind of thing after the v1.0 release, sometime in the future.
> There is no need to worry about backwards compatibility

This is great. I'll rewrite this PR and show you later.
The PR description was updated. You may want to have a look at it. Apart from main and perplexity, I also tested:
- tests/test-grad0
- tests/test-opt
- examples/benchmark
- examples/baby-llama
- examples/train-text-from-scratch (force-stopped at Example 2, opt iter 29)
No crash, all of them output reasonable text.
So this PR is ready for review again.
Looks good, I only have a few minor nits:
- In llama.cpp, to avoid allocations in every eval, the work buffer memory could be stored as a std::vector in llama_context. Just resize it to the work_size, and if it is already big enough, it won't reallocate memory (see the sketch after this list).
- More generally, in all the C++ code, I would suggest always using std::vector instead of malloc/free.
- I am not sure about ggml_graph_compute_sugar.
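A minimal sketch of the suggested buffer reuse; the holder struct and member names are stand-ins, not the actual llama_context:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// hypothetical holder standing in for llama_context
struct eval_state {
    std::vector<uint8_t> work_buffer;
};

// called on every eval: resize() never shrinks capacity, so once the buffer
// has grown to the largest work_size seen, later evals reuse the allocation
static void * get_work_data(eval_state & st, size_t work_size) {
    st.work_buffer.resize(work_size);
    return work_size > 0 ? st.work_buffer.data() : nullptr;
}

int main() {
    eval_state st;
    void * a = get_work_data(st, 1024); // allocates
    void * b = get_work_data(st, 512);  // reuses the existing allocation
    (void)a; (void)b;
}
```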
Thanks @slaren
> - In llama.cpp, to avoid allocations in every eval, the work buffer memory could be stored as a std::vector in llama_context. Just resize it to the work_size, and if it is already big enough, it won't reallocate memory.
> - More generally, in all the C++ code, I would suggest always using std::vector instead of malloc/free.
Makes sense; this is also one of my concerns. I'll try the std::vector way.
> I am not sure about ggml_graph_compute_sugar
The sugar is used by baby-llama, which calls ggml_graph_compute again and again. I suggest we defer it to the next PR.
https://github.com/ggerganov/llama.cpp/pull/1999/commits/bf63002af94bc8fe39bca1a44e1344fad420f96b is the latest commit, per @slaren's suggestion.
TODO: the sugar used by baby-llama (it calls ggml_graph_compute again and again) is deferred to the next PR.
Rebased onto master.
rebased
I've refactored the changes:
- added ggml_graph_compute_with_ctx(), similar to the old way of graph computation
- renamed struct ggml_graph_compute_plan to struct ggml_cplan
- renamed ggml_graph_compute_make_plan() to ggml_graph_plan()
- factored the common "plan + compute" code into a helper function ggml_graph_compute_helper() (sketched below)
- enabled the test-grad0 test
- fixed the Metal build
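For reference, the factored helper presumably looks something like this sketch, assuming the renamed API above (struct ggml_cplan, ggml_graph_plan, ggml_graph_compute); details may differ from the actual ggml_graph_compute_helper():

```cpp
#include <cstdint>
#include <vector>
#include "ggml.h"

// plan, size the work buffer, attach it, then compute
static void graph_compute_helper(std::vector<uint8_t> & buf,
                                 struct ggml_cgraph * graph, int n_threads) {
    struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
    if (plan.work_size > 0) {
        buf.resize(plan.work_size);
        plan.work_data = buf.data();
    }
    ggml_graph_compute(graph, &plan);
}
```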
Overall the change is good.
A positive side-effect is that the user can now control the number of tasks for each op. This could also be used when creating custom operators, if we expose struct ggml_compute_params to the custom function signatures. A sketch of this is below.
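A hedged sketch of that control with the renamed API; the node index i and the single-threaded choice are illustrative, and, as noted further down in the thread, increasing n_tasks after planning risks overflowing the work buffer, so forcing fewer tasks is the safe direction:

```cpp
// after planning, override the task count of one node before computing
struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
plan.n_tasks[i] = 1; // e.g. run custom-op node i single-threaded
ggml_graph_compute(graph, &plan);
```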
@mqy @slaren Please take a look
> renamed ggml_graph_compute_make_plan() to ggml_graph_plan()
I would suggest ggml_cplan_make(): both short, as intended, and consistent with the struct naming.
I think this looks good.
> A positive side-effect is that the user can now control the number of tasks for each op. This could also be used when creating custom operators, if we expose struct ggml_compute_params to the custom function signatures.
Eventually we should support custom ops on the GPU backends too. It will be slow since it will require copying data back and forth from the device, but that's still a lot better than not supporting them at all. So whenever we do this, I think we should do it in a backend-agnostic way, rather than tying it to the CPU backend.
> I've refactored the changes:
Looks good. Verified main and test-grad0.
> A positive side-effect is that the user can now control the number of tasks for each op.
- Increasing n_tasks[i] after planning may cause a work buffer overflow.
- For ops that can only run with n_tasks == 1, running them with n_tasks > 1 is undefined behavior.

Anyway, better than nothing.
> I would suggest ggml_cplan_make(): both short, as intended, and consistent with the struct naming.
Good suggestion.
I like grouping names with common prefixes, and currently ggml_graph_plan() fits neatly with the rest (ggml_graph_compute(), ggml_graph_reset()), so I prefer not to isolate it under a new prefix just yet. If we start doing more with plans, like ggml_cplan_optimize(), ggml_cplan_print(), etc., we can rename it.
Btw, this reminds me that ggml_graph_xxx probably should have been ggml_cgraph_xxx. Shall we change it?
> Increasing n_tasks[i] after planning may cause a work buffer overflow.
Ah, good point. So the developer has to be careful and modify n_tasks only for their own custom operators, or else be familiar with the inner workings of the ggml ops.