ggml
ggml copied to clipboard
callback to abort ggml_graph_compute()
closes #308
This assumes that once the callback returns true, it will continue to return true
Thanks for the quick response!
pthread_join waits for the thread to exit, so it would require every thread to check the abort callback. Because that seemed inefficient, i decided to only have the main thread check it and clean everything up.
Really good idea! We can use this quite well in ggml-gobject with GCancellable.
Just weighing in here - the callback should accept some user_data pointer. Eg:
void ggml_graph_compute_with_abort(struct ggml_context * ctx, struct ggml_cgraph * cgraph, bool (*abort_callback)(void *), void *abort_callback_data)
and
if (abort_callback(abort_callback_data)) { ... }
This is important for the case where we need to query some external data (eg, the GCancellable state in my case) in order to determine whether to abort computation.
pthread_join waits for the thread to exit, so it would require every thread to check the abort callback. Because that seemed inefficient, i decided to only have the main thread check it and clean everything up.
Adding pthread_cancel would also need some Windows-specific implementation and as far as I can understand how this call works, it can leave things into intermediate state. I could be misunderstanding, but I prefer to avoid it.
The overhead from every thread checking the callback should not be too big
pthread_cancel and equivalent are not available on all platforms too (thinking of android here)
Even if pthread_cancel or sig USER works, that are async way to force stopping running threads.
Async is not easy to control, thinks about not all threads are killed at once, some are stopping, but some are still running.
Graceful stop is preferred if possible, not only because we can control details, but also because we can implement it.
ggml_graph_comput()execute node runners one by one, thus we have natural checkpoints between node computing. average node computing is fast; there are hundreds of mul_mat in every graph compute, even if single mul_mat is slow, chances we abort at next node.- in https://github.com/ggerganov/llama.cpp/pull/1999/, the
ggml_graph_compute_planis passed intoggml_graph_compute()andggml_graph_compute_thread(). To notify thread group, we simply add a fieldabortintoggml_graph_compute_plan.
I see, pthread_cancel seems like a bad idea. Earlier today I changed it so that every thread calls the callback and immediately returns. This might be graceful?
@mqy that's also a good way to do it. Unsure which way to go.
I see,
pthread_cancelseems like a bad idea. Earlier today I changed it so that every thread calls the callback and immediately returns. This might be graceful?
Me too, sorry, I saw your new commit after submitting comment :) The callback is graceful.
Unsure which way to go.
I can't say yes or no, it depends. Anyway, please let me ask some questions:
- to determine whether abort or not, compute threads have to call an external function with opaque data, this is unnecessary, a bool flag is enough, right?
- why the new
ggml_graph_compute_with_abort()? to avoid changing signature ofggml_graph_compute?
Overall I think it's not good to create ggml_graph_compute_with_abort() which changed the compute implementation even if just a thin wrapper. It should be seen as anti-pattern. Suppose in the future we want to support suspend/resume, should we add ggml_graph_compute_with_suspend and ggml_graph_compute_with_resume?
So, I suggest adding state management APIs, for example: ggml_graph_compute_abort(ctx_or_plan) or more generic ggml_graph_compute_set_state(ctx_or_plan, ABORT).
I can't say yes or no, it depends. Anyway, please let me ask some questions:
- to determine whether abort or not, compute threads have to call an external function with opaque data, this is unnecessary, a bool flag is enough, right?
I think using a callback for aborting is pretty standard (e.g. tensorflow/keras). It's also easier for a user to pass in a function rather than having to create a separate thread and modifying some shared state to abort. By his comment earlier, I think @smspillaz plans on using a callback, but I'm unsure.
- why the new
ggml_graph_compute_with_abort()? to avoid changing signature ofggml_graph_compute?Overall I think it's not good to create
ggml_graph_compute_with_abort()which changed the compute implementation even if just a thin wrapper. It should be seen as anti-pattern. Suppose in the future we want to support suspend/resume, should we add ggml_graph_compute_with_suspend and ggml_graph_compute_with_resume?So, I suggest adding state management APIs, for example: ggml_graph_compute_abort(ctx_or_plan) or more generic ggml_graph_compute_set_state(ctx_or_plan, ABORT).
Agreed! The context is the best place to store it.
We should wait until ggml: breaking change to ggml_graph_compute(): planning is required before computing llama.cpp#1999 is merged or just add this feature into it.
I think using a callback for aborting is pretty standard
Sure, callback is more versatile, and convenient to use, this is why I said it depends.
Overall, if we consider conditional abort is important, I think the callback design wins.
@CCLDArjun
After merging https://github.com/ggerganov/llama.cpp/pull/1999 the ggml_graph_compute() API has changed now so the PR needs to be adapted.
I think we should add the abort callback to the new struct ggml_cplan. The user will set the callback and the data by directly accessing the members of the struct. The new ggml_graph_compute_with_ctx() will not support abort for now as we want to eventually move away from it anyway.
I was waiting for that to be merged so I could add it onto struct ggml_cplan but I didn't realize that it had been merged.
Just made the changes, @ggerganov