ggml callback to abort ggml_graph

closes #308

This assumes that once the callback returns true, it will continue to return true

Jul 02 '23 13:07 CCLDArjun

Thanks for the quick response!

pthread_join waits for the thread to exit, so it would require every thread to check the abort callback. Because that seemed inefficient, i decided to only have the main thread check it and clean everything up.

Jul 03 '23 07:07 CCLDArjun

Really good idea! We can use this quite well in ggml-gobject with GCancellable.

Just weighing in here - the callback should accept some user_data pointer. Eg:

void ggml_graph_compute_with_abort(struct ggml_context * ctx, struct ggml_cgraph * cgraph, bool (*abort_callback)(void *), void *abort_callback_data)

and

if (abort_callback(abort_callback_data)) { ... }

This is important for the case where we need to query some external data (eg, the GCancellable state in my case) in order to determine whether to abort computation.

Jul 03 '23 09:07 smspillaz

pthread_join waits for the thread to exit, so it would require every thread to check the abort callback. Because that seemed inefficient, i decided to only have the main thread check it and clean everything up.

Adding pthread_cancel would also need some Windows-specific implementation and as far as I can understand how this call works, it can leave things into intermediate state. I could be misunderstanding, but I prefer to avoid it.

The overhead from every thread checking the callback should not be too big

Jul 04 '23 17:07 ggerganov

pthread_cancel and equivalent are not available on all platforms too (thinking of android here)

Jul 05 '23 08:07 Green-Sky

Even if pthread_cancel or sig USER works, that are async way to force stopping running threads. Async is not easy to control, thinks about not all threads are killed at once, some are stopping, but some are still running.

Graceful stop is preferred if possible, not only because we can control details, but also because we can implement it.

ggml_graph_comput() execute node runners one by one, thus we have natural checkpoints between node computing. average node computing is fast; there are hundreds of mul_mat in every graph compute, even if single mul_mat is slow, chances we abort at next node.
in https://github.com/ggerganov/llama.cpp/pull/1999/, the ggml_graph_compute_plan is passed into ggml_graph_compute() and ggml_graph_compute_thread(). To notify thread group, we simply add a field abort into ggml_graph_compute_plan.

Jul 05 '23 12:07 mqy

I see, pthread_cancel seems like a bad idea. Earlier today I changed it so that every thread calls the callback and immediately returns. This might be graceful?

@mqy that's also a good way to do it. Unsure which way to go.

Jul 05 '23 13:07 CCLDArjun

I see, pthread_cancel seems like a bad idea. Earlier today I changed it so that every thread calls the callback and immediately returns. This might be graceful?

Me too, sorry, I saw your new commit after submitting comment :) The callback is graceful.

Unsure which way to go.

I can't say yes or no, it depends. Anyway, please let me ask some questions:

to determine whether abort or not, compute threads have to call an external function with opaque data, this is unnecessary, a bool flag is enough, right?
why the new ggml_graph_compute_with_abort()? to avoid changing signature of ggml_graph_compute?

Overall I think it's not good to create ggml_graph_compute_with_abort() which changed the compute implementation even if just a thin wrapper. It should be seen as anti-pattern. Suppose in the future we want to support suspend/resume, should we add ggml_graph_compute_with_suspend and ggml_graph_compute_with_resume?

So, I suggest adding state management APIs, for example: ggml_graph_compute_abort(ctx_or_plan) or more generic ggml_graph_compute_set_state(ctx_or_plan, ABORT).

Jul 05 '23 21:07 mqy

I can't say yes or no, it depends. Anyway, please let me ask some questions:

to determine whether abort or not, compute threads have to call an external function with opaque data, this is unnecessary, a bool flag is enough, right?

I think using a callback for aborting is pretty standard (e.g. tensorflow/keras). It's also easier for a user to pass in a function rather than having to create a separate thread and modifying some shared state to abort. By his comment earlier, I think @smspillaz plans on using a callback, but I'm unsure.

why the new ggml_graph_compute_with_abort()? to avoid changing signature of ggml_graph_compute?

Overall I think it's not good to create ggml_graph_compute_with_abort() which changed the compute implementation even if just a thin wrapper. It should be seen as anti-pattern. Suppose in the future we want to support suspend/resume, should we add ggml_graph_compute_with_suspend and ggml_graph_compute_with_resume?

So, I suggest adding state management APIs, for example: ggml_graph_compute_abort(ctx_or_plan) or more generic ggml_graph_compute_set_state(ctx_or_plan, ABORT).

Agreed! The context is the best place to store it.

Jul 06 '23 05:07 CCLDArjun

We should wait until ggml: breaking change to ggml_graph_compute(): planning is required before computing llama.cpp#1999 is merged or just add this feature into it.

Jul 06 '23 05:07 CCLDArjun

I think using a callback for aborting is pretty standard

Sure, callback is more versatile, and convenient to use, this is why I said it depends.

Overall, if we consider conditional abort is important, I think the callback design wins.

Jul 06 '23 13:07 mqy

@CCLDArjun

After merging https://github.com/ggerganov/llama.cpp/pull/1999 the ggml_graph_compute() API has changed now so the PR needs to be adapted.

I think we should add the abort callback to the new struct ggml_cplan. The user will set the callback and the data by directly accessing the members of the struct. The new ggml_graph_compute_with_ctx() will not support abort for now as we want to eventually move away from it anyway.

Jul 11 '23 16:07 ggerganov

I was waiting for that to be merged so I could add it onto struct ggml_cplan but I didn't realize that it had been merged.

Just made the changes, @ggerganov

Jul 11 '23 19:07 CCLDArjun

ggml
ggml copied to clipboard

callback to abort ggml_graph_compute()

ggml ggml copied to clipboard

callback to abort ggml_graph_compute()

ggml
ggml copied to clipboard