CLBlast icon indicating copy to clipboard operation
CLBlast copied to clipboard

Enqueue several calls

Open cabezabuque opened this issue 6 years ago • 13 comments

Hi all, first of all, congrats on a great library! I am implementing a power method, and I would need to perform iterative calls to matrix vector multiplication (cgemv). Currently, each call defines an event to notify when computation is finished, and I need to wait for that event, before I can issue the next multiplication. I am just a newbie, and this might be an extremely stupid question... but: Would it be possible to call CLBlast methods, defining an event that they need to wait for. What I would like to do is the following:

  1. call cgemv(M, V1, V2, event1); [wait for no events]
  2. call my normalization kernell [waits for event1, generates event2 when finished]
  3. call cgemv(M, V1, V2, event3); [waits for event2, notifies event3 when finished]
  4. call my normalization kernell again [waits for event3, generates event4 when finished]
  5. Keep iterating

I might be wrong, but I believe this could result in much better performance (issue all the commands from the CPU at once, let the events trigger other kernells from within OpenCL).

Anyways, I understand this might not be interesting for anyone, but I just wanted to share this thought...

Cheers!

Diego

cabezabuque avatar Nov 21 '18 13:11 cabezabuque

Actually, I don't think this is possible as it is now. Not sure how much gains you will get, because if you wait for the previous call just before the CLBlast API is called, there is not much time lost in between I guess.

The no longer maintained clBLAS does have this feature though, see e.g. here. That is what you are looking for, right? I'll have to think about how much effort it will be to add such a feature to CLBlast as well.

CNugteren avatar Nov 21 '18 19:11 CNugteren

Wow, that was some quick response time... Yeah, the feature in clBlas is exactly what I meant: to let CLBlast calls wait for and trigger other kernells, just like any other OpenCL call... Alternatively, I will imlpement the power method myself in a single kernell and do some profiling to see if I can get better performance. It always surprises me that, as simple as a matrix-vector multiplication looks, I never get close to the performance obtained by implementations of BLAS (what sort of black magic do you guys use??) Cheers!

cabezabuque avatar Nov 21 '18 20:11 cabezabuque

I have thought a little bit about this. I think in general it can be useful to add such a feature as you requested. However, you are so-far still the first to request it. Big problem with implementing this would be that the CLBlast API has to change: all functions will get extra arguments. In the C API this really breaks all existing code, for the C++ API it could be implemented as an optional argument with a default value. The work needed will not be too difficult I guess, but it has to be implemented for each routine so nevertheless it will be quite a bit of work.

What I propose to do for now is that you finalize your implementation using CLBlast. Then, I can add the feature you requested (on a branch of CLBlast) for GEMV only, and you can evaluate if you gain some speed. If it is really significant, I will consider changing the whole C++ API to support this.

CNugteren avatar Nov 24 '18 13:11 CNugteren

Hi! just a couple of comments:

  • Adding optional arguments to the methods would not break the C++ source-code but, would it break already compiled applications? I mean, a binary (.exe) would stop working if it replaced the .dll with the newer version of the .dll (an optional argument still gets translated into a call with a value for that extra parameter, and the .exe would expect a call to a function without those parameters at all). I am not sure about this (never had to do this with any .dll I produced), but I think this could be the case...

  • Would it be better to retain the old interface, and simply add new methods (with the events that the operation needs to wait for)? These methods could be placed in a separate .h file (not to affect older applications) and, over time, maybe one of the two interfaces will become deprecated (depending on users' demands/preferences).

In any case, getting access to a modified version of CLBast would be a blast (I could not help the pun, sorry). If I were to ask, it would be great to also get the method cgemm (I populate the matrices I use for multiplication in another kernel, and this feature would allow me to set the whole "pipeline" of operations at once: fillKernel->cgemm->[cgemv->normaliseKernel]xNUM_ITERATIONS)

Thanks a lot for your support. Also, please do not allow this request to become a source of added stress for you...

Cheers!

Diego

cabezabuque avatar Nov 25 '18 14:11 cabezabuque

Sorry for the late reply, but I finally found some time to implement something. The branch CLBlast-337-events-wait-list now has this optional argument implemented for the C++ interface. It is done for each routine, but actually only functionally implemented in code for GEMV and GEMM. If you want you can test either of those routines and see if it is indeed correct and whether it can speed your application up.

CNugteren avatar Jan 04 '19 20:01 CNugteren

Sorry for reviving such an old thread, but I've also found myself in a position where running CLBlast functions upon completion of other events (like OpenCL kernels can) would be hugely useful. You're probably overloaded with lots of things to do, but if there's any way I can help out, I'm happy to. I haven't looked at the events-list branch yet, not knowing if it's very stale or not.

The feature is useful for me because I wanna build a pretty big computation graph asynchronously and not have to return from the OpenCL device until it's all done. I guess this could sorta-kinda be emulated by doing CLBlast calls asynchronously in a separate thread on the host side, but it adds lots of complexity that is kind of orthogonal to OpenCL.

gspr avatar Sep 03 '21 20:09 gspr

I see now that the changes in the CLBlast-337-events-wait-list branch are rather mechanical, and I could probably get to work on updating that branch to current master head.

I have a question though: Is the intention not to expose the event waiting in the BLAS-like functions? Or am I reading the code wrong? I guess introducing new functions will avoid a SONAME bump, which is of course convenient.

Let me know your thoughts, and I could probably get to work on the nitty gritty part of refreshing that branch (if you'd like).

gspr avatar Sep 06 '21 09:09 gspr

You're probably overloaded with lots of things to do, but if there's any way I can help out, I'm happy to.

CLBlast is not really my priority at the moment, I don't have much free time available for this. But I'm willing to help / answer questions / review and accept a PR.

I see now that the changes in the CLBlast-337-events-wait-list branch are rather mechanical

Indeed, the diff is mostly repetitive. Actually, there is a Python script that generates these files even, so these changes are made automatically. See the diff for details on the changes in the scripts folder: most other changes in C++ files are result of those changes.

and I could probably get to work on updating that branch to current master head.

The good news here is that I'm not sure there is any work, if there is I guess it will be simple to do.

I have a question though: Is the intention not to expose the event waiting in the BLAS-like functions? Or am I reading the code wrong?

I'm not sure what you mean exactly. In my prototype branch I added an argument with the following signature and description: ``const std::vector<cl_event*>& event_wait_list: A list of OpenCL events that need to be executed before this routine's OpenCL kernel(s). This is an optional argument.. So I'm not exposing any internal events, only taking an extra argument. This fits your use-case I guess? The main motivation is that this makes the changes to CLBlast minimal.

But in general I'm not sure we should continue along this path given the extra arguments and such. I don't think it is much work to implement though, but the API changes could impact many users. I'm not sure it will be that helpful in general, although there are now two users already indeed that require this feature...

@gspr : would you be okay with updating that branch so that it is in sync with the latest master and then just using CLBlast from that branch? Or are you using pre-built packages?

CNugteren avatar Sep 07 '21 19:09 CNugteren

Cedric Nugteren @.***> writes:

CLBlast is not really my priority at the moment, I don't have much free time available for this. But I'm willing to help / answer questions / review and accept a PR.

I totally understand. I just wanna make sure any contributions I suggest are not counter to your visions for the (excellent & very useful!) software.

Indeed, the diff is mostly repetitive. Actually, there is a Python script that generates these files even, so these changes are made automatically. See the diff for details on the changes in the scripts folder: most other changes in C++ files are result of those changes.

Thanks!

The good news here is that I'm not sure there is any work, if there is I guess it will be simple to do.

I'll try to have a look the weekend after the coming one.

I'm not sure what you mean exactly. In my prototype branch I added an argument with the following signature and description: ``const std::vector<cl_event*>& event_wait_list: A list of OpenCL events that need to be executed before this routine's OpenCL kernel(s). This is an optional argument.. So I'm not exposing any internal events, only taking an extra argument. This fits your use-case I guess? The main motivation is that this makes the changes to CLBlast minimal.

I see, thanks. I misunderstood from my glance at the code.

But in general I'm not sure we should continue along this path given the extra arguments and such. I don't think it is much work to implement though, but the API changes could impact many users. I'm not sure it will be that helpful in general, although there are now two users already indeed that require this feature...

For the C++ interface it should be fine since the extra args are optional, right? Not so for the C interface though (the one that I personally rely on, for what it's worth). Is this something one could maybe consider bumping to a new major version for?

@gspr : would you be okay with updating that branch so that it is in sync with the latest master and then just using CLBlast from that branch? Or are you using pre-built packages?

For my own personal use, I can of course do that. However, I do think it's generally a significant improvement to the package, and I'm happy to do the work to help with actually making it part of CLBlast proper.

gspr avatar Sep 08 '21 07:09 gspr

Yes you are right, it might make sense to do this. Please go ahead, and let's for starters then only consider the C++ interface, make a PR for that, and then follow-up with the (more breaking) C interface. Let me know if you need any help with the way things are organized and the Python scripts that generated code. Hopefully my prototype branch can be of some help.

CNugteren avatar Sep 10 '21 19:09 CNugteren

Just as a follow-up we make use of CLBlast in PyFR for our OpenCL backend and would benefit greatly from this.

In terms of the C API (which we use) perhaps one option is to add a new set of functions suffixed by "WithEvents" that take the event wait list, leaving the existing functions unchanged?

Also, happy to test any candidate PR's.

FreddieWitherden avatar Dec 11 '21 18:12 FreddieWitherden

I seem to have promised too much when I said I'd look into this. Sorry, more urgent priorities have come up since then. I'm still interested in taking a stab at it if the CLBlast author himself doesn't find time.

gspr avatar Dec 13 '21 10:12 gspr

I have not had any time either; but would be able to test any branches. Is the plan to provide additional wrappers for all of the major functions or just the heavyweights from level 3 BLAS?

FreddieWitherden avatar Mar 08 '22 20:03 FreddieWitherden