TornadoVM
Adding asynchronous execution to TaskSchedule
Currently the TaskSchedule API contains only blocking versions of the execute methods:
void execute();
void execute(GridTask gridTask);
void executeWithProfiler(Policy policy);
void executeWithProfilerSequential(Policy policy);
void executeWithProfilerSequentialGlobal(Policy policy);
All these methods block the currently executing Java thread until all computations are done. However, the computations are typically off-loaded to the GPU, and at that moment the CPU just keeps waiting.
I think it should be both beneficial and possible to add asynchronous versions of the same methods with the following signatures:
CompletableFuture executeAsync();
CompletableFuture executeAsync(GridTask gridTask);
CompletableFuture executeWithProfilerAsync(Policy policy);
CompletableFuture executeWithProfilerSequentialAsync(Policy policy); // Not sure about "sequential async"
CompletableFuture executeWithProfilerSequentialGlobalAsync(Policy policy); // Not sure about "sequential async"
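For illustration, here is a minimal sketch of how a caller might use the proposed executeAsync (the thenRun continuation and the process callback are hypothetical, not part of any existing API):
import java.util.concurrent.CompletableFuture;

float[] input = ...  // object allocated on the Java heap
float[] output = ... // object allocated on the Java heap
TaskSchedule ts = new TaskSchedule("example")
    .streamIn(input)
    .task("async", Klass::method, input, output)
    .streamOut(output);
CompletableFuture<Void> done = ts.executeAsync(); // returns immediately
// the current Java thread is free to do other work here
done.thenRun(() -> process(output)); // hypothetical continuation, runs once the device finishes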
Thoughts about implementation
Per my understanding, TaskSchedule delegates to TornadoTaskSchedule, and this object waits on a driver-specific Event object. There are dedicated classes in each driver: CLEvent and PTXEvent.
OpenCL provides clSetEventCallback, CUDA has cudaLaunchHostFunc -- so it's possible to get async notifications from both OpenCL and PTX drivers.
So it should be possible to extend CLEvent and PTXEvent (plus PTXStream) with some form of listeners, where a concrete listener inside TornadoTaskSchedule can settle the CompletableFuture returned from the proposed TaskSchedule.executeAsync().
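A minimal sketch of that idea, assuming a hypothetical addListener hook on the driver events and a hypothetical scheduleAllTasks helper that enqueues the work and returns the final event:

public CompletableFuture<Void> executeAsync() {
    CompletableFuture<Void> result = new CompletableFuture<>();
    Event lastEvent = scheduleAllTasks(); // hypothetical: enqueue all work, return the last event
    lastEvent.addListener(status -> {     // hypothetical hook, backed by clSetEventCallback
        if (status >= 0) {                // or cudaLaunchHostFunc in the respective driver
            result.complete(null);
        } else {
            result.completeExceptionally(new RuntimeException("device error, status " + status));
        }
    });
    return result;
}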
Thoughts?
We wanted to add this functionality. We prefer to keep the CLEvent and PTXEvent classes with low-level functionality. However, using a callback function makes sense.
What we were thinking of is the async methods, as you proposed, in combination with ts.sync(), similar to Cilk Plus, for example.
The only thing is that in between ts.executeAsync() and ts.sync() there is no guarantee that the host data are consistent. The user has to be aware of that.
Concerning these functions:
CompletableFuture executeWithProfilerSequentialAsync(Policy policy); // Not sure about "sequential async"
CompletableFuture executeWithProfilerSequentialGlobalAsync(Policy policy); // Not sure about "sequential async"
It should be possible to run them asynchronously as well. The only thing is that the granularity of "execution" is different. However, these calls are still experimental.
Details here: https://dl.acm.org/doi/10.1145/3313808.3313819
Do you want to work on this new feature? In that case, I suggest writing and sharing a short document with the high-level architecture plus examples, so we can help. We are also implementing a new API for TornadoVM, so we can keep this in mind when taking architectural decisions.
Juan,
Is the data consistent after a regular TaskSchedule.execute()? If "yes", then how can it be inconsistent after the asynchronous completion of the last event in the OpenCL pipeline (or command queue) or the PTX stream?
I can try to make a small prototype with OpenCL, but not with CUDA; I haven't worked with it.
I meant in between the executeAsync and the await. Here's an example of what I mean:
float[] input = ... // object allocated on the Java heap
float[] output = ... // object allocated on the Java heap
TaskSchedule ts = new TaskSchedule("example")
.streamIn(input)
.task("async", Klass::method, input, output)
.streamOut(output);
ts.executeAsync();
// output is not filled yet.
// The output content is not guaranteed in between these two calls
ts.await();
// content of output from the device is guaranteed at this point
So, the user needs to know that the full content is available on the Java heap only after the execution of the await method.
Starting with the OpenCL sounds good. Looking forward to it.
Juan,
I see your point. But once developers get used to working with async pipelines built from CompletableFuture, they rarely run into such issues )
What concerns me more is the following:
When executing an OpenCL / CUDA callback we have to attach the GPU thread from JNI and invoke Java code. What developers wrongly and happily do is leave intensive computation code on this "borrowed" thread. And this is BAD. I think that we should enforce an explicit Executor to resume the program once the event fires, i.e. the API will be:
Executor executor = ...; // some ExecutorService
TaskSchedule ts = new TaskSchedule(...)...;
CompletableFuture onTaskScheduleComplete = ts.executeAsync(executor);
Besides switching to the "correct" (non-GPU) thread, we will provide a stronger hint that execution will be resumed later and on a different thread, so using output right after the call to executeAsync is not valid.
... or alternatively we might have two overloaded versions: one with an implicit executor, and one with ForkJoinPool.commonPool() passed explicitly (see the sketch below).
Sure, developers still have good chances to shoot themselves in the foot with a bogus Executor, or one whose rejection policy runs tasks on the current thread when the queue is full. But the library at least does its best to avoid problems.
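A sketch of the two overloads under that proposal (enqueueNonBlocking is a hypothetical internal hook that fires once the last device event completes):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

public CompletableFuture<Void> executeAsync() {
    return executeAsync(ForkJoinPool.commonPool()); // implicit executor
}

public CompletableFuture<Void> executeAsync(Executor executor) {
    CompletableFuture<Void> result = new CompletableFuture<>();
    enqueueNonBlocking(status ->
        // settle the future on the caller-supplied executor, never on the
        // driver-owned thread that delivered the native callback
        executor.execute(() -> result.complete(null)));
    return result;
}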
I am not sure I completely follow.
we have to attach gpu thread from JNI and invoke Java code.
You mean the Java thread that calls the GPU code? TornadoVM normally launches a grid of > 100x threads.
so using output right after the call to executeAsync is not valid.
But it is still legal Java code.
Your proposal regarding the Executor looks reasonable. One thing we take care of is to avoid having our API return different object types from the same functions. We want to keep the API as simple as possible. But I understand what you want to achieve.
I highly recommend creating a new document with examples and use cases to design this part. Perhaps a shared GDoc will do the job, so it is easier to suggest and add annotations.
I am not sure I completely follow.
we have to attach gpu thread from JNI and invoke Java code.
When doing callbacks from JNI code to Java code you have to do:
JavaVM* jvm;
...
JNIEnv* env;
jvm->AttachCurrentThread((void **) &env, &args);
So we are attaching to a thread that was not created in Java; it was created by OpenCL. And without quickly switching to our "own" thread, we keep it busy for the whole duration of the event listener (and that could be a significant time).
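In other words, the Java entry point invoked from the native callback should only schedule work and return (a sketch; onNativeEventComplete, callbackExecutor and completeFutureFor are all hypothetical names):

// Invoked via JNI from the clSetEventCallback handler. The calling thread
// belongs to the OpenCL driver, so this method must return immediately.
public static void onNativeEventComplete(long oclEventId, int status) {
    callbackExecutor.execute(() -> completeFutureFor(oclEventId, status));
}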
Shouldn't it be easy to run within an executor, or a future, inside the TaskSchedule execute method, and make all OpenCL calls async?
I think there is no need for a callback function; we can get the events and wait for them in the last operation. TornadoVM uses a single OpenCL command queue at the moment.
When the async method is called, we can call clWaitForEvents for the last event, making it a blocking call.
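That variant could still expose an async API by pushing the blocking wait onto a worker thread (a sketch, assuming hypothetical enqueueAllOperations and lastEvent helpers):

import java.util.concurrent.CompletableFuture;

public CompletableFuture<Void> executeAsync() {
    return CompletableFuture.runAsync(() -> {
        enqueueAllOperations(); // hypothetical: issue the async OpenCL calls
        lastEvent().waitOn();   // blocks this worker thread via clWaitForEvents
    });
}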
I'm trying to implement a prototype. Here is what I'm observing:
@Override
public void waitOn() {
    if (VM_USE_DEPS && event != null) {
        event.waitOn();
    } else {
        executionContext.getDevices().forEach(TornadoDevice::sync);
    }
}
i.e. by default it syncs the devices rather than waiting on a single event. So I set -Dtornado.vm.deps=true as a command-line argument to develop and test the callback. Afterwards, each and every test fails!
The reason is the implementation of CLEvent.waitOnPassive:
private void waitOnPassive() {
    try {
        internalBuffer[0] = 1;
        internalBuffer[1] = oclEventID;
        clWaitForEvents(internalBuffer);
    } catch (OCLException e) {
        error(e.getMessage());
    }
}
Obviously, there is no cl_event with handle 1 known to the driver. Hence an ACCESS_VIOLATION.
TornadoVM uses a single OpenCL command queue at the moment.
??? I.e. one queue shared by all TaskSchedules? Then it makes no sense to introduce any asynchronous capabilities while the runtime supports only serial execution and can't be used from multithreaded code at all. Is that the case?
It shouldn't be the case; from the docs of TornadoVM:
/**
 * There is an instance of the {@link TornadoVM} per
 * {@link TornadoTaskSchedule}. Each TornadoVM contains the logic to orchestrate
 * the execution on the parallel device (e.g., a GPU).
 */
Yes, it makes sense. TornadoVM orchestrates OpenCL execution on a single device (at the moment). All tasks that belong to the same task-schedule will be executed on the same device (e.g. on the same GPU).
What async execute means for us is a non-blocking call to the execute method that returns to the Java thread immediately. This does not contradict how we internally dispatch OpenCL code, because that will happen on another Java thread, via an Executor for example.
Currently, TornadoVM first compiles all tasks to OpenCL C, then installs the code in the code cache and builds the OpenCL programs. For the dispatch, TornadoVM uses async calls when dispatching OpenCL code (transfers H->D, and kernel launches). Data transfers from D->H are blocking. My suggestion is to make all calls non-blocking. When a sync or await method is called at the task-schedule level (ts.await()), we can sync using the last events.
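Under that scheme, ts.await() would reduce to waiting on the last events (a sketch; lastEvents is a hypothetical accessor):

public void await() {
    for (Event event : lastEvents()) { // hypothetical: the last event per device/queue
        event.waitOn();                // maps to clWaitForEvents on the OpenCL side
    }
}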
I think it is easier to design if we have a more descriptive document with some examples.
The CLEventsWrapper already holds all events fired to the command queue. So adding callbacks to them and coordinating them should do the async trick. Is there any reason why Device -> Host is currently blocking if you sync on TornadoDevice anyway?
Also, I need to confirm several things:
1. TornadoDeviceContext is per-TaskSchedule (it's a bit different from the OpenCL context, which could be shared between several queues).
2. The command queue is per-TaskSchedule.
The CLEventsWrapper already holds all events fired to the command queue. So adding callbacks to them and coordinating them should do the async trick. Is there any reason why Device -> Host is currently blocking if you sync on TornadoDevice anyway?
Only the last transfer is blocking. No, there is no specific reason. We plan to remove the sync after the execute for blocking calls. That's all.
1. TornadoDeviceContext is per-TaskSchedule (it's a bit different from the OpenCL context, which could be shared between several queues).
That's correct
2. The command queue is per-TaskSchedule.
Currently yes. We have an internal branch with multiple command queues, but this is still experimental.
Only the last transfer is blocking. No, there is no specific reason. We plan to remove the sync after the execute for blocking calls.
Could you point me to where I can adjust my local code to avoid this blocking call (maven module + class)? I'm a bit lost in the code structure. I've added experimental async support, but it looks like, due to this blocking call, I'm always running on an already-finished queue...
For testing I've done 2 changes:
- Altered uk.ac.manchester.tornado.runtime.TornadoVM to use a non-blocking read in executeStreamOutBlocking (btw, executeStreamOut here uses a blocking call on TornadoContextDevice; it looks like an error).
- Altered uk.ac.manchester.tornado.drivers.opencl.runtime.OCLTornadoDevice to handle atomics in streamOut like in streamOutBlocking.
This way I managed to run almost all tests (except one with the profiler and one with a character array). Even my tests with the prototype of executeAsync work (waiting on the callback of enqueueMarker without finishing the queue).
However, it works only when the -Dtornado.vm.deps=true flag IS NOT set. When it is set there is a different execution path, and even a regular synchronous execute leads to failing tests: no exceptions, but the results are wrong.
Could you point me to where I can adjust my local code to avoid this blocking call (maven module + class)? I'm a bit lost in the code structure. I've added experimental async support, but it looks like, due to this blocking call, I'm always running on an already-finished queue...
There are two main points: At the task-schedule level and the TornadoVM level
- Task-Schedule: https://github.com/beehive-lab/TornadoVM/blob/master/runtime/src/main/java/uk/ac/manchester/tornado/runtime/tasks/TornadoTaskSchedule.java#L605
- TornadoVM: https://github.com/beehive-lab/TornadoVM/blob/master/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVM.java#L665
For testing I've done 2 changes:
- Altered uk.ac.manchester.tornado.runtime.TornadoVM to use a non-blocking read in executeStreamOutBlocking (btw, executeStreamOut here uses a blocking call on TornadoContextDevice; it looks like an error).
- Altered uk.ac.manchester.tornado.drivers.opencl.runtime.OCLTornadoDevice to handle atomics in streamOut like in streamOutBlocking.
This way I managed to run almost all tests (except one with the profiler and one with a character array). Even my tests with the prototype of executeAsync work (waiting on the callback of enqueueMarker without finishing the queue).
However, it works only when the -Dtornado.vm.deps=true flag IS NOT set. When it is set there is a different execution path, and even a regular synchronous execute leads to failing tests: no exceptions, but the results are wrong.
Looks like good progress. IMO, the async mode needs vm.deps to be enabled (true) by default. Otherwise, we can't track events. If you are not getting correct results, feel free to open an issue and provide the logs. It might be a bug in that part.
There are two main points: At the task-schedule level and the TornadoVM level
Yep, found it. There were many other issues that I had to fix when trying to add true out-of-order execution (even inside the JNI code: blocking waits for read/write).
Attached is a patch with the fixed runtime + OpenCL, and an API for executeAsync(...).
Currently, with OpenCL, all tests run OK with both -Dtornado.ooo-execution.enable=true (ooo) and -Dtornado.ooo-execution.enable=false (partly blocking) on NVIDIA OpenCL.
On Intel OpenCL everything is OK for -Dtornado.ooo-execution.enable=false, but with -Dtornado.ooo-execution.enable=true I get a Segmentation Fault for just about everything. I need someone who can debug the code and find out the reason.
It would be great if anyone could apply this patch to the current develop branch and run the tests (for both settings of tornado.ooo-execution.enable) on AMD or another device.
0001-Enable-full-Out-of-order-execution-Adding-executeAsy.zip
Re-tested without my modifications on version 0.8:
$ tornado --printBytecodes --debug -Dtornado.ooo-execution.enable=true -Ds0.t0.device=0:1 uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D 128
Computing MxM of 128x128
task info: s0.t0
platform : Intel(R) OpenCL
device : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz CL_DEVICE_TYPE_CPU (available)
dims : 2
global work offset: [0, 0]
global work size : [12, 1]
local work size : null
/d/HUPLOAD/0.8/TornadoVM/bin/bin/tornado: line 269: 1367 Segmentation fault ${JAVA_CMD} ${JAVA_FLAGS} $@
vsilaev@P6T7_WS MINGW64 /d/HUPLOAD/0.8/TornadoVM
So with OOO enabled, Intel OpenCL 2.0 fails even with the existing code; it's not me who introduced the issue ))
Hi @vsilaev, thanks for the report. I cannot reproduce the error on my Linux machine for any of the devices (Intel CPU, Intel HD Graphics, and NVIDIA GPU). This may be a Windows-specific error. Please open a new issue with this error. We will try to work on it once we merge the Windows support.
There are two main points: At the task-schedule level and the TornadoVM level
Yep, found it. There were many other issues that I had to fix when trying to add true out-of-order execution (even inside the JNI code: blocking waits for read/write). Attached is a patch with the fixed runtime + OpenCL, and an API for executeAsync(...).
Currently, with OpenCL, all tests run OK with both -Dtornado.ooo-execution.enable=true (ooo) and -Dtornado.ooo-execution.enable=false (partly blocking) on NVIDIA OpenCL. On Intel OpenCL everything is OK for -Dtornado.ooo-execution.enable=false, but with -Dtornado.ooo-execution.enable=true I get a Segmentation Fault for just about everything. I need someone who can debug the code and find out the reason.
It would be great if anyone could apply this patch to the current develop branch and run the tests (for both settings of tornado.ooo-execution.enable) on AMD or another device.
0001-Enable-full-Out-of-order-execution-Adding-executeAsy.zip
Thank you, can you please open a PR to develop with this content? We will have a look from 1st Feb.
Thank you, can you please open a PR to develop with this content? We will have a look from 1st Feb.
Well, actually there is no need to do this. After 8 days of investigation I have come to the conclusion that this is fundamentally impossible due to one TornadoVM feature: it uses the Java heap rather than DirectBuffers for data transfers. And it's impossible to lock heap-based arrays asynchronously. Period.
Please check OCLCommandQueue.cpp#transferFromHostToDevice and OCLCommandQueue.cpp#transferFromDeviceToHost
Both reads and writes have clWaitForEvents(1, &writeEvent) and clWaitForEvents(1, &readEvent) respectively. Effectively, this means blocking calls, not the other way around. I commented out the clWaitForEvents and fixed the event processing in Java, but I get sporadic errors in the tests. Very rare, but this means the obvious thing: the GC has indeed evicted the corresponding array. The arrays are accessed by OpenCL native code, but without blocking we get env->ReleasePrimitiveArrayCritical(hostArray, buffer, JNI_ABORT) invoked immediately, and the Java GC may reclaim the memory before OpenCL is done with it. Ta-da!
I tried to move env->ReleasePrimitiveArrayCritical(hostArray, buffer, JNI_ABORT) into the clSetEventCallback handler, and got the canonical GC Locker problem, as described by Aleksey Shipilёv.
P.S. Here is a second patch, should be used after the first one in this thread. 0001-Trying-to-release-buffer-arrays-in-callback-JNI.zip
Ok, yet another try. The plan is the following:
- Keep the existing code with GetPrimitiveArrayCritical/ReleasePrimitiveArrayCritical in transferFromDeviceToHost/transferFromHostToDevice, to use a direct reference to the array memory on the heap.
- Instead of the clWaitForEvents between Get and Release, create a JNI NewGlobalRef to prevent the object from being garbage collected, and call DeleteGlobalRef in the clSetEventCallback handler.
Keeping my fingers crossed...
Status update.
- Obviously, just keeping a reference to avoid GC-ing the buffer array doesn't help: its location in memory may still change during GC.
- I rewrote OCLCommandQueue to use a DirectByteBuffer for non-blocking calls and left the existing "array-copying" code only for blocking reads/writes (see the sketch after this list).
- Fixed issues with event processing in TornadoVM and OCLTornadoDevice.
- Fixed issues in OCLEvent with static buffer usage.
- Added an OpenCL 1.1 compatible version of enqueue barrier / marker that works exactly the same as the OpenCL 1.2 one (OCLCommandQueue and OCLDeviceContext).
- Finally, added code for CompletableFuture TaskSchedule.executeAsync(...).
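For reference, the DirectByteBuffer idea from the second item amounts to copying the heap array into off-heap memory that the GC will never move, then handing that buffer to the non-blocking OpenCL call (a sketch of the assumed approach, not the actual patch):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Copy a heap array into a direct (off-heap) buffer that the GC cannot move,
// so a non-blocking write can safely read from it after the Java call returns.
// The buffer must stay reachable until the corresponding event fires.
static ByteBuffer toDirectBuffer(float[] hostArray) {
    ByteBuffer direct = ByteBuffer
        .allocateDirect(hostArray.length * Float.BYTES)
        .order(ByteOrder.nativeOrder());
    direct.asFloatBuffer().put(hostArray);
    return direct;
}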
Tested on both NVIDIA and AMD. Almost all of the tests -- ~325 out of 331 -- run ok in blocking, default (mixed) and non-blocking modes. The only exceptions are:
Running test: testVectorChars ................ [FAILED]
\_[REASON] expected:<102> but was:<33126>
Sporadically happens on both AMD and NVIDIA in any mode (default, blocking, non-blocking).
102 = 0x0066
33126 = 0x8166
I guess this is something related to handling two-byte types. From the tests I saw that you implemented special handling for the single-byte type. Probably, two-byte types should be addressed as well, because one byte is always garbage while the second byte is set correctly.
Running test: testProfilerEnabled ................ [FAILED]
\_[REASON] null
Always happens on both AMD and NVIDIA in non-blocking mode. Blocking and default modes are ok.
Running test: testComputePi ................ [FAILED]
\_[REASON] expected:<3.14> but was:<5.1518049240112305>
Always happens on AMD (any mode). I need to check on Tornado 0.8; probably this is not my regression at all.
And the asynchronous invocation itself (i.e. TaskSchedule.executeAsync(...)) works as expected.
Waiting for your approval of my previous PR; then I'll share these results.