TornadoVM
dev branch - PTX error 701 and 700 on Irregulars examples
Carried over from https://github.com/beehive-lab/TornadoVM/discussions/120#discussioncomment-3137390
I am running the Irregulars example and, as described in the discussion linked above, the PTX backend reports error code 701.
When I change the source code with s/float/double/g and rebuild, the reported error changes to 700.
This also happens after a fresh reboot, which I tried just to be sure.
WARNING: Using incubator modules: jdk.incubator.vector, jdk.incubator.foreign
Size = 2000
[TornadoVM-PTX-JNI] ERROR : cuModuleLoadData -> Returned: 700
PTX to cubin JIT compilation failed! (700)
PTX JIT compilation failed!
Unable to compile task task XXX__GENERATED_REDUCE0.reduce_seq0 - rAdd
[[email protected]/uk.ac.manchester.tornado.drivers.ptx.runtime.PTXTornadoDevice.compileTask(PTXTornadoDevice.java:192), [email protected]/uk.ac.manchester.tornado.drivers.ptx.runtime.PTXTornadoDevice.installCode(PTXTornadoDevice.java:145), [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.compileTaskFromBytecodeToBinary(TornadoVM.java:477), [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:741), [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:221), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:720), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1049), [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300), [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.executeExpression(ReduceTaskSchedule.java:592), [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.scheduleWithReduction(ReduceTaskSchedule.java:577), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.rewriteTaskForReduceSkeleton(TornadoTaskSchedule.java:992), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.reduceAnalysis(TornadoTaskSchedule.java:1002), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.analyzeSkeletonAndRun(TornadoTaskSchedule.java:1012), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1042), [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300), org.bereft.greatexpenses.ReductionIrregular.run(ReductionIrregular.java:60), org.bereft.greatexpenses.ReductionIrregular.main(ReductionIrregular.java:81)]
[email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.compileTaskFromBytecodeToBinary(TornadoVM.java:481)
[email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:741)
[email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:221)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:720)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1049)
[email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.executeExpression(ReduceTaskSchedule.java:592)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.scheduleWithReduction(ReduceTaskSchedule.java:577)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.rewriteTaskForReduceSkeleton(TornadoTaskSchedule.java:992)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.reduceAnalysis(TornadoTaskSchedule.java:1002)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.analyzeSkeletonAndRun(TornadoTaskSchedule.java:1012)
[email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1042)
[email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
org.bereft.greatexpenses.ReductionIrregular.run(ReductionIrregular.java:60)
org.bereft.greatexpenses.ReductionIrregular.main(ReductionIrregular.java:81)
[TornadoVM-PTX-JNI] ERROR : cuStreamSynchronize -> Returned: 700
Result is not correct - iteration: 0 expected: 1011.7773048769373 but found: 1503.754977668702
Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] TornadoVM Bytecode not recognized
at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.throwError(TornadoVM.java:650)
at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:769)
at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:221)
at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:720)
at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1049)
at [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
at [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.executeExpression(ReduceTaskSchedule.java:592)
at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.runReduceTaskSchedule(TornadoTaskSchedule.java:987)
at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.analyzeSkeletonAndRun(TornadoTaskSchedule.java:1014)
at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1042)
at [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
at org.bereft.greatexpenses.ReductionIrregular.run(ReductionIrregular.java:60)
at org.bereft.greatexpenses.ReductionIrregular.main(ReductionIrregular.java:81)
[TornadoVM-PTX-JNI] ERROR : cuStreamDestroy -> Returned: 700
[JNI] /vol/xfs01/work/TornadoVM/drivers/ptx-jni/target/linux-amd64-release/sources/source/PTXStream.cpp:188 in function: free_staging_area_list result = 700
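For readers decoding the log above: the numbers returned by the `cu*` calls are `CUresult` values from the CUDA driver API. In `cuda.h`, 700 is `CUDA_ERROR_ILLEGAL_ADDRESS` and 701 is `CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES`. After an illegal address error the CUDA context is left in an inconsistent state, which is probably why later calls such as `cuStreamSynchronize` and `cuStreamDestroy` report 700 as well. The helper below is only an illustrative lookup table; the class name `CuResultNames` is hypothetical and is not part of TornadoVM.

```java
import java.util.Map;

/** Hypothetical helper: maps a few CUDA driver API CUresult codes to their symbolic names. */
final class CuResultNames {
    // Subset of the CUresult enum from cuda.h; only the codes seen in this thread are listed.
    private static final Map<Integer, String> NAMES = Map.of(
            0, "CUDA_SUCCESS",
            700, "CUDA_ERROR_ILLEGAL_ADDRESS",
            701, "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES");

    /** Returns the symbolic name, or a fallback string for codes not in the table. */
    static String describe(int code) {
        return NAMES.getOrDefault(code, "unknown CUresult " + code);
    }

    public static void main(String[] args) {
        System.out.println(describe(700));
        System.out.println(describe(701));
    }
}
```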
The launch script is:
source ~/work/TornadoVM/source.sh
tornado --debug -Xmx9G -XX:+PrintFlagsFinal -XX:+UseFMA -XX:+UseNUMA \
-XX:-UseZGC -XX:-UseG1GC -XX:+UseParallelGC -XX:-UseShenandoahGC \
-ea -XX:-UseCompressedOops \
-cp "$PWD/target/classes:$PWD/target/lib/*" org.bereft.greatexpenses.ReductionIrregular
The source is:
package org.bereft.greatexpenses;
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.annotations.Reduce;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Random;
import java.util.stream.IntStream;
class ConfigurationReduce {
public static final int MAX_ITERATIONS = 101;
}
class Stats {
public static double computeMedian(ArrayList<Long> input) {
Collections.sort(input);
int mid = input.size() / 2;
// Odd count: the median is the middle element.
if (input.size() % 2 == 1) {
return input.get(mid);
}
// Even count: the median is the mean of the two middle elements.
return (input.get(mid - 1) + input.get(mid)) / 2.0;
}
}
public class /*package uk.ac.manchester.tornado.examples.reductions;*/ ReductionIrregular {
private static void reducedoubles(double[] input, @Reduce double[] output) {
for (@Parallel int i = 0; i < input.length; i++) {
output[0] += input[i];
}
}
private void run(final int inputSize) {
double[] input = new double[inputSize];
double[] result = new double[]{0.0f};
Random r = new Random(101);
//@formatter:off
TaskSchedule task = new TaskSchedule("s0")
.streamIn(input)
.task("t0", ReductionIrregular::reducedoubles, input, result)
.streamOut(result);
//@formatter:on
ArrayList<Long> timers = new ArrayList<>();
for (int i = 0; i < ConfigurationReduce.MAX_ITERATIONS; i++) {
IntStream.range(0, inputSize).parallel().forEach(idx -> {
input[idx] = r.nextDouble();
});
double[] sequential = new double[1];
reducedoubles(input, sequential);
long start = System.nanoTime();
task.execute();
long end = System.nanoTime();
timers.add((end - start));
if (Math.abs(sequential[0] - result[0]) > 0.1f) {
System.out.println("Result is not correct - iteration: " + i + " expected: " + sequential[0] + " but found: " + result[0]);
} else {
System.out.println("Iteration: " + i + " is correct");
}
}
System.out.println("Median TotalTime: " + Stats.computeMedian(timers));
}
public static void main(String[] args) {
int inputSize = 2000;
if (args.length > 0) {
inputSize = Integer.parseInt(args[0]);
}
System.out.println("Size = " + inputSize);
new ReductionIrregular().run(inputSize);
}
}
This might be related to https://forums.developer.nvidia.com/t/cuda-error-in-executeinternal-700-an-illegal-memory-access-was-encountered/191948
Following up: I have a working OpenCL configuration, but my own code does not succeed on the same driver.
I don't yet have a firm grip on the contracts and cautions, but one thing I have learned is not to use short-circuit booleans or boolean arrays.
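To make the short-circuit remark concrete, here is a minimal plain-Java sketch of the same guard written both ways. The class and method names are hypothetical. Whether the short-circuit form really fails on a given TornadoVM backend is the poster's observation and is not confirmed elsewhere in this thread; the sketch only shows that the nested-if rewrite computes the same value.

```java
/** Hypothetical illustration: one guard written with && and with nested ifs. */
final class BranchStyles {
    // Short-circuit form: the second test runs only when the first one passes.
    static float accumulateShortCircuit(float acc, float d, float wt) {
        if (d != 0f && Float.isFinite(d)) {
            acc += d * wt;
        }
        return acc;
    }

    // Nested-if form: same semantics, with one condition per branch.
    static float accumulateNested(float acc, float d, float wt) {
        if (d != 0f) {
            if (Float.isFinite(d)) {
                acc += d * wt;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(accumulateShortCircuit(1f, 2f, 3f));
        System.out.println(accumulateNested(1f, 2f, 3f));
    }
}
```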
The code below does not work in the same OpenCL environment in which the example works:
(1621, 1.0)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f2d6f135936, pid=3270757, tid=3270758
#
# JRE version: OpenJDK Runtime Environment (17.0.1+12) (build 17.0.1+12-39)
# Java VM: OpenJDK 64-Bit Server VM (17.0.1+12-39, mixed mode, tiered, jvmci, parallel gc, linux-amd64)
# Problematic frame:
# C [libnvidia-opencl.so.1+0xdd936]
#
# Core dump will be written. Default location: Core dumps may be processed with "/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /vol/xfs01/work/mp/elsalvador/core.3270757)
#
# An error report file with more information is saved as:
# /vol/xfs01/work/mp/elsalvador/hs_err_pid3270757.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
^C^C^C/vol/xfs01/work/TornadoVM/bin/bin/tornado: line 324: 3270757 Aborted (core dumped) ${JAVA_CMD} ${JAVA_FLAGS} $@
The offending code:
import pairwise.idiom.neat.Sim;
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;
public class TornadoEval {
//enum ;
public static TaskSchedule schedule(int ub, int jeansSizes, byte[][] accums, float[][] impulses, double[] inputs, float[][] weights, Sim sim) {
// int first = sim.indirectOutputRange.getFirst();
int iLast = sim.inputRange.getLast();
int bFirst = sim.biasRange.getFirst();
int bLast = sim.biasRange.getLast();
int first = sim.indirectHiddenRange.getFirst();
int indirectHiddenRangeLast = sim.indirectHiddenRange.getLast();
int indirectOutputRangeFirst = sim.indirectOutputRange.getFirst();
int indirectOutputRangeLast = sim.indirectOutputRange.getLast();
TaskSchedule ts = new TaskSchedule("s0");
ts.streamIn( (Object) impulses );
ts.task("t0", TornadoEval::extracted, ub, jeansSizes, accums, impulses, inputs, weights, iLast, bFirst, bLast, first, indirectHiddenRangeLast, indirectOutputRangeFirst, indirectOutputRangeLast);
ts.streamOut((Object) impulses);
ts.execute();
ts.waitOn();
return ts;
}
static float[][] extracted(int ub, int jeansSizes, byte[][] accums, float[][] impulses, double[] inputs, float[][] weights, int iLast, int bFirst, int bLast, int hFirst, int hLast, int oFirst, int oLast) {
for (@Parallel int gx = 0; gx < ub; gx++) {
var d = 0f;
var wt = 0f;
for (int jx = 0; jx < jeansSizes; jx++) {
var coord = gx * jeansSizes + jx;
var link = 0;
int inx;
if (jx < oFirst) inx = 0;
else inx = bFirst;
float addEnd;
if (0 == accums[gx][jx])
addEnd = 0f;
else
addEnd = impulses[gx][jx];
if (jx < oFirst)
/*process hidden nodes */
while (inx < iLast) {
d = (float) inputs[inx];
if (d != 0f)
if (Double.isFinite(d)) {
wt = weights[coord][inx];
if (wt != 0f)
if (Double.isFinite(wt)) addEnd += d * wt;
}
inx++;
}
else {
/* output nodes skip inputs*/
inx = bFirst;
}
//proc all nodes
while (inx < bLast) {
/*apply bias to hidden+output*/
wt = weights[coord][inx];
if (Double.isFinite(wt)) addEnd = addEnd + wt; //FMA
inx++;
}
if (jx >= oFirst) {
/*outputNode Linear*/
while (inx < hLast) {
if (inx != jx) {
d = impulses[coord][inx - hFirst];
if (d != 0f)
if (Double.isFinite(d)) {
link = inx;
wt = weights[coord][link];
if (wt != 0f)
if (Double.isFinite(wt))
addEnd = addEnd + d * wt; //FMA
}
}
inx++;
}
impulses[gx][jx] = addEnd;
} else {
/*perform LRELU on hidden node */
if (addEnd < -0.01f) impulses[gx][jx] = 0.01f;
else impulses[gx][jx] =addEnd; }
}
}
return impulses;
}
}
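One detail in the kernel above may be worth checking: `impulses` is indexed as `impulses[gx][jx]` in most places, but as `impulses[coord][inx - hFirst]` in the output-node loop, where `coord = gx * jeansSizes + jx`. If `impulses` has only `ub` rows, `coord` can exceed that range. The thread does not establish whether this is the cause of the SIGSEGV. As a sketch, assuming rectangular arrays and a row-major layout, flattening to a one-dimensional array makes the index arithmetic explicit and easy to bounds-check on the host. The class `FlatIndex` and its methods are hypothetical.

```java
/** Hypothetical helper: row-major flattening of a rectangular float[][]. */
final class FlatIndex {
    /** Row-major offset of (row, col) in an array with the given number of columns. */
    static int index(int row, int col, int cols) {
        return row * cols + col;
    }

    /** True when (row, col) lies inside a rows x cols matrix. */
    static boolean inBounds(int row, int col, int rows, int cols) {
        return row >= 0 && row < rows && col >= 0 && col < cols;
    }

    /** Copies a rectangular matrix into a flat array; assumes every row has the same length. */
    static float[] flatten(float[][] m) {
        int rows = m.length;
        int cols = rows == 0 ? 0 : m[0].length;
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(m[r], 0, flat, r * cols, cols);
        }
        return flat;
    }

    public static void main(String[] args) {
        float[] flat = flatten(new float[][] {{1f, 2f, 3f}, {4f, 5f, 6f}});
        System.out.println(flat[index(1, 2, 3)]);
    }
}
```

With such a helper, a check like `inBounds(coord, inx - hFirst, rows, cols)` can be run on the host over the same loop bounds before the kernel is handed to a device.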
The example code working on OpenCL is shown below:
[...]
Iteration: 99 is correct
Task info: XXX__GENERATED_REDUCE0.reduce_seq0
Backend : OPENCL
Device : NVIDIA GeForce RTX 3060 Ti CL_DEVICE_TYPE_GPU (available)
Dims : 0
Global work offset: [0]
Global work size : [1]
Local work size : [1, 1, 1]
Number of workgroups : [1]
Iteration: 100 is correct
Median TotalTime: 454706.0
jim@gentoo /vol/xfs01/work/mp/unrelated $
Thanks for the report, @jnorthrup.
Regarding the error during the kernel launch for the PTX backend, we have just opened an issue (#195). We will work on it.
Regarding the OpenCL backend, I do not follow. Is it working on the GPU for your examples?
OpenCL works for the examples script on my GPU, without PTX built in.
The PTX backend has been fixed to launch with the correct parameters on the latest drivers. However, some reductions still report wrong results. We will provide a fix for this.
Meanwhile, the OpenCL backend should work for the same GPUs (30XX) and the latest NVIDIA drivers.
I finally got some time to look at the pending issues with the reductions. The thread-block was not set correctly. The following PR solves the issue: https://github.com/beehive-lab/TornadoVM/pull/210. This will be merged soon.
Thanks for all the reports.
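For context on why a wrong thread-block setting can produce a wrong reduction result, here is a plain-Java simulation of a two-stage reduction: each block sums its own chunk into a partial result, and the partials are then combined. This is a generic illustration, not the code TornadoVM generates; the class `BlockedReduction` and the parameter `blockSize` are hypothetical. If the block size used to size and combine the partials does not match the one the kernel actually ran with, elements are skipped or counted twice.

```java
/** Hypothetical simulation of a block-wise (two-stage) sum reduction. */
final class BlockedReduction {
    /** Stage 1: each block sums its own contiguous chunk of the input. */
    static double[] partialSums(double[] input, int blockSize) {
        int blocks = (input.length + blockSize - 1) / blockSize; // ceiling division
        double[] partial = new double[blocks];
        for (int b = 0; b < blocks; b++) {
            int start = b * blockSize;
            int end = Math.min(start + blockSize, input.length);
            double sum = 0.0;
            for (int i = start; i < end; i++) {
                sum += input[i];
            }
            partial[b] = sum;
        }
        return partial;
    }

    /** Stage 2: combine the per-block partial sums into the final result. */
    static double reduce(double[] input, int blockSize) {
        double total = 0.0;
        for (double p : partialSums(input, blockSize)) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] ones = new double[2000];
        java.util.Arrays.fill(ones, 1.0);
        System.out.println(reduce(ones, 256));
    }
}
```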