
dev branch - PTX error 701 and 700 on Irregulars examples

Open jnorthrup opened this issue 3 years ago • 4 comments

carried over from https://github.com/beehive-lab/TornadoVM/discussions/120#discussioncomment-3137390

I am running the Irregulars example and, as in the discussion linked above, the result code comes up as 701.

When I change the source code with s/float/double/g and rebuild, the reported error changes to 700.

This is also from a fresh reboot, just to be sure.

WARNING: Using incubator modules: jdk.incubator.vector, jdk.incubator.foreign
Size = 2000
[TornadoVM-PTX-JNI] ERROR : cuModuleLoadData -> Returned: 700
PTX to cubin JIT compilation failed! (700)
PTX JIT compilation failed!
Unable to compile task task XXX__GENERATED_REDUCE0.reduce_seq0 - rAdd
[[email protected]/uk.ac.manchester.tornado.drivers.ptx.runtime.PTXTornadoDevice.compileTask(PTXTornadoDevice.java:192), [email protected]/uk.ac.manchester.tornado.drivers.ptx.runtime.PTXTornadoDevice.installCode(PTXTornadoDevice.java:145), [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.compileTaskFromBytecodeToBinary(TornadoVM.java:477), [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:741), [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:221), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:720), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1049), [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300), [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.executeExpression(ReduceTaskSchedule.java:592), [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.scheduleWithReduction(ReduceTaskSchedule.java:577), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.rewriteTaskForReduceSkeleton(TornadoTaskSchedule.java:992), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.reduceAnalysis(TornadoTaskSchedule.java:1002), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.analyzeSkeletonAndRun(TornadoTaskSchedule.java:1012), [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1042), [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300), org.bereft.greatexpenses.ReductionIrregular.run(ReductionIrregular.java:60), org.bereft.greatexpenses.ReductionIrregular.main(ReductionIrregular.java:81)]
        [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.compileTaskFromBytecodeToBinary(TornadoVM.java:481)
        [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:741)
        [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:221)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:720)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1049)
        [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.executeExpression(ReduceTaskSchedule.java:592)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.scheduleWithReduction(ReduceTaskSchedule.java:577)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.rewriteTaskForReduceSkeleton(TornadoTaskSchedule.java:992)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.reduceAnalysis(TornadoTaskSchedule.java:1002)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.analyzeSkeletonAndRun(TornadoTaskSchedule.java:1012)
        [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1042)
        [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
        org.bereft.greatexpenses.ReductionIrregular.run(ReductionIrregular.java:60)
        org.bereft.greatexpenses.ReductionIrregular.main(ReductionIrregular.java:81)
[TornadoVM-PTX-JNI] ERROR : cuStreamSynchronize -> Returned: 700
Result is not correct - iteration: 0 expected: 1011.7773048769373 but found: 1503.754977668702
Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] TornadoVM Bytecode not recognized
        at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.throwError(TornadoVM.java:650)
        at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:769)
        at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:221)
        at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:720)
        at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1049)
        at [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
        at [email protected]/uk.ac.manchester.tornado.runtime.tasks.ReduceTaskSchedule.executeExpression(ReduceTaskSchedule.java:592)
        at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.runReduceTaskSchedule(TornadoTaskSchedule.java:987)
        at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.analyzeSkeletonAndRun(TornadoTaskSchedule.java:1014)
        at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:1042)
        at [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:300)
        at org.bereft.greatexpenses.ReductionIrregular.run(ReductionIrregular.java:60)
        at org.bereft.greatexpenses.ReductionIrregular.main(ReductionIrregular.java:81)
[TornadoVM-PTX-JNI] ERROR : cuStreamDestroy -> Returned: 700
        [JNI] /vol/xfs01/work/TornadoVM/drivers/ptx-jni/target/linux-amd64-release/sources/source/PTXStream.cpp:188 in function: free_staging_area_list result = 700

script is

source ~/work/TornadoVM/source.sh
 
tornado --debug -Xmx9G -XX:+PrintFlagsFinal -XX:+UseFMA -XX:+UseNUMA    \
        -XX:-UseZGC -XX:-UseG1GC -XX:+UseParallelGC -XX:-UseShenandoahGC \
        -ea -XX:-UseCompressedOops      \
-cp "$PWD/target/classes:$PWD/target/lib/*" org.bereft.greatexpenses.ReductionIrregular

source is

package org.bereft.greatexpenses;

import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.annotations.Reduce;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Random;
import java.util.stream.IntStream;

class ConfigurationReduce {

    public static final int MAX_ITERATIONS = 101;
}

class Stats {

    public static double computeMedian(ArrayList<Long> input) {
        Collections.sort(input);
        int mid = input.size() / 2;
        // Odd count: the median is the single middle element.
        if (input.size() % 2 == 1) {
            return input.get(mid);
        }
        // Even count: the median is the mean of the two middle elements.
        return (input.get(mid) + input.get(mid - 1)) / 2.0;
    }
}

public class /*package uk.ac.manchester.tornado.examples.reductions;*/ ReductionIrregular {

    private static void reducedoubles(double[] input, @Reduce double[] output) {
        for (@Parallel int i = 0; i < input.length; i++) {
            output[0] += input[i];
        }
    }

    private void run(final int inputSize) {

        double[] input = new double[inputSize];
        double[] result = new double[]{0.0f};
        Random r = new Random(101);

        //@formatter:off
        TaskSchedule task = new TaskSchedule("s0")
                .streamIn(input)
                .task("t0", ReductionIrregular::reducedoubles, input, result)
                .streamOut(result);
        //@formatter:on

        ArrayList<Long> timers = new ArrayList<>();
        for (int i = 0; i < ConfigurationReduce.MAX_ITERATIONS; i++) {

            IntStream.range(0, inputSize).parallel().forEach(idx -> {
                input[idx] = r.nextDouble();
            });
            double[] sequential = new double[1];
            reducedoubles(input, sequential);

            long start = System.nanoTime();
            task.execute();
            long end = System.nanoTime();
            timers.add((end - start));

            if (Math.abs(sequential[0] - result[0]) > 0.1f) {
                System.out.println("Result is not correct - iteration: " + i + " expected: " + sequential[0] + " but found: " + result[0]);
            } else {
                System.out.println("Iteration: " + i + " is correct");
            }
        }

        System.out.println("Median TotalTime: " + Stats.computeMedian(timers));

    }

    public static void main(String[] args) {
        int inputSize = 2000;
        if (args.length > 0) {
            inputSize = Integer.parseInt(args[0]);
        }
        System.out.println("Size = " + inputSize);
        new ReductionIrregular().run(inputSize);
    }
}

might be related to https://forums.developer.nvidia.com/t/cuda-error-in-executeinternal-700-an-illegal-memory-access-was-encountered/191948
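For reference, a minimal sketch of what the two codes mean, assuming they are `CUresult` values from the CUDA driver API (the log prints them as returned by `cuModuleLoadData` and `cuStreamSynchronize`). The class `CudaResultNames` is a hypothetical helper for illustration and is not part of TornadoVM.

```java
import java.util.Map;

// Hypothetical helper, not part of TornadoVM: maps the two result codes
// seen in this report to their names in the CUDA driver API CUresult enum.
final class CudaResultNames {

    private static final Map<Integer, String> NAMES = Map.of(
            700, "CUDA_ERROR_ILLEGAL_ADDRESS",
            701, "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES");

    // Returns the symbolic name, or a fallback string for codes not listed here.
    static String describe(int code) {
        return NAMES.getOrDefault(code, "unknown CUresult " + code);
    }

    public static void main(String[] args) {
        System.out.println("700 -> " + describe(700));
        System.out.println("701 -> " + describe(701));
    }
}
```

Read this way, 701 points at a launch that asks for more resources than the device can provide, and 700 at an illegal memory access, which matches the NVIDIA forum thread linked above.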

jnorthrup avatar Jul 13 '22 14:07 jnorthrup

Following up with a working OpenCL configuration, but with local code that does not succeed on the same driver.

I don't yet have a firm grip on the contracts and cautions, but one thing I have learned is not to use short-circuit booleans or boolean arrays.
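To make the short-circuit point concrete, here is a small sketch in plain Java with no TornadoVM dependency. It shows a guard written with `&&` next to the same guard written as nested `if` statements, one condition each. The class and method names are made up for illustration, and whether TornadoVM actually rejects the first form is only my observation so far, not something verified here.

```java
// Illustrative only: plain Java, names are hypothetical.
final class NestedGuards {

    // Guard written with short-circuit operators.
    static float accumulateShortCircuit(float acc, float d, float wt) {
        if (d != 0f && Float.isFinite(d) && wt != 0f && Float.isFinite(wt)) {
            acc += d * wt;
        }
        return acc;
    }

    // Same guard with one condition per if statement.
    static float accumulateNested(float acc, float d, float wt) {
        if (d != 0f)
            if (Float.isFinite(d))
                if (wt != 0f)
                    if (Float.isFinite(wt))
                        acc += d * wt;
        return acc;
    }

    public static void main(String[] args) {
        float[] samples = {0f, 1.5f, -2f, Float.NaN, Float.POSITIVE_INFINITY};
        for (float d : samples) {
            for (float wt : samples) {
                float a = accumulateShortCircuit(1f, d, wt);
                float b = accumulateNested(1f, d, wt);
                System.out.println(d + " * " + wt + " -> " + a + " / " + b);
            }
        }
    }
}
```

Both methods return the same value for every input, so the rewrite changes only the shape of the control flow.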

This code does not work in the same OpenCL environment in which the example works:

(1621, 1.0)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2d6f135936, pid=3270757, tid=3270758
#
# JRE version: OpenJDK Runtime Environment (17.0.1+12) (build 17.0.1+12-39)
# Java VM: OpenJDK 64-Bit Server VM (17.0.1+12-39, mixed mode, tiered, jvmci, parallel gc, linux-amd64)
# Problematic frame:
# C  [libnvidia-opencl.so.1+0xdd936]
#
# Core dump will be written. Default location: Core dumps may be processed with "/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /vol/xfs01/work/mp/elsalvador/core.3270757)
#
# An error report file with more information is saved as:
# /vol/xfs01/work/mp/elsalvador/hs_err_pid3270757.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
^C^C^C/vol/xfs01/work/TornadoVM/bin/bin/tornado: line 324: 3270757 Aborted                 (core dumped) ${JAVA_CMD} ${JAVA_FLAGS} $@

the offending code

import pairwise.idiom.neat.Sim;
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class TornadoEval {
  //enum  ;

    public static TaskSchedule schedule(int ub, int jeansSizes, byte[][] accums, float[][] impulses, double[] inputs, float[][] weights, Sim sim) {
//        int first = sim.indirectOutputRange.getFirst();
        int iLast = sim.inputRange.getLast();
        int bFirst = sim.biasRange.getFirst();
        int bLast = sim.biasRange.getLast();
        int first = sim.indirectHiddenRange.getFirst();
        int indirectHiddenRangeLast = sim.indirectHiddenRange.getLast();
        int indirectOutputRangeFirst = sim.indirectOutputRange.getFirst();
        int indirectOutputRangeLast = sim.indirectOutputRange.getLast();
        TaskSchedule ts = new TaskSchedule("s0");
        ts.streamIn( (Object) impulses );
        ts.task("t0", TornadoEval::extracted, ub, jeansSizes, accums, impulses, inputs, weights, iLast, bFirst, bLast, first, indirectHiddenRangeLast, indirectOutputRangeFirst, indirectOutputRangeLast);
        ts.streamOut((Object) impulses);
        ts.execute();
        ts.waitOn();
        return ts;
    }

    static float[][] extracted(int ub, int jeansSizes, byte[][] accums, float[][] impulses, double[] inputs, float[][] weights, int iLast, int bFirst, int bLast, int hFirst, int hLast, int oFirst, int oLast) {
        for (@Parallel int gx = 0; gx < ub; gx++) {
            var d = 0f;
            var wt = 0f;

            for (int jx = 0; jx < jeansSizes; jx++) {
                var coord = gx * jeansSizes + jx;
                var link = 0;
                int inx;
                if (jx < oFirst) inx = 0;
                else inx = bFirst;

                float addEnd;
                if (0 == accums[gx][jx])
                    addEnd = 0f;
                else
                    addEnd = impulses[gx][jx];



                if (jx < oFirst)
                /*process hidden nodes */
                    while (inx < iLast) {
                        d = (float) inputs[inx];
                        if (d != 0f)
                            if (Double.isFinite(d)) {
                                wt = weights[coord][inx];
                                if (wt != 0f)
                                    if (Double.isFinite(wt)) addEnd += d * wt;
                            }
                        inx++;
                    }
                else {
                    /* output nodes skip inputs*/
                    inx = bFirst;
                }

                //proc all nodes

                while (inx < bLast) {
                    /*apply bias to hidden+output*/
                    wt = weights[coord][inx];
                    if (Double.isFinite(wt)) addEnd = addEnd + wt; //FMA
                    inx++;
                }

                if (jx >= oFirst) {
                    /*outputNode Linear*/
                    while (inx < hLast) {
                        if (inx != jx) {
                            d = impulses[coord][inx - hFirst];
                            if (d != 0f)
                                if (Double.isFinite(d)) {
                                    link = inx;
                                    wt = weights[coord][link];
                                    if (wt != 0f)
                                        if (Double.isFinite(wt))
                                            addEnd = addEnd + d * wt;  //FMA
                                }
                        }
                        inx++;
                    }
                    impulses[gx][jx] = addEnd;
                } else {
                    /*perform LRELU on hidden node */
                    if (addEnd < -0.01f) impulses[gx][jx] = 0.01f;
                    else impulses[gx][jx] = addEnd;
                }
            }
        }
        return impulses;
    }
}

example code working on opencl below

[...]

Iteration: 99 is correct
Task info: XXX__GENERATED_REDUCE0.reduce_seq0
        Backend           : OPENCL
        Device            : NVIDIA GeForce RTX 3060 Ti CL_DEVICE_TYPE_GPU (available)
        Dims              : 0
        Global work offset: [0]
        Global work size  : [1]
        Local  work size  : [1, 1, 1]
        Number of workgroups  : [1]

Iteration: 100 is correct
Median TotalTime: 454706.0
jim@gentoo /vol/xfs01/work/mp/unrelated $ 

jnorthrup avatar Jul 15 '22 17:07 jnorthrup

Thanks for the report @jnorthrup .

Regarding the error during the kernel launch for the PTX backend, we have just opened an issue (#195). We will work on it.

Regarding the OpenCL backend, I do not follow. Is it working for the GPU and your examples?

jjfumero avatar Jul 20 '22 10:07 jjfumero

OpenCL works for the examples script on my GPU without PTX built in.

jnorthrup avatar Jul 20 '22 16:07 jnorthrup

The PTX Backend has been fixed to launch the correct parameters with the latest drivers. However, some reductions still report wrong results. We will provide a fix for this.

Meanwhile, the OpenCL backend should work for the same GPUs (30XX) and the latest NVIDIA drivers.

jjfumero avatar Jul 26 '22 14:07 jjfumero

I finally got some time to look at the pending issues with the reductions. The thread-block was not set correctly. The following PR solves the issue: https://github.com/beehive-lab/TornadoVM/pull/210 This will be merged soon.
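As a rough illustration of why the thread-block size matters for a reduction, the sketch below simulates a block-wise sum in plain Java: each block reduces its own slice to one partial value, and the partials are then combined. This is not the TornadoVM implementation or the change made in the PR. The class `BlockReduceSketch`, its methods, and the block size of 256 are assumptions chosen only for the example.

```java
// Illustrative simulation only: names and block size are hypothetical.
final class BlockReduceSketch {

    // Each "block" sums its own slice, standing in for one GPU thread-block
    // that produces a single partial result.
    static double[] partialSums(double[] input, int blockSize) {
        int blocks = (input.length + blockSize - 1) / blockSize; // round up
        double[] partials = new double[blocks];
        for (int b = 0; b < blocks; b++) {
            int start = b * blockSize;
            int end = Math.min(start + blockSize, input.length);
            for (int i = start; i < end; i++) {
                partials[b] += input[i];
            }
        }
        return partials;
    }

    // Final step: combine the per-block partial results.
    static double combine(double[] partials) {
        double total = 0.0;
        for (double p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] input = new double[2000];
        java.util.Arrays.fill(input, 1.0);
        double[] partials = partialSums(input, 256);
        System.out.println("blocks = " + partials.length);
        System.out.println("sum    = " + combine(partials));
    }
}
```

In this simulation the result is only correct when the number of blocks and the block size agree with the input length. With 2000 elements and blocks of 256, rounding the block count down instead of up would skip the last 208 elements.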

Thanks for all the reports.

jjfumero avatar Oct 12 '22 07:10 jjfumero