Internal Error at .../anderson2021/SearchSpace.cpp:486 ... Condition failed: !parallel_tilings.empty(): zero parallel tilings
repro.py:

```python
import halide as hl

@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    in_ptr1 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr0 = hl.OutputBuffer(hl.Float(32), 1)

    def generate(g):
        in_ptr0 = g.in_ptr0
        in_ptr1 = g.in_ptr1
        out_ptr0 = g.out_ptr0
        tmp0 = in_ptr0[0]
        tmp1 = in_ptr1[0]
        tmp2 = tmp0 + tmp1
        out_ptr0[hl.Var()] = tmp2
        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 1)])
        in_ptr1.set_estimates([hl.Range(0, 1)])
        out_ptr0.set_estimates([hl.Range(0, 1)])

if __name__ == "__main__":
    import sys, tempfile

    with tempfile.TemporaryDirectory() as out:
        sys.argv = ['repro.py', '-g', 'kernel', '-o', out, '-f', 'halide_kernel',
                    '-e', 'static_library,h,schedule',
                    '-p', '/home/jansel/conda/envs/pytorch/lib/libautoschedule_anderson2021.so',
                    'target=host-cuda-cuda_capability_86-strict_float-no_asserts',
                    'autoscheduler=Anderson2021']
        hl.main()
```
Note: you will need to update the path to libautoschedule_anderson2021.so for your system.
Output:

```
Unhandled exception: Internal Error at /home/jansel/Halide/src/autoschedulers/anderson2021/SearchSpace.cpp:486 triggered by user code at : Condition failed: !parallel_tilings.empty(): zero parallel tilings
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 32, in <module>
    hl.main()
RuntimeError: Generator failed: -1
```
This example is just adding two 1-element tensors.
Possible workarounds:
- Switch to the Li2018 autoscheduler, which seems to work on this example. Any recommendations from the Halide folks here? I don't know much about the different schedulers.
- Increase the estimate in `out_ptr0.set_estimates` from 1 to 2 (even though the real tensor is size 1). For some of the other schedulers (on CPU), I have gotten out-of-bounds access errors when I made the estimates larger than the actual size. Is doing this safe?
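For reference, the second workaround is a one-line change to the estimates in `generate()` above (I have not verified whether over-estimating is safe in general):

```python
# Hypothetical tweak: pad the output estimate to 2 even though the real
# tensor has 1 element. Unverified whether this is safe for all schedulers.
out_ptr0.set_estimates([hl.Range(0, 2)])
```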
So this pipeline is a single scalar add operation?
I don't think any of us expected anyone to try to autoschedule a pipeline that does O(1) work. I think the appropriate schedule is gpu_single_thread(), but nobody taught the autoscheduler how to use that.
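For anyone who wants to write that schedule by hand, a sketch of what it might look like inside `generate()` (an assumption on my part: the Python bindings mirror the C++ `Func::gpu_single_thread()` schedule call, and this needs a CUDA-capable target; I have not run this):

```python
# Sketch only, not verified: skip the autoscheduler and launch a single
# GPU thread for the whole (scalar) output.
out_ptr0.gpu_single_thread()
```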
Yeah, correct: it should be pretty trivial to schedule, but it is a corner case the autoscheduler doesn't handle. This came up in a unit test, but real models do occasionally contain scalar operations (for example, a learning rate update).
#8256 has a more complicated example (a reduction to a single element) with similar errors. Reductions to a single element often happen in things like layernorm or softmax. Those are harder to schedule, since you have very little parallelism at the very end. You either need atomics, syncs, or multiple kernels.
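To illustrate the "multiple kernels" strategy, here is a plain-Python sketch of a two-stage reduction (no Halide; all names are my own): stage 1 keeps plenty of parallelism by summing independent blocks, while stage 2 only has the handful of partial sums left to combine, which is where the parallelism dries up.

```python
# Illustrative two-stage reduction, mimicking the "multiple kernels"
# GPU strategy in plain Python.
def block_sums(data, block):
    # Stage 1: each "thread block" sums its own slice independently
    # (embarrassingly parallel across blocks on a real GPU).
    return [sum(data[i:i + block]) for i in range(0, len(data), block)]

def two_stage_sum(data, block=4):
    partials = block_sums(data, block)  # parallel-friendly first kernel
    return sum(partials)                # tiny second kernel: little parallelism left

# Example: 16 elements reduce to 4 partial sums, then to 1 scalar.
print(two_stage_sum(list(range(16))))  # → 120
```

A real GPU version would replace the second stage with another kernel launch, atomics, or a block-wide sync, which is exactly why single-element reductions are awkward to schedule automatically.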