Runs forever on Hexagon Simulator if using .parallel()
I am testing the Hexagon Halide demos provided with the SDK to compare int vs. float execution time, with .vectorize() both enabled and disabled in the schedule.
I am testing with the conv3x3a32 example, whose schedule first tiles the output. I found that if I replace .vectorize(xi) with .parallel(xi), the benchmark runs forever and does not stop. Likewise, if I fuse xo and yo into a fused variable and then .parallel(fused), as in the Halide tutorials, it does not stop either. It was still running after several hours, so I conclude it must be stuck.
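For reference, here is roughly what I mean (a sketch from memory of the example's schedule; the variable names and tile factors are illustrative, not exact):

Variant 1: replace .vectorize(xi) with .parallel(xi)
Func(output)
    .tile(x, y, xo, yo, xi, yi, vector_size, 4)
    .parallel(xi);

Variant 2: fuse the outer vars and parallelize the fused loop, as in the tutorials
Var fused("fused");
Func(output)
    .tile(x, y, xo, yo, xi, yi, vector_size, 4)
    .vectorize(xi)
    .fuse(xo, yo, fused)
    .parallel(fused);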
Is it expected behavior that computation gets stuck if .parallel() is used?
Programming Hexagon through the C++ SDK still allows float operations outside of HVX. Using floats on Hexagon without HVX works fine and finishes within a minute. While this is slower than it would be with HVX, it certainly does not take hours. Is there a way to match this performance on floats through Halide? The input is a reasonable size at 1920x1080.
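For concreteness, this is roughly the kind of unvectorized float pipeline I have in mind (a sketch, not the actual SDK example; boundary handling omitted):

ImageParam in(Float(32), 2, "in");
Var x("x"), y("y");
Func blur("blur");
// Float arithmetic; with no .vectorize() in the schedule, this should compile
// to scalar Hexagon float instructions rather than HVX.
blur(x, y) = (in(x, y - 1) + in(x, y) + in(x, y + 1)) / 3.0f;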
Never mind, I found out parallelism on the simulator was broken 2.5 years ago (#2108). Seems like it's still broken :(
We should update the README to warn people about this.
I don't think .parallel should hang on the simulator. #2108 is about the fact that we don't actually simulate parallelism, but the program should still be functionally correct and run (just without any speedup from parallelism).
I think this is a new issue.
Answers to some of the other questions in the original post:
- on Hexagon, you should typically try to vectorize the innermost loop and parallelize the outermost (see the sketch after this list)
- the simulator should correctly run a schedule that contains parallel, just single-threaded (it should not hang)
- when using the simulator, you should use smaller inputs than you would use on device due to the overhead of the simulator
- the performance of Hexagon scalar instructions cannot be made to match that of HVX vector instructions
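To illustrate the first point, a schedule of this shape is what I would expect to work well (a sketch; the variable names and tile factors are placeholders, not the exact example schedule):

Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)    // innermost loop maps onto HVX vector lanes
    .parallel(y);     // outermost loop becomes one parallel task per row of tiles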
You might try running your test with a much smaller input (e.g. 128x128 or 128x16) just to see if it is truly hanging, or just taking a really long time (due to the parallelized inner loop).
Thanks Dan, I will try that and report back.
Update 4/13/2020: I tried it, and it was indeed just the parallelized inner loop taking a long time. I parallelized the outer loop instead (see comment below) and it finished running. However, I'm confused about why the parallelization didn't make use of Hexagon's 4 hardware threads.
@dsharletg @pranavb-ca @dpalermo
I went and learned more about scheduling in Halide and more about Hexagon. The following schedule is for the conv3x3a16 Halide example in Hexagon SDK 3.5.1.
I want to use all 4 available hardware threads on the Hexagon DSP.
So I tried this:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .parallel(yi);
where vector_size is set to 128 for HVX 128.
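As an aside, I believe vector_size could also be derived from the target instead of being hard-coded (this is my assumption, not how the SDK example does it):

Target target = get_target_from_environment();                 // or get_target() inside a generator
const int vector_size = target.natural_vector_size<uint8_t>(); // 128 lanes for HVX 128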
My understanding is: I'm making 128 x 4 tiles. I'm vectorizing xi to take advantage of HVX. Then I parallelize yi to take advantage of all 4 hardware threads on Hexagon.
However, the simulator is reporting that only one hardware thread is used, not 4, even though I used .parallel(yi).
T0: Insns=18185954 Packets=9686990
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=18185954 Pcycles=19838300
In addition, the parallel strategy came out to run much slower than the original unroll strategy in the example:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .unroll(yi);
The unroll strategy came to 0.1579 cycles/pixel, whereas the parallel strategy came to 2.8886 cycles/pixel.
Can someone please help me understand:
- why .parallel() did not use all 4 hardware threads
- why .parallel() is slower than .unroll()
Thanks!
I think it might be better to parallelize over y and not yi, something like this:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .parallel(y);
In your schedule, the loops are arranged like this:
for y:
    for x:
        for yi:        <- parallelized
            for xi:    <- vectorized
which is probably too fine-grained to parallelize efficiently because of the per-task overhead.
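With .parallel(y) instead, the loops would look roughly like this:
for y:            <- parallelized (one task per row of tiles)
    for x:
        for yi:
            for xi:    <- vectorized
so each parallel task processes a whole row of tiles rather than just the four rows of a single tile.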
I think you can get an even better schedule by doing both parallel() and unroll(), like this:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .unroll(yi)
    .parallel(y);
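which should give a loop nest roughly like this:
for y:            <- parallelized
    for x:
        for yi:        <- unrolled
            for xi:    <- vectorized
so you keep the straight-line inner code from unrolling while still getting coarse-grained parallelism across rows of tiles.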
@vksnk thanks for your comment!
I tried your suggested schedule, using .unroll(yi) and .parallel(y). The Hexagon simulator reports 0.2091 cycles/pixel, which is slower than the 0.1579 cycles/pixel of .unroll(yi) alone.
Once again, the simulator reports that only 1 hardware thread is being used.
T0: Insns=3233950 Packets=1391755
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=3233950 Pcycles=3073589
This is my main confusion. Why is .parallel() not activating all 4 hardware threads on the Hexagon?
The Hexagon simulator uses a "fake" thread pool that doesn't actually use threads due to undiagnosed issues (#2108).
However, when running on a real Hexagon device, .parallel() uses a real thread pool and you should see increased performance.