
Runs forever on Hexagon Simulator if using .parallel()

Open andrewjong opened this issue 5 years ago • 9 comments

I am testing the provided Hexagon Halide demos to compare int vs. float execution time, with .vectorize() enabled and disabled in the schedule respectively.

I am testing with the conv3x3a32 example. The schedule first tiles. I found that if I replace .vectorize(xi) with .parallel(xi) in the schedule, the benchmark runs forever and does not stop. Likewise, if I fuse xo and yo into a fused variable and then .parallel(fused), as in the Halide tutorials, it does not stop either. It was still running after several hours, so I conclude it must be stuck.
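For reference, the fused variant I tried looks roughly like this (a sketch, not the exact example code; `xo`, `yo`, and `fused` are Vars I introduce here, using the 8-argument form of .tile() that names the outer loop variables):

```cpp
Var xo, yo, fused;
Func(output)
  .tile(x, y, xo, yo, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .fuse(xo, yo, fused)
  .parallel(fused);
```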

Is it expected behavior that computation gets stuck if .parallel() is used?

Programming Hexagon through the C++ SDK still allows float operations outside of HVX. Using floats through Hexagon without HVX computes just fine and finishes within a minute. While this is slower than it would be with HVX, it certainly does not take hours. Is there a way to match this performance on floats through Halide? The input is of reasonable size at 1920x1080.

andrewjong avatar Mar 09 '20 23:03 andrewjong

Nevermind, I found out parallelism on the simulator was broken 2.5 years ago. #2108. Seems like it's still broken :(

andrewjong avatar Mar 09 '20 23:03 andrewjong

We should update the README to warn people about this.

steven-johnson avatar Mar 09 '20 23:03 steven-johnson

I don't think .parallel should hang on the simulator. #2108 is about the fact that we don't actually simulate parallelism, but the program should still be functionally correct and run (just without any speedup from parallelism).

I think this is a new issue.

dsharletg avatar Mar 09 '20 23:03 dsharletg

Answers to some of the other questions in the original post:

  • on Hexagon, you should typically try to vectorize the innermost loop and parallelize the outermost
  • simulator should correctly run a schedule that contains parallel, just single-threaded (should not hang)
  • when using the simulator, you should use smaller inputs than you would use on device due to the overhead of the simulator
  • execution of Hexagon scalar instructions cannot be made to match that of HVX vector instructions

You might try running your test with a much smaller input (e.g. 128x128 or 128x16) just to see if it is truly hanging, or just taking a really long time (due to the parallelized inner loop).

dpalermo avatar Mar 10 '20 15:03 dpalermo

Thanks Dan, I will try that and report back.

Update 4/13/2020: I tried it, and it turned out the parallelized inner loop was just taking a long time. I parallelized the outer loop instead (see comment below) and it finished running. However, I'm confused why the parallelization didn't make use of Hexagon's 4 hardware threads.

andrewjong avatar Mar 10 '20 18:03 andrewjong

@dsharletg @pranavb-ca @dpalermo

I went and learned more about scheduling in Halide and more about Hexagon. The following schedule is for the conv3x3a16 Halide example in the Hexagon SDK 3.5.1.

I want to use all 4 available hardware threads on the Hexagon DSP.

So I tried this

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .parallel(yi);

where vector_size is set to 128 for HVX 128.

My understanding is: I'm making 128 x 4 sized tiles. I'm vectorizing xi to take advantage of HVX. Then I parallel yi to take advantage of all 4 hardware threads on Hexagon.

However, the simulator is reporting that only one hardware thread is used, not 4, even though I used .parallel(yi).

T0: Insns=18185954 Packets=9686990
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=18185954 Pcycles=19838300

In addition, the parallel strategy ran much slower than the original unroll strategy in the example:

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .unroll(yi);

The unroll strategy came to 0.1579 cycles/pixel, whereas the parallel strategy came to 2.8886 cycles/pixel.

Can someone please help me understand:

  1. why .parallel() did not use all 4 hardware threads
  2. why .parallel() is slower than .unroll()

Thanks!

andrewjong avatar Apr 14 '20 01:04 andrewjong

I think it might be better to parallelize over y and not yi, something like this:

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .parallel(y);

In your schedule, the loops are arranged like this:

for y:
  for x:
    for yi:   <-- parallelized
      for xi: <-- vectorized

which is probably too fine-grained to parallelize efficiently, because of the overhead of launching such small tasks.
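For comparison, with .parallel(y) as suggested above, the loops become:

for y:    <-- parallelized
  for x:
    for yi:
      for xi: <-- vectorized

so each parallel task processes an entire row of tiles rather than a single 128x1 strip.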

I think you can get an even better schedule by doing both parallel() and unroll(), like this:

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .unroll(yi)
  .parallel(y);

vksnk avatar Apr 14 '20 01:04 vksnk

@vksnk thanks for your comment! I tried your suggested schedule with unroll(yi) and .parallel(y). The Hexagon simulator reports 0.2091 cycles/pixel, which is slower than the 0.1579 cycles/pixel of unroll(yi) alone.

Once again, the simulator reports that only 1 hardware thread is being used.

T0: Insns=3233950 Packets=1391755
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=3233950 Pcycles=3073589

This is my main confusion. Why is .parallel() not activating all 4 hardware threads on the Hexagon?

andrewjong avatar Apr 14 '20 01:04 andrewjong

The Hexagon simulator uses a "fake" thread pool that doesn't actually use threads due to undiagnosed issues (#2108).

However, when running on a real Hexagon device, parallel uses a real thread pool and you should see increased performance.

dsharletg avatar Apr 14 '20 04:04 dsharletg