Runs forever on Hexagon Simulator if using .parallel()
I am testing the Hexagon Halide demos provided with the SDK to compare int vs. float execution time, with .vectorize() both enabled and disabled in the schedule.
I am testing with the conv3x3a32 example, whose schedule first tiles the output. I found that if I replace .vectorize(xi) with .parallel(xi), the benchmark runs forever and does not stop. Likewise, if I fuse xo and yo into a fused variable and then .parallel(fused), as in the Halide tutorials, it does not stop either. It was still running after several hours, so I conclude it must be stuck.
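For reference, here is roughly what I mean (a sketch from memory of the example's schedule; the variable names and tile factors are illustrative, not exact):

Variant 1: replace .vectorize(xi) with .parallel(xi)
Func(output)
    .tile(x, y, xo, yo, xi, yi, vector_size, 4)
    .parallel(xi);

Variant 2: fuse the outer vars and parallelize the fused loop, as in the tutorials
Var fused("fused");
Func(output)
    .tile(x, y, xo, yo, xi, yi, vector_size, 4)
    .vectorize(xi)
    .fuse(xo, yo, fused)
    .parallel(fused);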
Is it expected behavior that computation gets stuck if .parallel() is used?
Programming Hexagon through the C++ SDK still allows float operations outside of HVX. Using floats on Hexagon without HVX works fine and finishes within a minute. While this is slower than it would be with HVX, it certainly does not take hours. Is there a way to match this performance on floats through Halide? The input is a reasonable size at 1920x1080.
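For concreteness, this is roughly the kind of unvectorized float pipeline I have in mind (a sketch, not the actual SDK example; boundary handling omitted):

ImageParam in(Float(32), 2, "in");
Var x("x"), y("y");
Func blur("blur");
// Float arithmetic; with no .vectorize() in the schedule, this should compile
// to scalar Hexagon float instructions rather than HVX.
blur(x, y) = (in(x, y - 1) + in(x, y) + in(x, y + 1)) / 3.0f;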
Never mind, I found out parallelism on the simulator was broken 2.5 years ago (#2108). Seems like it's still broken :(
We should update the README to warn people about this.
I don't think .parallel should hang on the simulator. #2108 is about the fact that we don't actually simulate parallelism, but the program should still be functionally correct and run (just without any speedup from parallelism).
I think this is a new issue.
Answers to some of the other questions in the original post:
- on Hexagon, you should typically try to vectorize the innermost loop and parallelize the outermost (see the sketch after this list)
- the simulator should correctly run a schedule that contains parallel, just single-threaded (it should not hang)
- when using the simulator, you should use smaller inputs than you would use on device due to the overhead of the simulator
- the performance of Hexagon scalar instructions cannot be made to match that of HVX vector instructions
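To illustrate the first point, a schedule of this shape is what I would expect to work well (a sketch; the variable names and tile factors are placeholders, not the exact example schedule):

Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)    // innermost loop maps onto HVX vector lanes
    .parallel(y);     // outermost loop becomes one parallel task per row of tiles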
You might try running your test with a much smaller input (e.g. 128x128 or 128x16) just to see if it is truly hanging, or just taking a really long time (due to the parallelized inner loop).
Thanks Dan, I will try that and report back.
Update 4/13/2020: I tried it, and it was indeed just the parallelized inner loop taking a long time. I parallelized the outer loop instead (see comment below) and it finished running. However, I'm confused about why the parallelization didn't make use of Hexagon's 4 hardware threads.
@dsharletg @pranavb-ca @dpalermo
I went and learned more about scheduling in Halide and more about Hexagon. The following schedule is for the conv3x3a16 Halide example in Hexagon SDK 3.5.1.
I want to use all 4 available hardware threads on the Hexagon DSP.
So I tried this:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .parallel(yi);
where vector_size is set to 128 for HVX 128.
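As an aside, I believe vector_size could also be derived from the target instead of being hard-coded (this is my assumption, not how the SDK example does it):

Target target = get_target_from_environment();                 // or get_target() inside a generator
const int vector_size = target.natural_vector_size<uint8_t>(); // 128 lanes for HVX 128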
My understanding is: I'm making 128 x 4 tiles. I'm vectorizing xi to take advantage of HVX. Then I parallelize yi to take advantage of all 4 hardware threads on Hexagon.
However, the simulator is reporting that only one hardware thread is used, not 4, even though I used .parallel(yi).
T0: Insns=18185954 Packets=9686990
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=18185954 Pcycles=19838300
In addition, the parallel strategy came out to run much slower than the original unroll strategy in the example:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .unroll(yi);
The unroll strategy came to 0.1579 cycles/pixel, whereas the parallel strategy came to 2.8886 cycles/pixel.
Can someone please help me understand:
- why .parallel() did not use all 4 hardware threads
- why .parallel() is slower than .unroll()
Thanks!
I think it might be better to parallelize over y and not yi, something like this:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .parallel(y);
In your schedule, the loops are arranged like this:
for y:
    for x:
        for yi:        <- parallelized
            for xi:    <- vectorized
which is probably too fine-grained to parallelize efficiently because of the per-task overhead.
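With .parallel(y) instead, the loops would look roughly like this:
for y:            <- parallelized (one task per row of tiles)
    for x:
        for yi:
            for xi:    <- vectorized
so each parallel task processes a whole row of tiles rather than just the four rows of a single tile.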
I think you can get an even better schedule by doing both parallel() and unroll(), like this:
Func(output)
    .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
    .vectorize(xi)
    .unroll(yi)
    .parallel(y);
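which should give a loop nest roughly like this:
for y:            <- parallelized
    for x:
        for yi:        <- unrolled
            for xi:    <- vectorized
so you keep the straight-line inner code from unrolling while still getting coarse-grained parallelism across rows of tiles.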
@vksnk thanks for your comment!
I tried your suggested schedule, using .unroll(yi) and .parallel(y). The Hexagon simulator reports 0.2091 cycles/pixel, which is slower than the 0.1579 cycles/pixel of .unroll(yi) alone.
Once again, the simulator reports that only 1 hardware thread is being used.
T0: Insns=3233950 Packets=1391755
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=3233950 Pcycles=3073589
This is my main confusion. Why is .parallel() not activating all 4 hardware threads on the Hexagon?
The Hexagon simulator uses a "fake" thread pool that doesn't actually use threads due to undiagnosed issues (#2108).
However, when running on a real Hexagon device, .parallel() uses a real thread pool and you should see increased performance.