Halide
Benchmarking local_laplacian segfaults
This happens with the adams2019 autoscheduler on master. The following series of commands is run from apps/local_laplacian:
make clean
make bin/host/local_laplacian.generator
# Make a runtime
./bin/host/local_laplacian.generator -r runtime -o bin/host target=host
c++ -std=c++17 -O3 -c ../../tools/RunGenMain.cpp -o bin/RunGenMain.o -I ../../distrib/include -I /opt/local/include
mkdir -p results
HL_PERMIT_FAILED_UNROLL=1 \
HL_SEED=256 \
HL_RANDOM_DROPOUT=1 \
HL_BEAM_SIZE=1 \
./bin/host/local_laplacian.generator -g local_laplacian -e stmt,static_library,h,assembly,registration,compiler_log,llvm_assembly -o results -p ../../distrib/lib/libautoschedule_adams2019.dylib target=host-no_runtime-disable_llvm_loop_opt auto_schedule=true -s Adams2019
c++ -std=c++17 results/*.{cpp,a} bin/RunGenMain.o bin/host/runtime.a -I ../../distrib/include/ -L/opt/local/lib -ljpeg -lpng -ltiff -lpthread -ldl -o results/benchmark
results/benchmark --benchmark_min_time=0 --track_memory --benchmarks=all --default_input_buffers=random:0:estimate_then_auto --default_input_scalars --output_extents=estimate --parsable_output
Output:
rm -rf bin
c++ -O3 -std=c++17 -I /Users/alexanderroot/Projects/Halide-auto/distrib/include/ -I /Users/alexanderroot/Projects/Halide-auto/distrib/tools/ -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi -stdlib=libc++ -fvisibility=hidden local_laplacian_generator.cpp /Users/alexanderroot/Projects/Halide-auto/distrib/tools/GenGen.cpp -o bin/host/local_laplacian.generator -Wl,-rpath,/Users/alexanderroot/Projects/Halide-auto/distrib/lib/ -L /Users/alexanderroot/Projects/Halide-auto/distrib/lib/ -lHalide -L/usr/local/opt/llvm/lib -ldl -lpthread -lz -Wl,-force_load /Users/alexanderroot/Projects/Halide-auto/distrib/lib/libautoschedule_adams2019.dylib
generate_schedule for target=x86-64-osx-avx-avx2-disable_llvm_loop_opt-f16c-fma-no_runtime-sse41
Pass 0 of 1, cost: 84.0059, time (ms): 4636
Best cost: 84.0059
Cache (block) hits: 0
Cache (block) misses: 977
Warning:
Not folding Func f152 along dimension v1 because there is vectorized access to that Func in that dimension and storage folding was not explicitly requested in the schedule. In previous versions of Halide this would have folded with factor 8. To restore the old behavior add f152.fold_storage(v1, 8) to your schedule.
Warning:
Not folding Func f103 along dimension v1 because there is vectorized access to that Func in that dimension and storage folding was not explicitly requested in the schedule. In previous versions of Halide this would have folded with factor 32. To restore the old behavior add f103.fold_storage(v1, 32) to your schedule.
Warning:
Not folding Func f156 along dimension v1 because there is vectorized access to that Func in that dimension and storage folding was not explicitly requested in the schedule. In previous versions of Halide this would have folded with factor 8. To restore the old behavior add f156.fold_storage(v1, 8) to your schedule.
Warning:
HL_PERMIT_FAILED_UNROLL is allowing us to unroll a non-constant loop into a serial loop. Did you mean to do this?
Warning:
HL_PERMIT_FAILED_UNROLL is allowing us to unroll a non-constant loop into a serial loop. Did you mean to do this?
ld: warning: directory not found for option '-L/opt/local/lib'
Warning: Using --track_memory with --benchmarks will produce inaccurate benchmark results.
./error.sh: line 21: 46537 Segmentation fault: 11 results/benchmark --benchmark_min_time=0 --track_memory --benchmarks=all --default_input_buffers=random:0:estimate_then_auto --default_input_scalars --output_extents=estimate --parsable_output
LLDB output:
(lldb) run
Process 46559 launched: '/Users/alexanderroot/Projects/Halide-auto/apps/local_laplacian/results/benchmark' (x86_64)
Warning: Using --track_memory with --benchmarks will produce inaccurate benchmark results.
Process 46559 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010000ee33 benchmark`local_laplacian.par_for.output.s0.v1.v1 + 6691
benchmark`local_laplacian.par_for.output.s0.v1.v1:
-> 0x10000ee33 <+6691>: vmovaps 0x1f28(%rsp), %ymm1
0x10000ee3c <+6700>: vmovaps 0x1f40(%rsp), %ymm2
0x10000ee45 <+6709>: vmovaps 0x1f48(%rsp), %ymm3
0x10000ee4e <+6718>: vshufps $0xdd, %ymm3, %ymm1, %ymm4 ; ymm4 = ymm1[1,3],ymm3[1,3],ymm1[5,7],ymm3[5,7]
thread #9, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010000ee33 benchmark`local_laplacian.par_for.output.s0.v1.v1 + 6691
benchmark`local_laplacian.par_for.output.s0.v1.v1:
-> 0x10000ee33 <+6691>: vmovaps 0x1f28(%rsp), %ymm1
0x10000ee3c <+6700>: vmovaps 0x1f40(%rsp), %ymm2
0x10000ee45 <+6709>: vmovaps 0x1f48(%rsp), %ymm3
0x10000ee4e <+6718>: vshufps $0xdd, %ymm3, %ymm1, %ymm4 ; ymm4 = ymm1[1,3],ymm3[1,3],ymm1[5,7],ymm3[5,7]
It's an aligned load from the stack. So either it's a stack overflow, or that address is not aligned. Assuming the stack pointer is 32-byte aligned, the offset 0x1f28 leaves that address only 8-byte aligned, which is not enough for a vmovaps into a ymm register (that needs 32-byte alignment). So this is a miscompilation. Perhaps we're emitting bad alignment info in CodeGen_LLVM?
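For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes, as above, that %rsp itself is 32-byte aligned at the faulting instruction, which the trace does not confirm. The helper names (`misalignment`, `is_aligned`) are made up for illustration and are not part of Halide or LLDB.

```python
# Alignment (in bytes) that an aligned AVX load (vmovaps) requires,
# keyed by the width of the register involved.
REQUIRED_ALIGNMENT = {"xmm": 16, "ymm": 32}


def misalignment(offset: int, alignment: int) -> int:
    """Bytes by which base+offset misses `alignment`.

    Assumption: the base (here %rsp) is itself a multiple of `alignment`.
    """
    return offset % alignment


def is_aligned(offset: int, register: str = "ymm") -> bool:
    """True if base+offset satisfies the aligned-load requirement."""
    return misalignment(offset, REQUIRED_ALIGNMENT[register]) == 0


# Stack offsets of the three vmovaps loads shown in the LLDB disassembly.
OFFSETS = [0x1F28, 0x1F40, 0x1F48]

if __name__ == "__main__":
    for off in OFFSETS:
        print(hex(off), misalignment(off, 32), is_aligned(off))
```

Under that assumption, 0x1f28 and 0x1f48 are each 8 bytes past a 32-byte boundary while 0x1f40 falls exactly on one, so the first load in the disassembly (the one that faults) is misaligned for a ymm vmovaps.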
This could be a bug in modulus_remainder.
Is this still active? Does it need investigation?