Possible heap corruption in qs8-dwconv-bench with primary_tile=25

Open ken-unger opened this issue 11 months ago • 1 comments

While implementing https://github.com/google/XNNPACK/pull/7638, I tried to run qs8-dwconv-bench with xnn_qs8_dwconv_minmax_fp32_ukernel_25p8vc, and the benchmark hits a malloc error after several tests. Running with 9p8vc is fine.

I noticed that the current qs8-dwconv-bench only uses primary_tile = 9 for its scalar benchmarks. Adding a benchmark test to include the scalar primary_tile=25 kernel results in the same apparent heap corruption.

To reproduce, add the following to bench/qs8-dwconv.cc:

static void qs8_dwconv_25p4c__scalar_lrintf(benchmark::State& state, const char* net) {
  DWConvBenchmark(state,
    xnn_qs8_dwconv_minmax_fp32_ukernel_25p4c__scalar_lrintf,
    xnn_init_qs8_conv_minmax_fp32_scalar_params,
    4 /* channel tile */, 25 /* primary tile */);
}

BENCHMARK_DWCONV(qs8_dwconv_25p4c__scalar_lrintf);

Test result.

./qs8-dwconv-bench --benchmark_filter=qs8_dwconv_25p4c__scalar_lrintf
2025-01-07T18:34:14-08:00
Running ./qs8-dwconv-bench
Run on (8 X 1600 MHz CPU s)
CPU Caches:
  L1 Instruction 32 KiB (x8)
  L1 Data 32 KiB (x8)
  L2 Unified 512 KiB (x2)
Load Average: 2.04, 1.65, 0.80
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
qs8_dwconv_25p4c__scalar_lrintf/mobilenet_v1/H:112/W:112/KH:3/KW:3/PH:2/PW:2/S:1/D:1/G:32/real_time         45615502 ns     45204284 ns           15 OPS=158.397M/s bytes=17.6088M/s cpufreq=1.6G
malloc(): invalid size (unsorted)
Aborted

It's not clear to me whether the test case is invalid here or whether there is a bug within DWConvBenchmark.

Jan 08 '25 03:01 ken-unger

I think this is a case of invalid parameters for 5x5. In practice 5x5 is used by mobilenet v3, while 3x3 is used in mobilenet v2.
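To make the mismatch concrete, here is a minimal sketch of the kind of check that would flag it. It assumes a single-pass kernel with primary tile N expects exactly N taps (KH x KW); that is my reading of the situation, not something confirmed in this thread. PrimaryTileMatchesKernel is a hypothetical helper and not part of XNNPACK.

```cpp
#include <cstddef>

// Hypothetical helper (not an XNNPACK function): returns true when a
// depthwise-conv microkernel with `primary_tile` taps matches a layer
// whose filter is kernel_height x kernel_width.
// Assumption: the kernel expects exactly primary_tile taps per output pixel.
constexpr bool PrimaryTileMatchesKernel(std::size_t kernel_height,
                                        std::size_t kernel_width,
                                        std::size_t primary_tile) {
  return kernel_height * kernel_width == primary_tile;
}
```

With the layer from the log above (KH:3/KW:3) and primary tile 25, the check returns false; a 5x5 layer returns true. Under the same assumption, a benchmark could use a check like this to skip mismatched layers, for example through Google Benchmark's state.SkipWithError, instead of running the kernel on them.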

So you could try mobilenet v3.

Taking a quick look at the current models/benchmark --benchmark_filter=V3

FP32MobileNetV3Large/real_time       5602 us         5601 us          125 cpufreq=3.3723G
FP32MobileNetV3Small/real_time       1722 us         1722 us          405 cpufreq=3.30428G
FP16MobileNetV3Large/real_time      14207 us        14200 us           49 cpufreq=3.4632G
FP16MobileNetV3Small/real_time       4880 us         4879 us          146 cpufreq=3.5469G

The QS8 model is missing. The old end2end had it if you dig up old versions. TFLite benchmark_model can do a .tflite file if you can get a mobilenet v3 model.

Jan 23 '25 00:01 fbarchard