openvino icon indicating copy to clipboard operation
openvino copied to clipboard

[CPU] brgconv on AVX2 enabling and switch OV deconv implement to ONEDNN deconv OP from conv_backward_data

Open luweizhou2016 opened this issue 11 months ago • 13 comments

Details:

  • Re-enable conv with brgemm implement on avx2 platform with fp32 , also heuristic cause by wavernn-upsampler regression
  • Switch OV cpu deconv implement from ONEDNN convbackward_data primitive to deconv primitive. Support const and non-const weight; Always expose planar weight layout to cpu graph; weight cache to save blocked const weight ; runtime reorder for none-const weight
  • enable bias fusing for fp32 to replace legacy depthwise fusing
  • enable deconv brgconv for fp32 and INT8 deconv
  • update test case

About test case/WA update:

  • OV has feature to set the output shape of conv_backward_data. Which would cause pad_r < 0 && pad_r + dil < 0, so applied the WA in onednn to create deconv OP
  • Deconv only has one ref entry(CPU_INSTANCE(ref_deconvolution_fwd_t) for JIT,gemm and c++ reference implement. So one layout can't use RTinfo to filter implement. Update test cases with 1x1 related or SSE related. For example: backward data has 2 entries when input is nChw16C : jit_avx512_common_1x1_convolution_bwd_data_f32_t jit_avx512_common_convolution_bwd_data_t to support So OV filter can work to get desired implement type. However, deconv only has only, it means that nChw16C would only get one implement type. So filter would not work.
  • Change stride to 1 unit because of brgemm limitation
  • subgraph concat deconv test can't hit ref implement because the gemm and ref share one entry implement.
  • Conformance test conv related failed, but can pass when merging 3.3 with brg. Failed caused by PR22972. Seems threshold changes. Also AMX and avv512 core would fail on these conformance test but not enabled on Azure platforms. Guess the expected result is based on avx2 jit. So current just remove these test

Tickets:

  • CVS-133120, CVS-122688

luweizhou2016 avatar Mar 21 '24 07:03 luweizhou2016

ONEDNN PR, https://github.com/openvinotoolkit/oneDNN/pull/244

luweizhou2016 avatar Apr 18 '24 08:04 luweizhou2016

@maxnick , can you pls help to review the code. Thx!

luweizhou2016 avatar Apr 18 '24 08:04 luweizhou2016

Deconv only has one ref entry(CPU_INSTANCE(ref_deconvolution_fwd_t) for JIT,gemm and c++ reference implement. So one layout can't use RTinfo to filter implement. Update test cases with 1x1 related or SSE related. For example: backward data has 2 entries when input is nChw16C : jit_avx512_common_1x1_convolution_bwd_data_f32_t jit_avx512_common_convolution_bwd_data_t to support So OV filter can work to get desired implement type. However, deconv only has only, it means that nChw16C would only get one implement type. So filter would not work.

That means that we have to focus mostly on the layout and on the implementation type only when we can das distinguish. For example: For nChw16c set CPUSpecificParams{{nChw16c}, {nChw16c}, {}, ""}, so that the implementation type isn't enforced but selected by oneDNN. The same is true for nChw8c and the plain layout. We shouldn't remove ref impl with the plain layout, as this may be required on other platforms (ARM). Also, obviously we have to add brgemm dedicated deconvolution tests and put nChw16c and nChw8c cases to the nightly scope for avx512 and avx2 convolutions. For deconvolutions we also have to put nChw16c and nChw8c cases to the nightly scope as well, but in that case sse4 will be put to nightly as well as now it's impossible to enforce sse4 and avx deconvolution implementations. As for 1x1 deconvolution, suggest reverting 1x1 filter size, but again, remove the implementation specification (i.e. "avx512_1x1" "avx2_1x1" etc.) for non brgemm scenarious.

maxnick avatar Apr 18 '24 09:04 maxnick

Change stride to 1 unit because of brgemm limitation

Probably should be set only for brgemm cases.

maxnick avatar Apr 18 '24 10:04 maxnick

subgraph concat deconv test can't hit ref implement because the gemm and ref share one entry implement.

The same as above, simply remove the implementation type from config instead of removing the planar layout at all.

maxnick avatar Apr 18 '24 10:04 maxnick

Deconv only has one ref entry(CPU_INSTANCE(ref_deconvolution_fwd_t) for JIT,gemm and c++ reference implement. So one layout can't use RTinfo to filter implement. Update test cases with 1x1 related or SSE related. For example: backward data has 2 entries when input is nChw16C : jit_avx512_common_1x1_convolution_bwd_data_f32_t jit_avx512_common_convolution_bwd_data_t to support So OV filter can work to get desired implement type. However, deconv only has only, it means that nChw16C would only get one implement type. So filter would not work.

That means that we have to focus mostly on the layout and on the implementation type only when we can das distinguish. For example: For nChw16c set CPUSpecificParams{{nChw16c}, {nChw16c}, {}, ""}, so that the implementation type isn't enforced but selected by oneDNN. The same is true for nChw8c and the plain layout. We shouldn't remove ref impl with the plain layout, as this may be required on other platforms (ARM). Also, obviously we have to add brgemm dedicated deconvolution tests and put nChw16c and nChw8c cases to the nightly scope for avx512 and avx2 convolutions. For deconvolutions we also have to put nChw16c and nChw8c cases to the nightly scope as well, but in that case sse4 will be put to nightly as well as now it's impossible to enforce sse4 and avx deconvolution implementations. As for 1x1 deconvolution, suggest reverting 1x1 filter size, but again, remove the implementation specification (i.e. "avx512_1x1" "avx2_1x1" etc.) for non brgemm scenarious.

Will apply in the update. Pls help to review other changes. Thx!

luweizhou2016 avatar Apr 22 '24 09:04 luweizhou2016

@maxnick , Applied review comments on test cases. pls help to review. Thx!

luweizhou2016 avatar Apr 23 '24 06:04 luweizhou2016

Change stride to 1 unit because of brgemm limitation

What happens if the stride is different from 1?

maxnick avatar Apr 23 '24 17:04 maxnick

Change stride to 1 unit because of brgemm limitation.

What exactly happens if the stride is other than 1?

maxnick avatar May 02 '24 17:05 maxnick

Change stride to 1 unit because of brgemm limitation.

What exactly happens if the stride is other than 1?

stride 1 would cause some cases fall back on jit with nhwc layout.

luweizhou2016 avatar May 07 '24 13:05 luweizhou2016

@maxnick , Applied review comments.

luweizhou2016 avatar May 07 '24 13:05 luweizhou2016

https://jira.devtools.intel.com/browse/CVS-140606 to track the acc issue on 1x1 kernel with details.

luweizhou2016 avatar May 08 '24 07:05 luweizhou2016

@maxnick , Applied review comment. Pls help to review. Thx!

luweizhou2016 avatar May 09 '24 05:05 luweizhou2016

PR is waiting for the internal validation results.

maxnick avatar May 14 '24 05:05 maxnick

PR is ready to be merged.

maxnick avatar May 15 '24 07:05 maxnick