Tensorization for avx2
Summary: Like the existing avx512 tensorization, which reduces data:1x4 and kernel:16x4 to output:1x16, this PR introduces the same reduction using avx2 tensorization. It keeps the same API as avx512 so that no new memory layout for weights is needed.
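For reviewers unfamiliar with the shape notation, here is a rough scalar reference of what the data:1x4, kernel:16x4 to output:1x16 reduction computes. It is only a sketch, not the intrinsic this PR adds: the function name `dot_1x4_16x4` is hypothetical, and plain Python integers stand in for the 32-bit accumulators implied by the test name (`acc32`).

```python
def dot_1x4_16x4(data, kernel):
    """Scalar reference: 4 data values against 16 kernel rows of 4 values each.

    Hypothetical helper for illustration only. Python ints do not overflow,
    so they merely stand in for 32-bit accumulators.
    """
    assert len(data) == 4
    assert len(kernel) == 16 and all(len(row) == 4 for row in kernel)
    # Each output lane is the dot product of the data vector with one kernel row.
    return [sum(d * w for d, w in zip(data, row)) for row in kernel]


data = [1, 2, 3, 4]
kernel = [[i, i, i, i] for i in range(16)]  # row i is filled with the value i
out = dot_1x4_16x4(data, kernel)            # 16 accumulated sums, one per kernel row
```

Because every output lane reads the same four data values, the weights can stay in the 16x4 layout already used for avx512, which is why no new weight layout is required.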
Test Plan: On an avx2 machine, run: python tests/python/contrib/test_gemm_avx2_acc32.py
NVM, just saw your other PR. :)
Aaah this one is messed up. My base branch was tensorize_fix. So it shows changes from that. Let me fix this.
Depends on this PR: https://github.com/facebookexperimental/tvm/pull/7
Benchmark numbers for m = n = k = 1024: Tensorization running time: 25.363 ms, 84.67 Gops/s
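As a sanity check on the throughput figure, the sketch below recomputes Gops/s from the reported running time. It assumes the usual GEMM convention of counting one multiply and one add per inner-product term (2 * m * n * k operations); the helper name `gemm_gops` is hypothetical and not part of the benchmark script.

```python
def gemm_gops(m, n, k, seconds):
    """Throughput in Gops/s, assuming 2 * m * n * k operations per GEMM."""
    ops = 2 * m * n * k          # one multiply plus one add per term
    return ops / seconds / 1e9   # operations per second, scaled to Gops/s


# Reported running time: 25.363 ms for m = n = k = 1024.
gops = gemm_gops(1024, 1024, 1024, 25.363e-3)
print(f"{gops:.2f} Gops/s")
```

Under that convention the result agrees with the 84.67 Gops/s quoted above.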