
Add Mx2, Mx4, 2xN, and 4xN avx512 transpose

Open CaoE opened this pull request 2 years ago • 6 comments

Add Mx2, Mx4, 2xN, and 4xN specific transposes on avx512 to improve transpose performance for shapes of the form Mx2, Mx4, 2xN, and 4xN.

  • When the shape is Mx2 or Mx4 and N == ld_src, Mx2 and Mx4 transposes will achieve higher performance.
  • When the shape is 2xN or 4xN and M == ld_dst, 2xN and 4xN transposes will achieve higher performance.
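To make the contiguity condition concrete, here is a scalar reference sketch (not the AVX-512 kernel itself; the function name is ours) of the Mx4 case with N == ld_src: the 4-wide rows are fully contiguous, so a vectorized kernel can load several rows per vector instead of looping element by element as below.

```cpp
#include <cstdint>

// Scalar reference for the M x 4 case with N == ld_src == 4:
// src holds M contiguous rows of 4 floats; dst receives 4 rows of
// M floats with row stride ld_dst. An AVX-512 kernel exploits the
// contiguity of src to load many rows in one vector load.
void transpose_Mx4_ref(int64_t M, const float* src, float* dst, int64_t ld_dst) {
  for (int64_t i = 0; i < M; ++i) {
    for (int64_t j = 0; j < 4; ++j) {
      dst[j * ld_dst + i] = src[i * 4 + j];
    }
  }
}
```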

CaoE avatar Jun 26 '22 09:06 CaoE

Deploy Preview for eclectic-stroopwafel-199537 canceled.

Latest commit: d5621e01cb64ff9b023b0d3bb8aa22bd93ecf11e
Latest deploy log: https://app.netlify.com/sites/eclectic-stroopwafel-199537/deploys/630cff6ed481d800088dc51b

netlify[bot] avatar Jun 26 '22 09:06 netlify[bot]

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jul 06 '22 03:07 facebook-github-bot

There are still some optimizations to be done, for example, reducing the mixing of avx512 and avx2 instructions. Sorry for the inconvenience.

CaoE avatar Jul 06 '22 05:07 CaoE

Hi @jianyuh, I found that the parameter data types of transpose_avx512 and transpose_simd are not aligned:

template <typename T>
FBGEMM_API void transpose_simd(
    unsigned M,
    unsigned N,
    const T* src,
    unsigned ld_src,
    T* dst,
    unsigned ld_dst);

template <typename T>
void transpose_avx512(
    int64_t M,
    int64_t N,
    const T* src,
    unsigned ld_src,
    T* dst,
    unsigned ld_dst);

In PyTorch, there is also the usage fbgemm::transpose_simd<float>(M, N, src, ld_src, dst, ld_dst), where M, N, ld_src, and ld_dst are of type int64_t. Do we need to make the parameter types consistent?
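One reason the mismatch matters: an int64_t dimension passed to an unsigned parameter is silently narrowed. A minimal sketch (the helper name is ours, not fbgemm code) of what happens at such a call site:

```cpp
#include <cstdint>

// Mimics the narrowing that occurs when an int64_t dimension is
// passed to a parameter declared `unsigned`: a value that does not
// fit in 32 bits wraps around instead of failing loudly.
uint32_t as_unsigned_dim(int64_t n) {
  return static_cast<uint32_t>(n); // implicit at a real call site
}
```

For the sizes used in practice the conversion is lossless, but aligning the signatures removes the wraparound hazard and the compiler warnings.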

CaoE avatar Jul 25 '22 02:07 CaoE

@CaoE Thanks for identifying this issue! Ideally we should make the types of M, N, ld_src, ld_dst consistent. Feel free to update the PR to fix this, or we can follow up and fix it later. cc @jiyuanzFB

jianyuh avatar Aug 15 '22 14:08 jianyuh

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 15 '22 14:08 facebook-github-bot

Hi @CaoE, thanks for adding the optimized transpose to fbgemm. Our internal test reports a memory leak issue on TransposeTest with this patch. Could you please take a look? Thanks.

Failed to parse output from the test suite: Test binary probably crashed during execution, original error: IO error: No result xml files found.
Execution directory: /tmp/tpx-20220816-144609.067435-efd55345-7a2d-4e32-9c9c-110bdafeec7e/2bca7546-f38a-44b2-8116-0f2230713ecd

stdout:
Note: Google Test filter = TransposeTest.TransposeTest
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TransposeTest
[ RUN      ] TransposeTest.TransposeTest

stderr:

==6336==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6110000086d8 at pc 0x7f24ddb01d34 bp 0x7ffcb8155770 sp 0x7ffcb8155768
READ of size 4 at 0x6110000086d8 thread T0
SCARINESS: 17 (4-byte-read-heap-buffer-overflow)
    #0 0x7f24ddb01d33 in fbgemm::internal::transpose_contiguous_32x4_block(unsigned short const*, unsigned short*, int, int) deeplearning/fbgemm/src/UtilsAvx512.cc:1070
    #1 0x7f24ddaf9ddd in void fbgemm::internal::transpose_avx512_contiguous_thin(long, long, unsigned short const*, unsigned int, unsigned short*, unsigned int) deeplearning/fbgemm/src/UtilsAvx512.cc:1861
    #2 0x7f24ddbc149a in void fbgemm::internal::transpose_avx512(long, long, unsigned short const*, unsigned int, unsigned short*, unsigned int) deeplearning/fbgemm/src/UtilsAvx512.cc:2331
    #3 0x7f24dfbe0818 in void fbgemm::transpose_simd(unsigned int, unsigned int, unsigned short const*, unsigned int, unsigned short*, unsigned int) deeplearning/fbgemm/src/TransposeUtils.cc:50
    #4 0x2f4515 in TransposeTest_TransposeTest_Test::TestBody() deeplearning/fbgemm/test/TransposeTest.cc:124
    #5 0x7f24d9c9aad5 in testing::Test::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2682:50
    #6 0x7f24d9c9aad5 in testing::Test::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2672:6
    #7 0x7f24d9c9ac64 in testing::TestInfo::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2861:14
    #8 0x7f24d9c9ac64 in testing::TestInfo::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2833:6
    #9 0x7f24d9c9b321 in testing::TestSuite::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:3015:31
    #10 0x7f24d9c9b321 in testing::TestSuite::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2993:6
    #11 0x7f24d9c9bb1e in testing::internal::UnitTestImpl::RunAllTests() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:5855:47
    #12 0x7f24d9c9ad87 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2665:29
    #13 0x7f24d9c9ad87 in testing::UnitTest::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:5438:55
    #14 0x7f24dfe7eae0 in RUN_ALL_TESTS() third-party-buck/platform010/build/googletest/include/gtest/gtest.h:2490
    #15 0x7f24dfe7e758 in main common/gtest/LightMain.cpp:20
    #16 0x7f24d9528656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #17 0x7f24d9528717 in __libc_start_main_impl /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3
    #18 0x2bfe70 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116
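The report points at an out-of-bounds read in the tail handling of transpose_contiguous_32x4_block. A common cause of this class of bug is loading a full vector-width block when fewer elements remain at the end of the buffer. A hedged scalar sketch (hypothetical 8-element block, not the fbgemm code) of the safe pattern, which bounds the read by the remaining element count, as a masked AVX-512 load does:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 8-element "vector" load illustrating the tail-handling
// pitfall behind heap-buffer-overflow reports like the one above:
// reading a full block past the end of src is out of bounds even if
// the extra lanes are later discarded. This safe variant never reads
// past src + remaining.
void load_tail_safe(const float* src, int64_t remaining, float* block) {
  const int64_t n = remaining < 8 ? remaining : 8;
  std::memset(block, 0, 8 * sizeof(float));   // zero the unused lanes
  std::memcpy(block, src, n * sizeof(float)); // bounded read
}
```

With ASan, an unbounded full-block read on the last partial tile triggers exactly the 4-byte heap-buffer-overflow READ shown in the log.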

jiyuanzFB avatar Aug 17 '22 17:08 jiyuanzFB

Hi @jiyuanzFB, could you please share the shapes and the test environment? What is the difference between the CI and the internal test environment? I cannot reproduce the issue in my local environment. Thanks.

CaoE avatar Aug 18 '22 02:08 CaoE

I have made the types of M, N, ld_src, and ld_dst consistent.

CaoE avatar Aug 25 '22 07:08 CaoE

Hi @jianyuh @jiyuanzFB, could you check whether the internal UT error is now fixed? Thank you very much.

CaoE avatar Aug 25 '22 08:08 CaoE

@jiyuanzFB has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 29 '22 18:08 facebook-github-bot

@CaoE Thanks a lot for the awesome contribution!

jianyuh avatar Aug 30 '22 21:08 jianyuh