
Add Mx2, Mx4, 2xN, and 4xN avx512 transpose

Open CaoE opened this pull request 2 years ago • 6 comments

Add Mx2, Mx4, 2xN, and 4xN specific transposes on avx512 to improve transpose performance for shapes of the form Mx2, Mx4, 2xN, and 4xN.

  • When the shape is Mx2 or Mx4 and N == ld_src, Mx2 and Mx4 transposes will achieve higher performance.
  • When the shape is 2xN or 4xN and M == ld_dst, 2xN and 4xN transposes will achieve higher performance.
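To make the contiguity condition concrete, here is a scalar reference sketch (not the AVX-512 kernel itself; the function name is ours) of the Mx4 case with N == ld_src: the 4-wide rows are fully contiguous, so a vectorized kernel can load several rows per vector instead of looping element by element as below.

```cpp
#include <cstdint>

// Scalar reference for the M x 4 case with N == ld_src == 4:
// src holds M contiguous rows of 4 floats; dst receives 4 rows of
// M floats with row stride ld_dst. An AVX-512 kernel exploits the
// contiguity of src to load many rows in one vector load.
void transpose_Mx4_ref(int64_t M, const float* src, float* dst, int64_t ld_dst) {
  for (int64_t i = 0; i < M; ++i) {
    for (int64_t j = 0; j < 4; ++j) {
      dst[j * ld_dst + i] = src[i * 4 + j];
    }
  }
}
```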

CaoE avatar Jun 26 '22 09:06 CaoE

Deploy Preview for eclectic-stroopwafel-199537 canceled.

Latest commit: d5621e01cb64ff9b023b0d3bb8aa22bd93ecf11e
Latest deploy log: https://app.netlify.com/sites/eclectic-stroopwafel-199537/deploys/630cff6ed481d800088dc51b

netlify[bot] avatar Jun 26 '22 09:06 netlify[bot]

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jul 06 '22 03:07 facebook-github-bot

There are still some optimizations to be done, for example, reducing the mixing of avx512 and avx2 instructions. Sorry for the inconvenience.

CaoE avatar Jul 06 '22 05:07 CaoE

Hi @jianyuh, I found that the parameter data types of transpose_avx512 and transpose_simd are not aligned:

template <typename T>
FBGEMM_API void transpose_simd(
    unsigned M,
    unsigned N,
    const T* src,
    unsigned ld_src,
    T* dst,
    unsigned ld_dst);

template <typename T>
void transpose_avx512(
    int64_t M,
    int64_t N,
    const T* src,
    unsigned ld_src,
    T* dst,
    unsigned ld_dst);

In PyTorch, there is also the usage fbgemm::transpose_simd<float>(M, N, src, ld_src, dst, ld_dst), where M, N, ld_src, and ld_dst are of type int64_t. Do we need to make the parameter types consistent?
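One reason the mismatch matters: an int64_t dimension passed to an unsigned parameter is silently narrowed. A minimal sketch (the helper name is ours, not fbgemm code) of what happens at such a call site:

```cpp
#include <cstdint>

// Mimics the narrowing that occurs when an int64_t dimension is
// passed to a parameter declared `unsigned`: a value that does not
// fit in 32 bits wraps around instead of failing loudly.
uint32_t as_unsigned_dim(int64_t n) {
  return static_cast<uint32_t>(n); // implicit at a real call site
}
```

For the sizes used in practice the conversion is lossless, but aligning the signatures removes the wraparound hazard and the compiler warnings.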

CaoE avatar Jul 25 '22 02:07 CaoE

@CaoE Thanks for identifying this issue! Ideally we should make the types of M, N, ld_src, ld_dst consistent. Feel free to update the PR to fix this, or we can follow up and fix it later. cc @jiyuanzFB

jianyuh avatar Aug 15 '22 14:08 jianyuh

@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 15 '22 14:08 facebook-github-bot

Hi @CaoE, thanks for adding the optimized transpose to fbgemm. Our internal test reports a memory leak issue on TransposeTest with this patch. Could you please take a look? Thanks.

Failed to parse output from the test suite: Test binary probably crashed during execution, original error: IO error: No result xml files found.
Execution directory: /tmp/tpx-20220816-144609.067435-efd55345-7a2d-4e32-9c9c-110bdafeec7e/2bca7546-f38a-44b2-8116-0f2230713ecd

stdout:
Note: Google Test filter = TransposeTest.TransposeTest
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TransposeTest
[ RUN      ] TransposeTest.TransposeTest

stderr:

==6336==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6110000086d8 at pc 0x7f24ddb01d34 bp 0x7ffcb8155770 sp 0x7ffcb8155768
READ of size 4 at 0x6110000086d8 thread T0
SCARINESS: 17 (4-byte-read-heap-buffer-overflow)
    #0 0x7f24ddb01d33 in fbgemm::internal::transpose_contiguous_32x4_block(unsigned short const*, unsigned short*, int, int) deeplearning/fbgemm/src/UtilsAvx512.cc:1070
    #1 0x7f24ddaf9ddd in void fbgemm::internal::transpose_avx512_contiguous_thin(long, long, unsigned short const*, unsigned int, unsigned short*, unsigned int) deeplearning/fbgemm/src/UtilsAvx512.cc:1861
    #2 0x7f24ddbc149a in void fbgemm::internal::transpose_avx512(long, long, unsigned short const*, unsigned int, unsigned short*, unsigned int) deeplearning/fbgemm/src/UtilsAvx512.cc:2331
    #3 0x7f24dfbe0818 in void fbgemm::transpose_simd(unsigned int, unsigned int, unsigned short const*, unsigned int, unsigned short*, unsigned int) deeplearning/fbgemm/src/TransposeUtils.cc:50
    #4 0x2f4515 in TransposeTest_TransposeTest_Test::TestBody() deeplearning/fbgemm/test/TransposeTest.cc:124
    #5 0x7f24d9c9aad5 in testing::Test::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2682:50
    #6 0x7f24d9c9aad5 in testing::Test::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2672:6
    #7 0x7f24d9c9ac64 in testing::TestInfo::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2861:14
    #8 0x7f24d9c9ac64 in testing::TestInfo::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2833:6
    #9 0x7f24d9c9b321 in testing::TestSuite::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:3015:31
    #10 0x7f24d9c9b321 in testing::TestSuite::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2993:6
    #11 0x7f24d9c9bb1e in testing::internal::UnitTestImpl::RunAllTests() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:5855:47
    #12 0x7f24d9c9ad87 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:2665:29
    #13 0x7f24d9c9ad87 in testing::UnitTest::Run() /home/engshare/third-party2/googletest/1.11.0/src/googletest/googletest/src/gtest.cc:5438:55
    #14 0x7f24dfe7eae0 in RUN_ALL_TESTS() third-party-buck/platform010/build/googletest/include/gtest/gtest.h:2490
    #15 0x7f24dfe7e758 in main common/gtest/LightMain.cpp:20
    #16 0x7f24d9528656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #17 0x7f24d9528717 in __libc_start_main_impl /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3
    #18 0x2bfe70 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116
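The report points at an out-of-bounds read in the tail handling of transpose_contiguous_32x4_block. A common cause of this class of bug is loading a full vector-width block when fewer elements remain at the end of the buffer. A hedged scalar sketch (hypothetical 8-element block, not the fbgemm code) of the safe pattern, which bounds the read by the remaining element count, as a masked AVX-512 load does:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 8-element "vector" load illustrating the tail-handling
// pitfall behind heap-buffer-overflow reports like the one above:
// reading a full block past the end of src is out of bounds even if
// the extra lanes are later discarded. This safe variant never reads
// past src + remaining.
void load_tail_safe(const float* src, int64_t remaining, float* block) {
  const int64_t n = remaining < 8 ? remaining : 8;
  std::memset(block, 0, 8 * sizeof(float));   // zero the unused lanes
  std::memcpy(block, src, n * sizeof(float)); // bounded read
}
```

With ASan, an unbounded full-block read on the last partial tile triggers exactly the 4-byte heap-buffer-overflow READ shown in the log.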

jiyuanzFB avatar Aug 17 '22 17:08 jiyuanzFB

Hi @jiyuanzFB, could you please share the shapes and the test environment? What is the difference between the CI and the internal test environment? I cannot reproduce the issue in my local environment. Thanks.

CaoE avatar Aug 18 '22 02:08 CaoE

I have made the types of M, N, ld_src, and ld_dst consistent.

CaoE avatar Aug 25 '22 07:08 CaoE

Hi @jianyuh @jiyuanzFB, could you check whether the internal UT error is now fixed? Thank you very much.

CaoE avatar Aug 25 '22 08:08 CaoE

@jiyuanzFB has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 29 '22 18:08 facebook-github-bot

@CaoE Thanks a lot for the awesome contribution!

jianyuh avatar Aug 30 '22 21:08 jianyuh