FBGEMM
Add Mx2, Mx4, 2xN, and 4xN avx512 transpose
Add shape-specific Mx2, Mx4, 2xN, and 4xN transposes on avx512 to improve transpose performance for these shapes.
- When the shape is Mx2 or Mx4 and N == ld_src, the Mx2 and Mx4 transposes achieve higher performance.
- When the shape is 2xN or 4xN and M == ld_dst, the 2xN and 4xN transposes achieve higher performance (see the sketch below).
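To make the dispatch conditions concrete, here is a minimal scalar sketch. This is an illustration only, not FBGEMM's actual implementation; `transpose_ref` and `transpose_dispatch` are hypothetical stand-ins for the real kernels:

```cpp
#include <cstdint>

// Scalar reference transpose, standing in for the real simd kernels.
template <typename T>
void transpose_ref(int64_t M, int64_t N, const T* src, int64_t ld_src,
                   T* dst, int64_t ld_dst) {
  for (int64_t i = 0; i < M; ++i)
    for (int64_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

// Hypothetical dispatch showing when the new fast paths would apply.
template <typename T>
void transpose_dispatch(int64_t M, int64_t N, const T* src, int64_t ld_src,
                        T* dst, int64_t ld_dst) {
  if ((N == 2 || N == 4) && N == ld_src) {
    // Mx2 / Mx4 path: every source row is contiguous, so the whole
    // input can be streamed as one linear buffer.
    transpose_ref(M, N, src, ld_src, dst, ld_dst);  // specialized kernel here
  } else if ((M == 2 || M == 4) && M == ld_dst) {
    // 2xN / 4xN path: the transposed output is fully contiguous.
    transpose_ref(M, N, src, ld_src, dst, ld_dst);  // specialized kernel here
  } else {
    transpose_ref(M, N, src, ld_src, dst, ld_dst);  // generic path
  }
}
```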
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
There are still some optimizations to be done, for example reducing the mixing of avx512 and avx2 instructions. Sorry for the inconvenience.
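For illustration only (an assumed example of the kind of cleanup meant here, not code from this PR): the same cross-lane permute written once per instruction family, so a kernel can stay within one family instead of mixing the two.

```cpp
#include <immintrin.h>

// AVX2 variant: reverse the 8 floats of a 256-bit vector.
__m256 reverse_avx2(__m256 v) {
  const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
  return _mm256_permutevar8x32_ps(v, idx);
}

// AVX-512 counterpart: reverse the 16 floats of a 512-bit vector.
__m512 reverse_avx512(__m512 v) {
  const __m512i idx = _mm512_setr_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                        7, 6, 5, 4, 3, 2, 1, 0);
  return _mm512_permutexvar_ps(idx, v);
}
```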
Hi @jianyuh, I found that the parameter data types of transpose_avx512 and transpose_simd are not aligned:

```cpp
template <typename T>
FBGEMM_API void transpose_simd(
    unsigned M,
    unsigned N,
    const T* src,
    unsigned ld_src,
    T* dst,
    unsigned ld_dst);

template <typename T>
void transpose_avx512(
    int64_t M,
    int64_t N,
    const T* src,
    unsigned ld_src,
    T* dst,
    unsigned ld_dst);
```

In PyTorch there is also the usage fbgemm::transpose_simd<float>(M, N, src, ld_src, dst, ld_dst), where M, N, ld_src, and ld_dst are of type int64_t. Do we need to make the parameter types consistent?
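For context on why the mismatch matters, a small self-contained illustration (not from the PR): passing an int64_t dimension into an `unsigned` parameter silently truncates it to 32 bits.

```cpp
#include <cstdint>
#include <iostream>

void take_unsigned(unsigned M) { std::cout << M << '\n'; }

int main() {
  int64_t M = (1LL << 32) + 5;  // larger than UINT_MAX
  take_unsigned(M);             // implicit narrowing: prints 5, not 4294967301
  return 0;
}
```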
@CaoE Thanks for identifying this issue! Ideally we should make the types of M, N, ld_src, ld_dst consistent. Feel free to update the PR to fix this, or we can follow up and fix it later. cc @jiyuanzFB
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @CaoE, thanks for adding the optimized transpose to fbgemm. Our internal test reports a memory leak issue on TransposeTest with this patch. Could you please take a look? Thanks.
Failed to parse output from the test suite: Test binary probably crashed during execution, original error: IO error: No result xml files found. Execution directory: /tmp/tpx-20220816-144609.067435-efd55345-7a2d-4e32-9c9c-110bdafeec7e/2bca7546-f38a-44b2-8116-0f2230713ecd

stdout:

```
Note: Google Test filter = TransposeTest.TransposeTest
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TransposeTest
[ RUN      ] TransposeTest.TransposeTest
```

stderr:

```
==6336==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6110000086d8 at pc 0x7f24ddb01d34 bp 0x7ffcb8155770 sp 0x7ffcb8155768
READ of size 4 at 0x6110000086d8 thread T0
SCARINESS: 17 (4-byte-read-heap-buffer-overflow)
    #0 0x7f24ddb01d33 in fbgemm::internal::transpose_contiguous_32x4_block(unsigned short const*, unsigned short*, int, int) deeplearning/fbgemm/src/UtilsAvx512.cc:1070
    #1 0x7f24ddaf9ddd in void fbgemm::internal::transpose_avx512_contiguous_thin
```
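A common cause of this kind of report in tail handling (an assumption, not a confirmed diagnosis of this crash): an unmasked vector load in the last 32x4 block reading past the end of the source buffer. A masked tail load along these lines avoids the out-of-bounds read:

```cpp
#include <immintrin.h>
#include <cstdint>

// Masked tail load for 16-bit elements: only the `remaining` in-bounds
// lanes are read; out-of-bounds lanes are zero-filled instead of touched.
inline __m512i load_tail_u16(const uint16_t* src, int remaining) {
  __mmask32 mask = (remaining >= 32)
      ? 0xFFFFFFFFu
      : ((1u << remaining) - 1u);
  return _mm512_maskz_loadu_epi16(mask, src);
}
```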
Hi @jiyuanzFB, could you please share the shapes and test environment? What is the difference between the CI and internal test environments? I cannot reproduce the issue in my local environment. Thanks.
Make the types of M, N, ld_src, ld_dst consistent.
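A sketch of what the aligned declarations might look like, assuming both entry points move to int64_t dimensions (the actual signatures are whatever this commit lands):

```cpp
#include <cstdint>

#ifndef FBGEMM_API
#define FBGEMM_API  // stub so the sketch is self-contained
#endif

template <typename T>
FBGEMM_API void transpose_simd(
    int64_t M,
    int64_t N,
    const T* src,
    int64_t ld_src,
    T* dst,
    int64_t ld_dst);

template <typename T>
void transpose_avx512(
    int64_t M,
    int64_t N,
    const T* src,
    int64_t ld_src,
    T* dst,
    int64_t ld_dst);
```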
Hi @jianyuh @jiyuanzFB, could you check whether the internal UT error is fixed now? Thank you very much.
@jiyuanzFB has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@CaoE Thanks a lot for the awesome contribution!