Improve perf of _block_bucketize_sparse_features_cuda_kernel2 - for pooled emb

Open AlbertDachiChen opened this issue 1 year ago • 1 comments

Summary: X-link: https://github.com/pytorch/FBGEMM/pull/2346

This diff tries to further parallelize _block_bucketize_sparse_features_cuda_kernel2 (the kernel that bucketizes the data vector), based on the assumption that the order of IDs within an ID-list feature does not matter because they will be pooled.

Without the change:

INFO:root:Start to benchmark ...
INFO:2024-02-20 12:28:34 463824:476125 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
INFO:root:uneven_block_bucketize_sparse_features forward: torch.int64, 1574639424 bytes read/write, 50.14653396606445 ms, 31.400762913456834 GB/s
INFO:root:Start to benchmark ...
INFO:root:even_block_bucketize_sparse_features forward: torch.int64, 1574639424 bytes read/write, 38.90265655517578 ms, 40.476398360268355 GB/s

With the change:

INFO:2024-02-20 12:30:07 507530:516345 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
INFO:root:uneven_block_bucketize_sparse_features forward: torch.int64, 1181149568 bytes read/write, 5.501140117645264 ms, 214.70995879770197 GB/s
INFO:root:Start to benchmark ...
INFO:root:even_block_bucketize_sparse_features forward: torch.int64, 1181149568 bytes read/write, 5.078537464141846 ms, 232.57671649363064 GB/s
[albertchen@devgpu020]~/fbsource/fbcode% buck run mode/opt //deeplearning/fbgemm/fbgemm_gpu:sparse_ops_benchmark -c fbcode.enable_gpu_sections=true -- block-bucketize-sparse-features-bench --row-size=10000 --element-num=49186232 --batch-size=2500 --device cuda

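Kernel perf improves by more than 5 times in reported bandwidth: 31.4 to 214.7 GB/s for the uneven case and 40.5 to 232.6 GB/s for the even case.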
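As a sanity check on the numbers in the logs above, here is a small sketch that recomputes bandwidth as bytes / seconds / 1e9 and derives both the latency speedup and the bandwidth ratio. The `runs` dict and the `gbps` helper are illustrative names, not part of the benchmark. Note that the logs report different byte counts before and after the change, so the two ratios are not the same.

```python
# (bytes read/write, forward time in ms), copied from the benchmark logs above.
runs = {
    "uneven": {"before": (1574639424, 50.14653396606445),
               "after": (1181149568, 5.501140117645264)},
    "even": {"before": (1574639424, 38.90265655517578),
             "after": (1181149568, 5.078537464141846)},
}


def gbps(num_bytes, time_ms):
    """Bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    return num_bytes / (time_ms * 1e-3) / 1e9


results = {}
for name, run in runs.items():
    bw_before = gbps(*run["before"])
    bw_after = gbps(*run["after"])
    latency_speedup = run["before"][1] / run["after"][1]
    bandwidth_ratio = bw_after / bw_before
    results[name] = (bw_before, bw_after, latency_speedup, bandwidth_ratio)
    print(f"{name}: {bw_before:.1f} -> {bw_after:.1f} GB/s, "
          f"latency {latency_speedup:.2f}x, bandwidth {bandwidth_ratio:.2f}x")
```

The "more than 5 times" figure corresponds to the bandwidth ratio; measured purely by forward time, the improvement is larger.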

Before https://fburl.com/perfdoctor/4bh0qu3x {F1460471239}

After https://fburl.com/perfdoctor/2cnb3s0p {F1460471412}

It is 6 times faster
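The assumption stated in the summary, that ID order within an ID list does not matter once the IDs are pooled, can be illustrated with a minimal CPU sketch. This is not the CUDA kernel: `block_bucketize`, `sum_pool`, and the integer `weights` are hypothetical, and the bucket rule (bucket = id // block_size, local id = id % block_size) is a simplified assumption about the op. A shuffled write order within each bucket stands in for threads writing in parallel.

```python
import random


def block_bucketize(ids, block_size, num_buckets):
    """Simplified rule: bucket = id // block_size, local id = id % block_size."""
    buckets = [[] for _ in range(num_buckets)]
    for i in ids:
        buckets[i // block_size].append(i % block_size)
    return buckets


def sum_pool(local_ids, shard_weights):
    """Sum pooling over one bucket's local ids."""
    return sum(shard_weights[j] for j in local_ids)


block_size, num_buckets = 4, 3
ids = [9, 1, 4, 10, 2, 7, 1]  # one ID-list feature for one sample

# Hypothetical integer "embedding" weights, one shard per bucket.
weights = [[(b * block_size + j) * 10 for j in range(block_size)]
           for b in range(num_buckets)]

# Sequential bucketization preserves input order inside each bucket.
ordered = block_bucketize(ids, block_size, num_buckets)

# Emulate a parallel kernel: same ids per bucket, arbitrary order.
rng = random.Random(0)
unordered = [rng.sample(bucket, len(bucket)) for bucket in ordered]

pooled_ordered = [sum_pool(b, w) for b, w in zip(ordered, weights)]
pooled_unordered = [sum_pool(b, w) for b, w in zip(unordered, weights)]
assert pooled_ordered == pooled_unordered
```

Integer weights keep the comparison exact; with floating-point embeddings a different summation order can change the result by rounding error. The relaxation does not apply to unpooled (sequence) features, where the position of each ID is meaningful.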

Differential Revision: D53961140

AlbertDachiChen, Feb 23 '24 20:02

This pull request was exported from Phabricator. Differential Revision: D53961140

facebook-github-bot, Feb 23 '24 20:02
