implementation of fbgemm op - regroup_keyed_tensor

Open TroyGarden opened this issue 1 year ago • 3 comments

Summary: X-link: https://github.com/pytorch/torchrec/pull/2128

context

  • Current production uses fbgemm.permute_pooled_embs_auto_grad for KT.regroup.
  • It has two downsides: a) it needs to perform a torch.cat operation, which costs memory and time; b) it only supports groupings with no duplicates, and otherwise falls back to a slower native PyTorch implementation.
  • The new implementation uses fbgemm.permute_multi_embedding for the same function: a) it doesn't need torch.cat, so it saves memory and time; b) it supports duplicates in the grouping without sacrificing performance.
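
To make the regroup semantics above concrete, here is a minimal pure-Python sketch of what regrouping has to compute: for each output group, which column segments are copied from which input. The names `build_permute_plan` and `apply_plan` and the segment layout are hypothetical, chosen for illustration; they are not the argument format of fbgemm.permute_multi_embedding, and plain lists stand in for tensors.

```python
from typing import Dict, List, Tuple

# Hypothetical segment layout:
# (input index, start column in input, start column in output, length)
Segment = Tuple[int, int, int, int]


def build_permute_plan(
    keys: List[List[str]],
    lengths: List[List[int]],
    groups: List[List[str]],
) -> List[List[Segment]]:
    """For each output group, list the column segments to copy from the inputs."""
    # Locate every key: which input holds it, at which column, and how wide it is.
    location: Dict[str, Tuple[int, int, int]] = {}
    for t, (tensor_keys, tensor_lengths) in enumerate(zip(keys, lengths)):
        start = 0
        for key, length in zip(tensor_keys, tensor_lengths):
            location[key] = (t, start, length)
            start += length

    plan: List[List[Segment]] = []
    for group in groups:
        segments: List[Segment] = []
        out_start = 0
        for key in group:  # the same key may appear in several groups (duplicates)
            t, in_start, length = location[key]
            segments.append((t, in_start, out_start, length))
            out_start += length
        plan.append(segments)
    return plan


def apply_plan(
    values: List[List[List[float]]], plan: List[List[Segment]]
) -> List[List[List[float]]]:
    """Copy segments row by row; values[t] is the list of batch rows of input t."""
    batch_size = len(values[0])
    outputs = []
    for segments in plan:
        rows = []
        for b in range(batch_size):
            row: List[float] = []
            for t, in_start, _out_start, length in segments:
                row.extend(values[t][b][in_start:in_start + length])
            rows.append(row)
        outputs.append(rows)
    return outputs


# Two inputs: the first holds f1 (dim 2) and f2 (dim 1), the second holds f3 (dim 2).
keys = [["f1", "f2"], ["f3"]]
lengths = [[2, 1], [2]]
groups = [["f1", "f3"], ["f2", "f1"]]  # f1 is duplicated across groups
values = [[[1.0, 2.0, 3.0]], [[4.0, 5.0]]]  # batch size 1

plan = build_permute_plan(keys, lengths, groups)
regrouped = apply_plan(values, plan)
```

In this sketch the duplicated key f1 simply contributes a segment to each group that names it, so no concatenated intermediate buffer and no special-case path is needed.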

benchmark results

  • stats sheet

| item | baseline | new function | delta perf (%) | notes |
|---|---|---|---|---|
| runtime | 5.2 ms | 2.7 ms | 48% | w/o dups |
| memory | 1.5 K | 1.0 K | 33% | w/o dups |
| runtime | 12.3 ms | 2.7 ms | 78% | w/ dups |
| memory | 1.0 K | 1.0 K | 0% | w/ dups |

  • log output
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  13.1 ms | Memory (P90): 1011.0
  permute_multi_embs                  | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 1011.0
  KeyedTensor_regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.2 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  12.3 ms | Memory (P90): 1011.0
  permute_multi_embs_dup              | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 1011.0
  KeyedTensor_regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  12.0 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  11.4 ms | Memory (P90): 1011.0
  • CPU results are interesting: the fallback path is faster than both production paths
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1020     | device: cpu      | Runtime (P90):   0.4 ms | Memory (P90):   0.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1020     | device: cpu      | Runtime (P90):   0.7 ms | Memory (P90):   0.0
  [prod] KTRegroupAsDict              | B: 1024     | F: 1020     | device: cpu      | Runtime (P90):   0.6 ms | Memory (P90):   0.0
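
The delta column in the stats sheet appears to be the relative reduction versus the baseline, (baseline - new) / baseline. That formula is an assumption inferred from the numbers, not something the summary states. A short check against the reported values, with the helper name `delta_percent` being hypothetical:

```python
def delta_percent(baseline: float, new: float) -> int:
    """Relative reduction versus the baseline, rounded to a whole percent."""
    return round((baseline - new) / baseline * 100)


# (item, notes, baseline, new) taken from the stats sheet
rows = [
    ("runtime", "w/o dups", 5.2, 2.7),
    ("memory", "w/o dups", 1.5, 1.0),
    ("runtime", "w/ dups", 12.3, 2.7),
    ("memory", "w/ dups", 1.0, 1.0),
]

deltas = [delta_percent(baseline, new) for _, _, baseline, new in rows]

for (item, notes, _, _), delta in zip(rows, deltas):
    print(f"{item:8s} {notes:9s} {delta}%")
```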

Differential Revision: D58649553

TroyGarden avatar Jun 22 '24 16:06 TroyGarden

This pull request was exported from Phabricator. Differential Revision: D58649553

facebook-github-bot avatar Jun 22 '24 16:06 facebook-github-bot

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
Latest commit 9af9fd866ed9e7dbff4b5a61ed31f555b810ae97
Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6699c1569db072000841feaf
Deploy Preview https://deploy-preview-2772--pytorch-fbgemm-docs.netlify.app

netlify[bot] avatar Jun 22 '24 16:06 netlify[bot]

This pull request was exported from Phabricator. Differential Revision: D58649553

facebook-github-bot avatar Jul 19 '24 01:07 facebook-github-bot

This pull request has been merged in pytorch/FBGEMM@9cf0429b726931cfab72b8264730bea682f32fca.

facebook-github-bot avatar Jul 19 '24 03:07 facebook-github-bot