Summary: X-link: https://github.com/pytorch/torchrec/pull/2128

context

Current production uses fbgemm.permute_pooled_embs_auto_grad for KT.regroup. It has two downsides:
a) it needs to perform a torch.cat operation, which costs memory and time
b) it only supports groupings with no duplicates; otherwise it falls back to a slower PyTorch-native implementation

The new implementation uses fbgemm.permute_multi_embedding for the same function:
a) it doesn't need torch.cat, which saves memory and time
b) it supports duplicates in the grouping without sacrificing performance
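To make the regroup semantics concrete, here is a minimal, dependency-free sketch of what a regroup computes. It is not the fbgemm kernel and not the torchrec API: `split_by_key` and `regroup_reference` are hypothetical names, and nested Python lists stand in for 2-D tensors of shape [batch, sum(lengths)]. It only illustrates that a key may appear in more than one output group (the "duplicates" case).

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # stand-in for a [batch, dim] tensor
KeyedValues = Tuple[List[str], List[int], Matrix]  # (keys, lengths, values)


def split_by_key(keys: List[str], lengths: List[int], values: Matrix) -> Dict[str, Matrix]:
    """Slice one pooled-embedding matrix into per-key column blocks."""
    out: Dict[str, Matrix] = {}
    start = 0
    for key, length in zip(keys, lengths):
        out[key] = [row[start:start + length] for row in values]
        start += length
    return out


def regroup_reference(kts: List[KeyedValues], groups: List[List[str]]) -> List[Matrix]:
    """Return one matrix per group, concatenating the requested keys column-wise."""
    by_key: Dict[str, Matrix] = {}
    for keys, lengths, values in kts:
        by_key.update(split_by_key(keys, lengths, values))

    batch_size = len(kts[0][2])
    result: List[Matrix] = []
    for group in groups:
        merged: Matrix = [[] for _ in range(batch_size)]
        for key in group:  # a key may be reused across groups
            for b, row in enumerate(by_key[key]):
                merged[b].extend(row)
        result.append(merged)
    return result


# Two inputs with batch size 2; "f1" is used in both output groups.
kt1 = (["f1", "f2"], [2, 1], [[1, 2, 3], [4, 5, 6]])
kt2 = (["f3"], [2], [[7, 8], [9, 10]])
regrouped = regroup_reference([kt1, kt2], [["f1", "f3"], ["f2", "f1"]])
```

Here `regrouped[0]` holds the columns of f1 followed by f3, and `regrouped[1]` holds f2 followed by f1 again.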

benchmark results

stats sheet

|item|baseline|new function|delta perf (%)|notes|
|---|---|---|---|---|
|runtime|5.2 ms|2.7 ms|48%|w/o dups|
|memory|1.5 K|1.0 K|33%|w/o dups|
|runtime|12.3 ms|2.7 ms|78%|w/ dups|
|memory|1.0 K|1.0 K|0%|w/ dups|

log output

  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  13.1 ms | Memory (P90): 1011.0
  permute_multi_embs                  | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 1011.0
  KeyedTensor_regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.2 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  12.3 ms | Memory (P90): 1011.0
  permute_multi_embs_dup              | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 1011.0
  KeyedTensor_regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  12.0 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):  11.4 ms | Memory (P90): 1011.0
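The delta perf (%) column in the stats sheet is consistent with a relative reduction from the baseline, i.e. (baseline - new) / baseline. The short check below reproduces the four reported percentages from the stats-sheet values; `delta_pct` is a hypothetical helper name used only for this illustration.

```python
def delta_pct(baseline: float, new: float) -> int:
    """Relative reduction from baseline, rounded to a whole percent."""
    return round((baseline - new) / baseline * 100)


# (item, notes, baseline, new) taken from the stats sheet
rows = [
    ("runtime", "w/o dups", 5.2, 2.7),
    ("memory", "w/o dups", 1.5, 1.0),
    ("runtime", "w/ dups", 12.3, 2.7),
    ("memory", "w/ dups", 1.0, 1.0),
]
deltas = [delta_pct(baseline, new) for _, _, baseline, new in rows]

for (item, notes, _, _), pct in zip(rows, deltas):
    print(f"{item:8s} {notes:9s} {pct}%")
```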

CPU results are interesting: the fallback implementation (0.4 ms) is faster than both production paths (0.7 ms and 0.6 ms)

  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1020     | device: cpu      | Runtime (P90):   0.4 ms | Memory (P90):   0.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1020     | device: cpu      | Runtime (P90):   0.7 ms | Memory (P90):   0.0
  [prod] KTRegroupAsDict              | B: 1024     | F: 1020     | device: cpu      | Runtime (P90):   0.6 ms | Memory (P90):   0.0

Differential Revision: D58649553

Jun 22 '24 16:06 TroyGarden

This pull request was exported from Phabricator. Differential Revision: D58649553

Jun 22 '24 16:06 facebook-github-bot

Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
Latest commit	9af9fd866ed9e7dbff4b5a61ed31f555b810ae97
Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6699c1569db072000841feaf
Deploy Preview	https://deploy-preview-2772--pytorch-fbgemm-docs.netlify.app

Jun 22 '24 16:06 netlify[bot]

This pull request was exported from Phabricator. Differential Revision: D58649553

Jul 19 '24 01:07 facebook-github-bot

This pull request has been merged in pytorch/FBGEMM@9cf0429b726931cfab72b8264730bea682f32fca.

Jul 19 '24 03:07 facebook-github-bot

implementation of fbgemm op - regroup_keyed_tensor
