implementation of fbgemm op - regroup_keyed_tensor
Summary: X-link: https://github.com/pytorch/torchrec/pull/2128
context
- Current production uses `fbgemm.permute_pooled_embs_auto_grad` for `KT.regroup`. It has several downsides:
  - a) it needs to perform a `torch.cat` operation, which costs memory and time
  - b) it only supports "no duplicates" in the grouping; otherwise it falls back to a slower PyTorch-native implementation
- The new implementation uses `fbgemm.permute_multi_embedding` for the same function:
  - a) it doesn't need `torch.cat`, so it saves memory and time
  - b) it supports "duplicates" in the grouping without sacrificing performance
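To make the regroup semantics concrete, here is a minimal pure-Python sketch of the idea: build a copy plan from the input keys to the output groups, then copy column slices directly, without concatenating the inputs first. The names `build_copy_plan` and `regroup`, and the tuple layout of the plan, are hypothetical and chosen for illustration only; this is not the argument format or the implementation of the fbgemm op.

```python
from typing import Dict, List, Tuple

# (src_tensor, src_offset, dst_tensor, dst_offset, length): hypothetical layout
PlanEntry = Tuple[int, int, int, int, int]


def build_copy_plan(
    keys: List[List[str]],
    lengths: List[List[int]],
    groups: List[List[str]],
) -> Tuple[List[PlanEntry], List[int]]:
    """Return the copy plan and the width of each output group."""
    # Map each key to (input tensor index, column offset, column length).
    location: Dict[str, Tuple[int, int, int]] = {}
    for t, (tensor_keys, tensor_lengths) in enumerate(zip(keys, lengths)):
        offset = 0
        for key, length in zip(tensor_keys, tensor_lengths):
            location[key] = (t, offset, length)
            offset += length

    plan: List[PlanEntry] = []
    out_widths: List[int] = []
    for g, group in enumerate(groups):
        dst_offset = 0
        for key in group:
            # A key may appear in several groups; each use adds one plan entry.
            src_tensor, src_offset, length = location[key]
            plan.append((src_tensor, src_offset, g, dst_offset, length))
            dst_offset += length
        out_widths.append(dst_offset)
    return plan, out_widths


def regroup(values, keys, lengths, groups):
    """Apply the plan; values[t][b] is row b of input tensor t (nested lists)."""
    plan, out_widths = build_copy_plan(keys, lengths, groups)
    batch = len(values[0])
    outputs = [[[0.0] * w for _ in range(batch)] for w in out_widths]
    for src_t, src_off, dst_t, dst_off, length in plan:
        for b in range(batch):
            row = values[src_t][b]
            outputs[dst_t][b][dst_off:dst_off + length] = row[src_off:src_off + length]
    return outputs


# Two inputs holding f1 (dim 2), f2 (dim 1) and f3 (dim 2); f1 is duplicated.
keys = [["f1", "f2"], ["f3"]]
lengths = [[2, 1], [2]]
groups = [["f1", "f3"], ["f2", "f1"]]
values = [[[1.0, 2.0, 3.0]], [[4.0, 5.0]]]
result = regroup(values, keys, lengths, groups)
```

In this sketch a duplicated key simply produces one more plan entry, so duplicates take the same code path as unique keys and no separate fallback is needed.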
benchmark results
- stats sheet

|item|baseline|new function|delta perf (%)|notes|
|---|---|---|---|---|
|runtime|5.2 ms|2.7 ms|48%|w/o dups|
|memory|1.5 K|1.0 K|33%|w/o dups|
|runtime|12.3 ms|2.7 ms|78%|w/ dups|
|memory|1.0 K|1.0 K|0%|w/ dups|

- log output
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 13.1 ms | Memory (P90): 1011.0
permute_multi_embs | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 1011.0
KeyedTensor_regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.2 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 12.3 ms | Memory (P90): 1011.0
permute_multi_embs_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 1011.0
KeyedTensor_regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 12.0 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 11.4 ms | Memory (P90): 1011.0
- CPU results are notable: the fallback `_regroup_keyed_tenors` (0.4 ms) is faster than both production paths (0.7 ms and 0.6 ms)
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 0.4 ms | Memory (P90): 0.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 0.7 ms | Memory (P90): 0.0
[prod] KTRegroupAsDict | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 0.6 ms | Memory (P90): 0.0
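
The delta column in the stats sheet appears to be the relative reduction against the baseline, `(baseline - new) / baseline`. The sketch below, under that assumption, parses a benchmark log line in the format shown above and recomputes the percentages. The helper names `parse_line` and `reduction_pct` are hypothetical and are not part of the benchmark script.

```python
import re

# Matches lines such as:
# "name | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.2 ms | Memory (P90): 1517.0"
# An optional "[tag] " prefix (as in the CPU results) is skipped.
LINE_RE = re.compile(
    r"^(?:\[\w+\]\s+)?(?P<name>\S+)\s+\|.*?"
    r"Runtime \(P90\):\s*(?P<runtime>[\d.]+) ms\s+\|\s+"
    r"Memory \(P90\):\s*(?P<memory>[\d.]+)"
)


def parse_line(line):
    """Return (name, runtime_ms, memory) parsed from one log line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return match["name"], float(match["runtime"]), float(match["memory"])


def reduction_pct(baseline, new):
    """Relative reduction in percent; assumed definition of the delta column."""
    if baseline == 0:
        return 0.0
    return 100.0 * (baseline - new) / baseline


baseline = parse_line(
    "KeyedTensor_regroup | B: 1024 | F: 1020 | device: cuda "
    "| Runtime (P90): 5.2 ms | Memory (P90): 1517.0"
)
new = parse_line(
    "permute_multi_embs | B: 1024 | F: 1020 | device: cuda "
    "| Runtime (P90): 2.7 ms | Memory (P90): 1011.0"
)
runtime_delta = round(reduction_pct(baseline[1], new[1]))
memory_delta = round(reduction_pct(baseline[2], new[2]))
```

With the logged values for the case without duplicates, this gives the 48% runtime and 33% memory reductions listed in the stats sheet.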
Differential Revision: D58649553
This pull request was exported from Phabricator. Differential Revision: D58649553
This pull request has been merged in pytorch/FBGEMM@9cf0429b726931cfab72b8264730bea682f32fca.