torchrec
benchmark of fbgemm op - permute_multi_embedding
Summary:
performance notes
The good:
- the algorithm is designed so that it doesn't need to know in advance whether a 1-to-N mapping exists in the permutes.
- `_all_keys_used_once` is no longer needed
- no longer need the `torch.cat` that preceded the call to the old operator
- no need to use `_pin_and_move` for the metadata (arguments); it is handled inside the operator, which is friendlier to tracing.
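To make the 1-to-N point concrete, here is a minimal pure-Python reference sketch. It is not the fbgemm kernel: the `regroup` helper, the key names, and the list-of-rows representation are illustrative assumptions. It only shows that one loop over the requested groups handles a key appearing in several output groups, without first checking whether any key is duplicated.

```python
from typing import Dict, List, Tuple

Row = List[float]


def regroup(
    inputs: List[List[Row]],   # one [batch][dim] matrix per input tensor
    keys: List[List[str]],     # key names per input tensor
    lengths: List[List[int]],  # embedding width per key
    groups: List[List[str]],   # requested output groups (keys may repeat)
) -> List[List[Row]]:
    # Record where each key lives: (input index, column offset, width).
    where: Dict[str, Tuple[int, int, int]] = {}
    for t, (ks, ls) in enumerate(zip(keys, lengths)):
        offset = 0
        for k, n in zip(ks, ls):
            where[k] = (t, offset, n)
            offset += n

    batch = len(inputs[0])
    outputs: List[List[Row]] = []
    for group in groups:
        out: List[Row] = [[] for _ in range(batch)]
        for k in group:
            t, off, n = where[k]
            # Copy the key's column range; a key used by several groups
            # is simply copied once per group.
            for b in range(batch):
                out[b].extend(inputs[t][b][off:off + n])
        outputs.append(out)
    return outputs


# Two input tensors, batch size 2; "f2" is requested by both groups (1-to-N).
inputs = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # keys f1 (width 1), f2 (width 2)
    [[7.0], [8.0]],                       # key f3 (width 1)
]
keys = [["f1", "f2"], ["f3"]]
lengths = [[1, 2], [1]]
groups = [["f2", "f3"], ["f1", "f2"]]

result = regroup(inputs, keys, lengths, groups)
```

Here `result[0]` holds the columns of `f2` followed by `f3`, and `result[1]` holds `f1` followed by `f2`; the duplicated key needs no special case.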
The same bad:
- it requires several HtoD communications (moving tensors to the device):
  a) [resolved] 3 tensors: `permutes`, `input_lengths`, and `output_lengths`. These tensors need to be on the device so that the CUDA kernels have access to them.
  b) [resolved] 2 lists of (scalar_t*) pointers, for the input and output tensor lists.
  c) [resolved] Didn't find a good way to let the kernel know the addresses of the lists of input/output tensors, because the lists also need to be on the device.
- `tensor.contiguous` is needed in the backward function; it looks like the grads passed to backward are somehow not contiguous.
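On the contiguity point, the snippet below does not diagnose where the non-contiguous grads come from. It is only a small sketch, on an arbitrary example tensor, of how a column slice of a 2D tensor is a non-contiguous view and how `.contiguous()` produces a dense copy that a kernel can index with plain row-major offsets.

```python
import torch

# A 4 x 6 tensor; slicing columns 2..4 yields a view that shares storage.
x = torch.arange(24, dtype=torch.float32).reshape(4, 6)
view = x[:, 2:5]

# The view keeps the parent's row stride (6), so it is not contiguous.
was_contiguous = view.is_contiguous()

# .contiguous() copies into dense row-major storage when needed
# (it returns the tensor itself if it is already contiguous).
dense = view.contiguous()
```

Calling `.contiguous()` unconditionally is cheap when the grad is already dense, since no copy is made in that case.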
benchmark
- op-level results
INFO:root:size: 1024 x 57168; permute_multi_embedding: 1.5612200498580933 ms; permute_pooled_embs_auto_grad: 0.9015970826148987 ms
INFO:root:size: 1024 x 134096; permute_multi_embedding: 3.0794131755828857 ms; permute_pooled_embs_auto_grad: 2.114053726196289 ms
INFO:root:size: 1024 x 136752; permute_multi_embedding: 2.6919198036193848 ms; permute_pooled_embs_auto_grad: 2.159184455871582 ms
INFO:root:size: 1024 x 260944; permute_multi_embedding: 4.805435180664063 ms; permute_pooled_embs_auto_grad: 4.098493576049805 ms
INFO:root:size: 1024 x 538432; permute_multi_embedding: 9.359790802001953 ms; permute_pooled_embs_auto_grad: 8.504887580871582 ms
INFO:root:size: 1024 x 536592; permute_multi_embedding: 9.375926017761232 ms; permute_pooled_embs_auto_grad: 8.459586143493652 ms
- fn-level results
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1011.0
KeyedTensor.regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.0 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
permute_multi_embs | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 1011.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KeyedTensor.regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
permute_multi_embs_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 3.2 ms | Memory (P90): 1011.0
traces
[[email protected] /data/sandcastle/boxes/fbsource (ae677c240)]$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062993 Jun 21 23:26 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy 949610 Jun 21 23:26 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140143 Jun 21 23:26 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy 350370 Jun 21 23:26 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 581033 Jun 21 23:26 trace-permute_multi_embs_dup.json
-rw-rw-r-- 1 hhy hhy 582607 Jun 21 23:26 trace-permute_multi_embs.json
-rw-rw-r-- 1 hhy hhy 8025337 Jun 21 23:26 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041586 Jun 21 23:26 trace-_regroup_keyed_tenors.json
Differential Revision: D58906839
This pull request was exported from Phabricator. Differential Revision: D58906839