Speedup eager execution
Here I want to collect some things to be done to speed up eager-mode execution. Most of this did not really matter in graph-mode execution, because those extra things are only executed once there. These are extra checks, or just slightly inefficient handling.
It is mostly about the RF with PyTorch backend, but also potentially about other eager-mode backends (e.g. TF eager mode), or just about faster usage of our `Tensor` and `Dim` classes.
- [x] f09222e89cfe3cec4bb96dcf6d7aa289840ecb13: Tensor/Dim, some fast path optimizations: Dim `__eq__`, `__ne__`, Tensor `get_axes_from_description`, `get_axis_from_description`
- [x] fa9818c, #1396: Dim `__eq__`
- [x] 01d0653, #1399: Dim `__eq__`
- [x] dc14a2c, #1400: Dim `__hash__`
- [x] 361e238, #1401: Tensor `raw_tensor` assign
- [x] abff2a8 (49b69ed): Tensor `copy` more efficient via better `get_kwargs`
- [x] 07078b9: Tensor `copy`, directly assign `_raw_tensor`
- [x] 2e104f5: Tensor avoid `dims_set`, prefer `dims`, if possible
- [x] Tensor `__add__` etc: avoid `_rf()`? more direct `rf.combine` call, or even directly backend `combine`? (#1403)
  - [x] Even directly implement this in C++, template for all variants, make sure that the common case (eager framework) does not need to get back to Python at all -> #1403
  - [x] All involved logic like `convert_to_tensor`, `bin_op_out_template`, `combine_raw`, etc. is inlined in C++ -> #1403
  - [x] `convert_to_tensor` also can have special code to allow for scalars and just keep them that way. (#1403)
  - [x] `bin_op_out_template` can use some simpler variant of `copy_compatible_to` which returns only the raw tensor, not a `Tensor`: `copy_compatible_to_raw` (#1403)
  - [x] `rf.combine`: avoid `import` (obsolete with #1403)
  - [x] RF backend `combine`: faster `bin_op_out_template`, opt case of scalars (#1403)
  - [x] RF PT `combine_raw` (obsolete with #1403)
  - [x] Tensor `_raw_backend` (#1403)
- [x] `Tensor.copy_compatible_to`, `Tensor.copy_compatible_to_raw`, `Tensor.copy_compatible_to_dims`, mostly reusing the same logic as in `combine` and `compare` if possible, but then also falling back to the generic code (#1404)
  - [x] `copy_compatible_to` (actually obsolete, we would mostly use `copy_compatible_to_dims` instead, #1404)
  - [x] `copy_compatible_to_raw` as an alternative to `copy_compatible_to`, common use case, avoids creating broadcast `Dim` objects (obsolete via `copy_compatible_to_dims_raw`, #1404)
  - [x] `copy_compatible_to_dims` and `copy_compatible_to_dims_raw` as other simpler alternatives to `copy_compatible_to`, just accepting `dims` for the target (#1404)
- [x] `Tensor.copy_compatible_to_dims` and `Tensor.copy_compatible_to_dims_raw` native (#1409)
- [x] Dim `get_dim_value_tensor`: we should cache the value, currently uses `reduce_max` on every call. See Dim `reset_eager` logic. (#1414)
- [ ] `Tensor.copy_transpose` native (partly #1413)
- [ ] `Tensor.__init__` native
- [ ] Dim `cache_seq_mask` not part of `_extra`?
- [ ] Dim `_extra` used often because of broadcast dummy dims via `copy_compatible_to`. Can we avoid this? It's mostly because of `auto_generated`. (Also `derived_from_op`, probably others, not so rare.) We might avoid it when checking for kwarg `batch=None`. What is the behavior of `auto_generated` in RF? Is `auto_generated` actually correct for `copy_add_dim_by_tag`?
- [ ] `Dim.__init__` native
- [x] Dim equality (`__eq__`, `__hash__`) without `derived_from_op` logic (#1418)
- [ ] `Dim.__eq__` native
- [ ] `Dim.__hash__` native
- [ ] `Dim.get_same_base` native
- [ ] Torch backend `matmul`
- [ ] Torch backend `reduce`
- [x] Torch backend `get_dtype_name_raw`: Just the Torch dtype `repr` call takes time by creating the string object. Inside our C++ code, we could speed this up by having just a lookup table for common dtypes. (#1433)
- [x] `Dim.copy_from`: We can share the caches `cache_dyn_size_ext_dev` and `cache_seq_mask`. For self-attention, when we create a new copy of the dim, we would otherwise recompute the mask again and again. Note that this is a bit tricky for the cache though, as the dim tags do not match. (#1417)
- [x] Dim `cache_seq_mask` could maybe also have the `dim_order` in the key, because otherwise this might still require a copy_transpose for every usage. (#1417)
- [ ] Dim math (`__add__`, `__mul__` etc.) is not really so rare, thus optimizations:
  - [x] Dim math cache (#1416)
  - [ ] Optimize code of dim math
  - [ ] Native implementations
  - [ ] Move from `_extra` to main object
  - [x] Do not use dim math in the RF? It heavily complicates things, and what is the benefit? Slightly simpler equality checks in some rare cases? On the other side, we commonly need to calculate dyn seq sizes in one way or another in any case. What is actually complicated? The simplification logic maybe only? The equality check? Edit: The equality check is now removed (#1418), and otherwise we anyway need some similar functionality, so I think we can leave it like this for now.
- [ ] `Linear`: fused op matmul + bias. In Torch, that is `torch.nn.functional.linear` (see the sketch after this list)
- [ ] PyTorch provides header files inside the Python package (in `include`), so directly accessing the underlying `THPVariable` or `Tensor` might be an option, or maybe using the C interface. But I'm not sure if this is worth it.
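To illustrate the fused `Linear` point from the list above: in plain PyTorch, `torch.nn.functional.linear` does the matmul and the bias add in one call instead of two separate ops. A minimal sketch (plain PyTorch, not the RF `Linear` module itself; shapes are just illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 100, 512)   # (batch, time, in_features), illustrative shapes
w = torch.randn(1024, 512)     # (out_features, in_features)
b = torch.randn(1024)

y_separate = x @ w.T + b       # two ops: matmul, then broadcasted bias add
y_fused = F.linear(x, w, b)    # one fused op, computes x @ w.T + b

assert torch.allclose(y_separate, y_fused, atol=1e-4)
```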
For profiling, we can use many tools, but it's not trivial.
Tools:
- Python generic: py-spy, cProfile, etc. Also see here.
- PyTorch profiler, with lots of different settings (see the sketch below for a minimal setup)
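As a starting point for the PyTorch profiler option above, a minimal sketch (plain PyTorch, not tied to RETURNN; the toy computation just stands in for a training step):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(8, 100, 512)
w = torch.randn(512, 512)

# Profile a few iterations of a toy computation.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        y = torch.relu(x @ w)

# Sort by self CPU time to see where the overhead actually is.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))

# For sampling the Python side (including native frames), py-spy also works, e.g.:
#   py-spy record --native -o profile.svg -- python rnn.py my_config.py
```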
Conceptually, we want to find multiple things:
- Are there any inefficient ops, unwanted GPU->CPU transfers, etc?
- What is the overhead due to our Python code, due to RF? Where do we have inefficient code here?
To better measure just RF code, we also have some other options:
- Use the "meta" device, which does not perform any actual computation. Then we only see RF code in the profiler, and see how much time we spend per step due to that. This is a bit tricky though, because not all ops are supported, or some code might not expect this. Edit: It was easier than expected; there were almost no unsupported ops. However, the assumption that ops on the meta device are free turned out to be wrong: in the profiler, I still see quite some percentage in those. It seems PyTorch does some Python logic internally for them. (See the sketch below.)
- Use very low dimensions and CPU only, so that the ops are so fast that their time is not relevant.
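A minimal sketch of the meta-device idea (plain PyTorch, not RETURNN-specific): tensors on the "meta" device carry only shape and dtype, no data, so mostly the Python/dispatch overhead remains.

```python
import torch

x = torch.empty(8, 100, 512, device="meta")
w = torch.empty(512, 512, device="meta")

y = x @ w  # only shape/dtype propagation, no actual matmul is computed
assert y.shape == (8, 100, 512)
assert y.device.type == "meta"

# Anything that needs real data fails, e.g. y.cpu() or float(y[0, 0, 0]),
# which is why some code might not expect this device.
```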
Orthogonal:
PyTorch supports multiple ways to compile the code. So we could maybe use eager mode for debugging and/or the first few steps, and then compile a graph, just like graph mode in TF.
Options:
- TorchScript via scripting. Only a very limited subset of Python is supported, so it is basically not an option for RF (it does not support our `Tensor` class), except maybe for some very specific, carefully selected functions. (#1436)
- TorchScript via tracing. Everything is supported except control flow (cond, loop). Control flow could maybe be done via scripting? (#1436)
- `torch.compile` (see the sketch below)
- I think there are more...
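A minimal sketch of the tracing and `torch.compile` options (plain PyTorch; the function here is just a stand-in for a model step):

```python
import torch

def step(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.relu(x @ w)

x = torch.randn(8, 16)
w = torch.randn(16, 32)

# TorchScript via tracing: records the ops executed for the example inputs;
# Python control flow is baked in, not captured.
traced = torch.jit.trace(step, (x, w))

# torch.compile (PyTorch >= 2.0): captures graphs via TorchDynamo and compiles
# them, falling back to eager execution for unsupported parts.
compiled = torch.compile(step)

assert torch.allclose(traced(x, w), compiled(x, w))
```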
Also see this overview for PyTorch in general: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations
After all the recent optimizations, now looking at the profiling (demo-rf-pt-benchmark.py, using py-spy with --native, on GPU), I don't see any obvious low-level bottleneck anymore.
Profiler output, sorted by total time and by self-time (screenshots not included here).
The open points above are still possible optimizations we could do, but looking at these profiling stats, I think they would only give a very minor speedup.
Overall, some things to notice:
- I get about 85% computing time, i.e. the dataset is slow here, taking 15% of the time. But this might just be the toy dataset being used here.
- From the lower-level RF functions, there is (in this order):
  - 24%: `compareOrCombineViaCached` (all bin ops like add, mul, etc., also comparisons like eq, etc.): It's unexpected to me that this takes by far the most time, twice as much as `matmul`.
  - 12%: `matmul`: Most of it is via `Linear`, but then also a bit for attention. I would have expected that this takes the most time.
  - 8.5%: `reduce`: mostly for LayerNorm/BatchNorm. You also see that the masking takes quite some time.
- Looking more at the module level, what takes quite a bit of time is dropout and layernorm/batchnorm.
- We can probably improve more by using some of the fused functional ops from PyTorch, e.g. `linear`, `layer_norm`, etc. (also check batch norm and dropout). That would reduce the bin ops quite a bit. (See the sketch below.)
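To make the last point concrete, a minimal sketch (plain PyTorch) of how a fused op replaces several separate reduce/binary ops, using layer norm as the example:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 100, 512)
gamma = torch.ones(512)
beta = torch.zeros(512)
eps = 1e-6

# Decomposed variant: several separate ops (reduce, sub, div, mul, add),
# each with its own Python/dispatch overhead.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
y_decomposed = (x - mean) / torch.sqrt(var + eps) * gamma + beta

# Fused variant: a single call.
y_fused = F.layer_norm(x, normalized_shape=(512,), weight=gamma, bias=beta, eps=eps)

assert torch.allclose(y_decomposed, y_fused, atol=1e-4)
```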
Now for Librispeech, one subepoch (1/20 epoch) takes 16:40 min on an Nvidia 1080 GPU with a Conformer (setup/config), which is about the same as we see with TensorFlow for exactly the same model, as reported by @mmz33. (Same batch size, but RF PT is still missing CTC.)
The computation time is 99% now, after #1383 was fixed and is used (num_workers=1).
Still, the GPU utilization is only about 60% on average (with high fluctuations, between 25-100%). We discussed that this is maybe not due to RF/PT anymore but just an inherent property of this model, maybe the AED part, maybe the LSTM-based decoder. E.g. with batch size 1, you probably would have no bottleneck due to RF/PT, and the GPU would be maximally used from the PT side, but you would still probably have only a small GPU utilization, as many cores would idle most of the time. A few ops (LSTM) also cannot be parallelized at all then.
So maybe we already did most of the work here on the RF optimization side, and it can be considered done. Or even if we want to optimize this more, maybe scripting/tracing (#1436) makes more sense at this point.
There are a few other orthogonal optimizations, which I might try next, as listed here: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations