Speedup eager execution
Here I want to collect some things to be done to speed up eager-mode execution. Most of it did not really matter in graph-mode execution, because those extra things are only executed once there. These are extra checks, or just slightly inefficient handling.
It is mostly about the RF with PyTorch backend, but also about potentially other eager-mode backends (e.g. TF eager-mode), or also just about faster usage of our `Tensor` and `Dim` classes.
- [x] f09222e89cfe3cec4bb96dcf6d7aa289840ecb13: Tensor/Dim, some fast path optimizations: Dim `__eq__`, `__ne__`, Tensor `get_axes_from_description`, `get_axis_from_description`
- [x] fa9818c, #1396: Dim `__eq__`
- [x] 01d0653, #1399: Dim `__eq__`
- [x] dc14a2c, #1400: Dim `__hash__`
- [x] 361e238, #1401: Tensor `raw_tensor` assign
- [x] abff2a8 (49b69ed): Tensor `copy` more efficient via better `get_kwargs`
- [x] 07078b9: Tensor `copy`, directly assign `_raw_tensor`
- [x] 2e104f5: Tensor avoid `dims_set`, prefer `dims`, if possible
- [x] Tensor `__add__` etc: avoid `_rf()`? more direct `rf.combine` call, or even directly backend `combine`? (#1403)
  - [x] Even directly implement this in C++, template for all variants, make sure that the common case (eager framework) does not need to get back to Python at all -> #1403
  - [x] All involved logic like `convert_to_tensor`, `bin_op_out_template`, `combine_raw`, etc. is inline in C++ -> #1403
  - [x] `convert_to_tensor` also can have special code to allow for scalars and just keep them that way. (#1403)
  - [x] `bin_op_out_template` can use some simpler variant of `copy_compatible_to` which returns only the raw tensor, not a `Tensor`: `copy_compatible_to_raw` (#1403)
- [x] `rf.combine`: avoid `import` (obsolete with #1403)
- [x] RF backend `combine`: faster `bin_op_out_template`, opt case of scalars (#1403)
- [x] RF PT `combine_raw` (obsolete with #1403)
- [x] Tensor `_raw_backend` (#1403)
- [x] `Tensor.copy_compatible_to`, `Tensor.copy_compatible_to_raw`, `Tensor.copy_compatible_to_dims`, mostly reusing the same logic as in `combine` and `compare` if possible, but then also falling back to the generic code (#1404)
  - [x] `copy_compatible_to` (actually obsolete, we would mostly use `copy_compatible_to_dims` instead, #1404)
  - [x] `copy_compatible_to_raw` as an alternative to `copy_compatible_to`, common use case, avoids creating broadcast `Dim` objects (obsolete via `copy_compatible_to_dims_raw`, #1404)
  - [x] `copy_compatible_to_dims` and `copy_compatible_to_dims_raw` as other simpler alternatives to `copy_compatible_to`, just accepting `dims` for the target (#1404)
- [x] `Tensor.copy_compatible_to_dims` and `Tensor.copy_compatible_to_dims_raw` native (#1409)
- [x] Dim `get_dim_value_tensor`: we should cache the value, currently uses `reduce_max` every call. See Dim `reset_eager` logic. (#1414)
- [ ] `Tensor.copy_transpose` native (partly #1413)
- [ ] `Tensor.__init__` native
- [ ] Dim `cache_seq_mask` not part of `_extra`?
- [ ] Dim `_extra` used often because of broadcast dummy dims via `copy_compatible_to`. Can we avoid this? It's mostly because of `auto_generated`. (Also `derived_from_op`, probably others, not so rare.) We might avoid it when checking for kwarg `batch=None`. What is the behavior of `auto_generated` in RF? Is `auto_generated` actually correct for `copy_add_dim_by_tag`?
- [ ] `Dim.__init__` native
- [x] `Dim` equality (`__eq__`, `__hash__`) without `derived_from_op` logic (#1418)
- [ ] `Dim.__eq__` native
- [ ] `Dim.__hash__` native
- [ ] `Dim.get_same_base` native
- [ ] Torch backend `matmul`
- [ ] Torch backend `reduce`
- [x] Torch backend `get_dtype_name_raw`: Just the Torch dtype `repr` call takes time by creating the string object. Inside our C++ code, we could speed this up by having just a lookup table for common dtypes. (#1433)
- [x] `Dim.copy_from`: We can share the caches `cache_dyn_size_ext_dev` and `cache_seq_mask`. For self-attention, when we create a new copy of the dim, we would otherwise recompute the mask again and again. Note that this is a bit tricky for the cache though, as the dim tags do not match. (#1417)
- [x] Dim `cache_seq_mask` could maybe also have the `dim_order` in the key, because this might otherwise still require `copy_transpose` for every usage. (#1417)
- [ ] Dim math (`__add__`, `__mul__` etc) is not really so rare, thus optimizations:
  - [x] Dim math cache (#1416)
  - [ ] Optimize code of dim math
  - [ ] Native implementations
  - [ ] Move from `_extra` to main object
  - [x] Do not use dim math in the RF? It heavily complicates things, and what is the benefit? Slightly simpler equality checks in some rare cases? On the other hand, we commonly need to calculate dyn seq sizes in one way or another in any case. What is actually complicated? The simplification logic maybe only? The equality check? Edit: The equality check is now removed (#1418), and otherwise we anyway need some similar functionality, so I think we can leave it like this for now.
- [ ] `Linear`: fused op matmul + bias. In Torch, that is `torch.nn.functional.linear`. (See the sketch right after this list.)
- [ ] PyTorch provides header files inside the Python package (in `include`), so directly accessing the underlying `THPVariable` or `Tensor` might be an option, or maybe using the C interface. But I'm not sure if this is worth it.
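A minimal sketch of the fused matmul + bias call mentioned in the `Linear` item above (plain PyTorch with toy shapes, not the RF backend code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 100, 512)   # (batch, time, in_features)
w = torch.randn(2048, 512)     # (out_features, in_features)
b = torch.randn(2048)

# One fused op for matmul + bias add, instead of a separate `x @ w.T` and `+ b`.
y = F.linear(x, w, b)
assert y.shape == (8, 100, 2048)
```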
For profiling, we can use many tools, but it's not trivial.
Tools:
- Python generic: py-spy, cProfile, etc.
- PyTorch profiler, lots of different settings
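For illustration, a minimal standalone sketch of the PyTorch profiler (toy ops, not tied to RETURNN); sorting by self-CPU time is one way to surface the Python/RF overhead rather than the child ops:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def step():
    x = torch.randn(64, 512)
    w = torch.randn(512, 512)
    return torch.relu(x @ w)

# Add ProfilerActivity.CUDA to also record GPU kernels (if a GPU is available).
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        step()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```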
Conceptually, we want to find multiple things:
- Are there any inefficient ops, unwanted GPU->CPU transfers, etc?
- What is the overhead due to our Python code, due to RF? Where do we have inefficient code here?
To better measure just RF code, we also have some other options:
- Use the "meta" device, which does not perform any computation. Then we only see RF code in the profiler, and we see how much time we spend per step due to that. This is a bit tricky though, because not all ops are supported, or some code might not expect this. Edit: OK, it was easier than expected; there were almost no unsupported ops. However, the assumption was wrong that ops on the meta device would be for free. In the profiler, I see quite some percentage also in those. It seems it does some Python logic internally.
- Use very low dimensions and CPU only, so that the ops are so fast that this time is not relevant.
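A minimal standalone sketch of the "meta" device idea (toy shapes; as noted above, such ops are not entirely free, since PyTorch still runs some Python logic for them):

```python
import torch

# Tensors on the "meta" device only carry shape/dtype metadata; no real
# kernels are executed, so mostly the surrounding Python overhead remains.
x = torch.empty(8, 100, 512, device="meta")
w = torch.empty(512, 512, device="meta")
y = x @ w                  # no actual matmul is computed
print(y.shape, y.device)   # torch.Size([8, 100, 512]) meta
```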
Orthogonal:
PyTorch supports multiple ways to compile the code. So we could maybe use eager mode for debugging and/or the first few steps, and then compile a graph, just like graph mode in TF.
Options:
- TorchScript via scripting. Only a very limited subset of Python is supported, so this is basically not really an option for RF (it does not support our `Tensor` class), except maybe for some very specific, carefully selected functions. (#1436)
- TorchScript via tracing. Everything is supported except control flow (cond, loop). Control flow could maybe be via scripting? (#1436)
- `torch.compile`
- I think there are more...
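A minimal standalone sketch of the last two options (plain PyTorch on a control-flow-free toy function, independent of RF):

```python
import torch

def step(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.relu(x @ w)

x = torch.randn(8, 16)
w = torch.randn(16, 16)

# TorchScript via tracing: records the executed ops, no control flow captured.
traced = torch.jit.trace(step, (x, w))
# torch.compile (PyTorch >= 2.0): graph capture via TorchDynamo.
compiled = torch.compile(step)

assert torch.allclose(traced(x, w), compiled(x, w))
```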
Also see this overview for PyTorch in general: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations
After all the recent optimizations, now looking at the profiling (`demo-rf-pt-benchmark.py`, using py-spy with `--native`, with GPU), I don't see any obvious low-level bottleneck anymore.
Sorted by total time:
Sorted by self-time:
The open points above are still possible optimizations we could do, but looking at these profiling stats, I think they will only give a very minor speedup.
Overall:
Some things to notice:
- I have about 85% computing time, i.e. the dataset is slow here, taking 15% of the time. But this might be just the toy dataset being used here.
- From the lower-level RF functions, there is (in this order):
  - 24%: `compareOrCombineViaCached` (all bin ops like add, mul, etc., also comparisons like eq, etc.): It's unexpected to me that this takes by far the most time, twice as much as `matmul`.
  - 12%: `matmul`: Most of it is via Linear, but then also a bit for attention. I would have expected that this would take the most time.
  - 8.5%: `reduce`: mostly for LayerNorm/BatchNorm. You also see that the masking takes quite some time.
- Looking more at the module level, what takes quite a bit of time is dropout and layernorm/batchnorm.
- We probably can improve more by using some of the fused functional ops from PyTorch, e.g. `linear`, `layer_norm`, etc. (also check batch norm, dropout). That would reduce the bin ops quite a bit. (See the sketch after this list.)
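As a rough sketch of what such fused functional ops look like on the plain PyTorch side (toy shapes, standalone code, not the actual RF backend implementation):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 100, 512)
gamma = torch.ones(512)
beta = torch.zeros(512)

# One fused call instead of separate mean/var reduces, subtract, divide, scale and shift.
y = F.layer_norm(x, normalized_shape=(512,), weight=gamma, bias=beta, eps=1e-6)

# Dropout likewise as a single functional op.
y = F.dropout(y, p=0.1, training=True)
```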
Now for Librispeech, one subepoch (1/20 epoch) takes 16:40min on an Nvidia 1080 GPU with a Conformer (setup/config), which is about the same as we see with TensorFlow for exactly the same model, as reported by @mmz33. (Same batch size, but RF PT is still missing CTC.)
The computation time is 99% now, after #1383 was fixed and is used (`num_workers=1`).
Still, the GPU utilization is only about 60% on average (high fluctuations, between 25% and 100%). We discussed that this is maybe no longer due to RF/PT but just an inherent property of this model, maybe the AED part, maybe the LSTM-based decoder. E.g. with batch size 1, you probably would have no bottleneck due to RF/PT, and the GPU would be maximally used from the PT side, but you would still probably have only a small GPU utilization, as many cores would idle most of the time. A few ops (LSTM) also cannot be parallelized at all then.
So maybe we already did most of the work here on the RF optimization side, and it can be considered done. Or even if we want to optimize this more, maybe scripting/tracing (#1436) makes more sense at this point.
There are a few other orthogonal optimizations, which I might try next, as listed here: https://github.com/rwth-i6/returnn/wiki/PyTorch-optimizations