oneflow
Speed up the training
As mentioned in https://github.com/Oneflow-Inc/OneTeam/issues/1735, some operators may need to run as late as possible because they have a large activation time on the CPU. This feature moves those operators later in the execution order, which reduces CUDA idle time by about 40% (from 11.5 ms to 7 ms per iteration).
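The description does not say how the reordering is implemented. As an illustration only, the Python sketch below assumes the computation is a dependency graph (a dict mapping each op to the ops it depends on) and defers a set of flagged ops with a greedy topological sort: a flagged op is emitted only when no unflagged op is ready. The function `schedule_late` and all op names are hypothetical and are not OneFlow's API.

```python
import heapq
from typing import Dict, List, Set


def schedule_late(deps: Dict[str, List[str]], late: Set[str]) -> List[str]:
    """Hypothetical helper: return a topological order of `deps` in which
    ops listed in `late` are deferred until no other op is ready.
    Every dependency must itself be a key of `deps`."""
    # Count unmet dependencies and record which ops consume each op.
    indegree = {op: len(ds) for op, ds in deps.items()}
    users: Dict[str, List[str]] = {op: [] for op in deps}
    for op, ds in deps.items():
        for d in ds:
            users[d].append(op)

    # Original insertion order breaks ties, so the result is deterministic.
    rank = {op: i for i, op in enumerate(deps)}

    # Heap key (is_late, rank): False sorts before True, so unflagged ops pop first.
    ready = [(op in late, rank[op], op) for op, n in indegree.items() if n == 0]
    heapq.heapify(ready)

    order: List[str] = []
    while ready:
        _, _, op = heapq.heappop(ready)
        order.append(op)
        for u in users[op]:
            indegree[u] -= 1
            if indegree[u] == 0:
                heapq.heappush(ready, (u in late, rank[u], u))

    if len(order) != len(deps):
        raise ValueError("dependency graph has a cycle")
    return order


deps = {
    "load": [],
    "slow_cpu_op": ["load"],  # hypothetical op with a large CPU activation time
    "matmul": ["load"],
    "relu": ["matmul"],
    "loss": ["relu", "slow_cpu_op"],
}
print(schedule_late(deps, set()))            # ['load', 'slow_cpu_op', 'matmul', 'relu', 'loss']
print(schedule_late(deps, {"slow_cpu_op"}))  # ['load', 'matmul', 'relu', 'slow_cpu_op', 'loss']
```

In this toy graph, flagging `slow_cpu_op` moves it from second to fourth position, so `matmul` and `relu` are issued first. The greedy rule is a simple heuristic; a real scheduler would also weigh memory lifetime and stream assignment.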
Currently, there is no obvious speedup in overall training time.