one-yolov5
Why classification model training/testing is a few seconds slower each time than PyTorch: cause & reproducible code
Preface
py-spy analysis
Reproducible code
Upcoming plans
Preface
While working on locating the C++ code behind PyTorch's Python APIs (https://github.com/Oneflow-Inc/OneTeam/issues/147), I tried py-spy, a performance-profiling tool recommended by the PyTorch website. For PR https://github.com/Oneflow-Inc/one-yolov5/pull/111 (classification model training/testing is a few seconds slower than PyTorch each time), it pinpointed the slowdown to this line:

tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses
Profiling with `py-spy`
Evaluating the performance impact of code changes in PyTorch can be complicated, particularly if code changes happen in compiled code. One simple way to profile both Python and C++ code in PyTorch is to use `py-spy`, a sampling profiler for Python that has the ability to profile native code and Python code in the same session.

`py-spy` can be installed via pip:

pip install py-spy

To use `py-spy`, first write a Python test script that exercises the functionality you would like to profile. For example, this script profiles `torch.add`:

import torch

t1 = torch.tensor([[1, 1], [1, 1.]])
t2 = torch.tensor([[0, 0], [0, 0.]])

for _ in range(1000000):
    torch.add(t1, t2)

Since the `torch.add` operation happens in microseconds, we repeat it a large number of times to get good statistics. The most straightforward way to use `py-spy` with such a script is to generate a flame graph:

py-spy record -o profile.svg --native -- python test_tensor_tensor_add.py

This will output a file named profile.svg containing a flame graph you can view in a web browser or SVG viewer. Individual stack frame entries in the graph can be selected interactively with your mouse to zoom in on a particular part of the program execution timeline. The `--native` command-line option tells `py-spy` to record stack frame entries for PyTorch C++ code. To get line numbers for C++ code it may be necessary to compile PyTorch in debug mode by prepending DEBUG=1 to your `setup.py develop` call. Depending on your operating system it may also be necessary to run `py-spy` with root privileges.

`py-spy` can also work in an htop-like "live profiling" mode and can be tweaked to adjust the stack sampling rate; see the py-spy readme for more details.
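As a quick sketch of the live mode mentioned above (assuming the same test script as earlier; `--rate` is py-spy's sampling-frequency option, in samples per second):

```shell
# htop-like live view of the hottest stacks; --native includes C++ frames,
# --rate raises the sampling frequency from the default 100 samples/second
py-spy top --native --rate 200 -- python test_tensor_tensor_add.py
```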
py-spy analysis
In a flame graph, the y-axis shows the function call stack and the x-axis shows execution time, so the wider a function is along the x-axis, the longer it runs and the more likely it is a performance bottleneck.

The two flame graphs below show that the line tloss = (tloss * i + loss.item()) / (i + 1)  # update mean losses does affect performance:

- PyTorch backend: the line is so narrow you need a magnifier to spot it.
- OneFlow backend: the line stands out clearly.
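To see why this one line matters, note that it is an incremental running-mean update, and that `loss.item()` inside it forces a device-to-host copy (and hence a synchronization) on every iteration. A minimal pure-Python sketch of the update itself (no framework involved; `running_mean` is a hypothetical helper name):

```python
# The training loop updates the mean loss incrementally:
#   tloss = (tloss * i + loss.item()) / (i + 1)
# This is just a running mean; the expensive part is loss.item(),
# which blocks on the device every iteration.

def running_mean(values):
    tloss = 0.0
    for i, v in enumerate(values):
        tloss = (tloss * i + v) / (i + 1)  # same update as in the loop
    return tloss

print(running_mean([1.0, 2.0, 3.0, 4.0]))  # → 2.5, i.e. sum/len
```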
Reproducible code
- Machine: oneflow27-root
- OneFlow build compiled on 2023-03-09
- flow.__version__='0.9.1+cu117.git.a4b7145d01', elapsed 0.7273483276367188 s
- torch.__version__='1.13.0+cu117', elapsed 0.11882472038269043 s
The code below defines a timing Profile class and two functions, test_torch and test_oneflow:
import time

LENGTH = 148 * 100

class Profile():
    # YOLOv5 Profile class. Usage: @Profile() decorator or 'with Profile():' context manager
    def __init__(self, v):
        self.v = v

    def __enter__(self):
        self.start = self.time()
        return self

    def __exit__(self, type, value, traceback):
        self.dt = self.time() - self.start  # delta-time
        print(f'{self.v} elapsed {self.dt}')

    def time(self):
        return time.time()

def test_oneflow():
    import oneflow as flow
    dt = Profile(f'{flow.__version__=}')
    x = flow.Tensor([1.34]).cuda()
    tloss = 0.0
    with dt:
        for i in range(LENGTH):
            tloss = (tloss * i + x.item()) / (i + 1)

def test_torch():
    import torch
    dt = Profile(f'{torch.__version__=}')
    x = torch.Tensor([1.34]).cuda()
    tloss = 0.0
    with dt:
        for i in range(LENGTH):
            tloss = (tloss * i + x.item()) / (i + 1)

if __name__ == '__main__':
    test_oneflow()
    test_torch()
Output

flow.__version__='0.9.1+cu117.git.a4b7145d01' elapsed 0.7273483276367188
torch.__version__='1.13.0+cu117' elapsed 0.11882472038269043
Upcoming plans
- Learn how to locate operator implementations in PyTorch
- Get started with the nsys profiling tool