V2 Performance Signal Detected by TorchBench CI on '1.13.0.dev20220811+cu113'
TorchBench CI has detected a performance signal.
Base PyTorch version: 1.13.0.dev20220810+cu113
Base PyTorch commit: e1007950484aa1df4a2f87c9c14b514ffd7736a5
Affected PyTorch version: 1.13.0.dev20220811+cu113
Affected PyTorch commit: 3aeb5e4ff9d56ecd680401cfa3f23e97a279efbe
Affected Tests:
- test_train[hf_BigBird-cpu-eager]: +11.08978%
- test_eval[mnasnet1_0-cpu-jit]: -19.72266%
- test_train[timm_regnet-cpu-eager]: -7.86692%
- test_train[timm_resnest-cpu-eager]: +7.16990%
cc @xuzhao9
Workflow: https://github.com/pytorch/benchmark/actions/runs/2841567257
Result:
{
  "start": "e1007950484aa1df4a2f87c9c14b514ffd7736a5",
  "end": "3aeb5e4ff9d56ecd680401cfa3f23e97a279efbe",
  "threshold": 7,
  "timeout": 120,
  "torchbench_branch": "v2.0",
  "result": [
    {
      "commit1": "017ecb782d2",
      "commit1_time": "2022-08-10 21:50:13 +0000",
      "commit1_digest": {
        "test_train[hf_BigBird-cpu-eager]": 14.196933696744964
      },
      "commit2": "4e9b969baa6",
      "commit2_time": "2022-08-10 21:58:00 +0000",
      "commit2_digest": {
        "test_train[hf_BigBird-cpu-eager]": 15.590026811161078
      }
    }
  ]
}
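For reference, a rough sketch (not the actual TorchBench CI code) of how a relative delta can be computed from the bisection digests above; the headline percentages in the report come from the nightly comparison, so they will not match this number exactly:

# Sketch only, not the CI implementation.
def percent_change(before_s: float, after_s: float) -> float:
    """Relative latency change in percent; positive means the test got slower."""
    return (after_s - before_s) / before_s * 100.0

before = 14.196933696744964  # commit1_digest, test_train[hf_BigBird-cpu-eager]
after = 15.590026811161078   # commit2_digest, same test

print(f"test_train[hf_BigBird-cpu-eager]: {percent_change(before, after):+.2f}%")
# roughly +9.8%, above the 7% threshold in the config above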
@malfet @mingfeima Looks like 4e9b969baa6 slows down the hf_BigBird test by 10%.
Sounds good, let's revert
Reverted. Let's continue the discussion here about how this might have happened, as a pretty thorough performance analysis was done in https://github.com/pytorch/pytorch/pull/81849
Has torchbench detected any cases where performance increased as a result of this change being landed?
No, it didn't. The other signal tests (such as test_eval[mnasnet1_0-cpu-jit]) are likely noise, as the bisector couldn't find a commit that reproduces the performance change.
Generally, the CPU tests are noisier than they were before (e.g., before May 2022). We should also double-check whether this commit really causes the regression (to rule out the possibility of noise).
After the revert (https://github.com/pytorch/pytorch/commit/799620178cf), we observed a performance speedup (https://github.com/pytorch/benchmark/issues/1103), and we have cross-validated in https://github.com/pytorch/benchmark/issues/1100, so I believe this is a true signal, not noise.
@xuzhao9 I will have the regression fixed. Which command should I use if I want to run a standalone test, for example hf_BigBird-cpu-eager?
@mingfeima can you please provide some feedback on what caused the regression and how it is being fixed? You should be able to run the test with something like:
$ python3 run.py -t train -m eager hf_BigBird -d cpu
@malfet, I am still working on this and I will post a detailed explanation once the issue is fixed.
@mingfeima any progress on this issue?
@xuzhao9 sorry, I haven't fixed this issue yet; I am still experimenting.
@xuzhao9 do you have profiler logs with and without https://github.com/pytorch/pytorch/commit/4e9b969baa6 on python3 run.py -t train -m eager hf_BigBird -d cpu? If you can get a profiler log on your side, it would be very helpful for debugging this issue.
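For anyone collecting such a log, here is a minimal sketch using torch.profiler; it is not the TorchBench harness itself, and run_one_train_iteration is a hypothetical stand-in for the hf_BigBird train step on your checkout:

# Sketch only, not the TorchBench harness: wrap one training iteration in
# torch.profiler to produce a table like the ones later in this thread.
from torch.profiler import profile, ProfilerActivity

def run_one_train_iteration():
    # hypothetical placeholder: call the hf_BigBird train step from your TorchBench checkout here
    pass

with profile(activities=[ProfilerActivity.CPU]) as prof:
    run_one_train_iteration()

# Sort by total CPU time so ops such as aten::native_layer_norm are easy to find.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))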
Test results on my local machine indicate that on hf_BigBird, the time spent in aten::native_layer_norm has dropped from 223.789ms to 135.340ms. But layer_norm takes a very small proportion of the model (0.24%), so the overall end-to-end time is roughly the same.
For the rest of the models (mnasnet1_0, timm_regnet, timm_resnest), the performance with and without https://github.com/pytorch/pytorch/commit/4e9b969baa6 is the same on my local machine.
Currently, https://github.com/pytorch/pytorch/commit/4e9b969baa6 only affects LayerNorm and GroupNorm (these are the only places where RowwiseMoments is used). Neither LayerNorm nor GroupNorm is used in mnasnet1_0, timm_regnet, or timm_resnest.
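To isolate the kernel in question, a minimal standalone sketch like the following can time LayerNorm forward+backward directly; the shapes are illustrative and not taken from hf_BigBird:

# Sketch only: standalone LayerNorm microbenchmark, not part of TorchBench.
import time
import torch

torch.manual_seed(0)
ln = torch.nn.LayerNorm(768)
x = torch.randn(8, 512, 768, requires_grad=True)

def step():
    # forward + backward through LayerNorm only
    ln(x).sum().backward()
    x.grad = None
    ln.zero_grad()

for _ in range(5):  # warmup
    step()

iters = 50
start = time.perf_counter()
for _ in range(iters):
    step()
print(f"LayerNorm fwd+bwd: {(time.perf_counter() - start) / iters * 1e3:.3f} ms per iteration")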
The following is the profiler log on my local machine (CLX Xeon, single socket, 20 cores), without https://github.com/pytorch/pytorch/commit/4e9b969baa6:
Running train method from hf_BigBird on cpu in eager mode with input batch size 2.
STAGE:2022-09-01 09:50:44 25435:25435 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
CPU Total Wall Time: 7392.780 milliseconds
STAGE:2022-09-01 09:52:23 25435:25435 ActivityProfilerController.cpp:300] Completed Stage: Collection
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
autograd::engine::evaluate_function: SliceBackward0 0.00% 3.370ms 24.25% 22.242s 461.415us 48204
autograd::engine::evaluate_function: AddmmBackward0 0.08% 76.393ms 23.02% 21.117s 7.317ms 2886
...
(folded many lines here)
...
autograd::engine::evaluate_function: NativeLayerNorm... 0.02% 15.435ms 0.29% 263.723ms 260.082us 1014
NativeLayerNormBackward0 0.01% 10.187ms 0.27% 248.288ms 244.860us 1014
aten::native_layer_norm_backward 0.23% 214.712ms 0.26% 238.101ms 234.814us 1014
aten::empty 0.26% 237.880ms 0.26% 237.880ms 2.598us 91579
aten::layer_norm 0.01% 6.399ms 0.25% 227.613ms 224.470us 1014
aten::native_layer_norm 0.24% 217.006ms 0.24% 223.789ms 220.699us 1014
aten::as_strided 0.21% 192.382ms 0.21% 192.382ms 0.734us 262275
EmbeddingBackward0 0.00% 3.386ms 0.19% 178.831ms 1.528ms 117
aten::embedding_backward 0.00% 779.000us 0.19% 178.425ms 1.525ms 117
aten::embedding_dense_backward 0.04% 38.450ms 0.19% 178.196ms 1.523ms 117
aten::bernoulli_ 0.19% 170.831ms 0.19% 172.094ms 176.507us 975
aten::transpose 0.12% 114.097ms 0.18% 168.601ms 4.684us 35997
aten::t 0.07% 61.678ms 0.14% 127.668ms 8.824us 14469
autograd::engine::evaluate_function: CatBackward0 0.03% 25.352ms 0.14% 125.612ms 33.550us 3744
aten::narrow 0.05% 50.234ms 0.13% 121.935ms 3.337us 36543
aten::view 0.12% 110.738ms 0.12% 110.738ms 4.097us 27028
aten::select 0.10% 88.305ms 0.12% 109.244ms 3.307us 33033
aten::permute 0.09% 82.949ms 0.11% 101.152ms 4.487us 22542
aten::_reshape_alias 0.11% 100.959ms 0.11% 100.959ms 2.382us 42393
aten::empty_like 0.05% 43.774ms 0.11% 100.298ms 6.267us 16004
CatBackward0 0.02% 22.431ms 0.11% 100.158ms 26.752us 3744
aten::contiguous 0.01% 13.312ms 0.11% 100.116ms 197.467us 507
autograd::engine::evaluate_function: UnsqueezeBackwa... 0.01% 12.487ms 0.10% 90.196ms 16.061us 5616
autograd::engine::evaluate_function: PermuteBackward... 0.02% 22.503ms 0.10% 90.011ms 8.362us 10764
aten::index 0.09% 86.951ms 0.10% 88.402ms 94.447us 936
autograd::engine::evaluate_function: AsStridedBackwa... 0.00% 1.592ms 0.09% 81.400ms 173.932us 468
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 91.727s
With https://github.com/pytorch/pytorch/commit/4e9b969baa6:
Running train method from hf_BigBird on cpu in eager mode with input batch size 2.
STAGE:2022-09-01 10:22:27 25743:25743 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
CPU Total Wall Time: 7304.638 milliseconds
STAGE:2022-09-01 10:24:11 25743:25743 ActivityProfilerController.cpp:300] Completed Stage: Collection
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
autograd::engine::evaluate_function: SliceBackward0 0.05% 49.660ms 24.00% 21.797s 452.175us 48204
autograd::engine::evaluate_function: AddmmBackward0 0.07% 68.063ms 23.30% 21.166s 7.334ms 2886
...
(folded many lines here)
...
autograd::engine::evaluate_function: NativeLayerNorm... 0.01% 12.356ms 0.28% 254.865ms 251.346us 1014
NativeLayerNormBackward0 0.01% 9.463ms 0.27% 241.531ms 238.196us 1014
aten::native_layer_norm_backward 0.23% 210.711ms 0.26% 233.046ms 229.828us 1014
ViewBackward1 0.03% 30.548ms 0.25% 225.043ms 22.808us 9867
aten::empty 0.22% 195.624ms 0.22% 195.624ms 2.136us 91579
EmbeddingBackward0 -0.00% -2059.000us 0.20% 181.678ms 1.553ms 117
aten::embedding_backward 0.01% 5.557ms 0.20% 181.244ms 1.549ms 117
aten::embedding_dense_backward 0.05% 41.770ms 0.20% 180.983ms 1.547ms 117
aten::bernoulli_ 0.19% 169.857ms 0.19% 171.101ms 175.488us 975
aten::transpose 0.12% 109.056ms 0.16% 149.681ms 4.158us 35997
aten::layer_norm 0.01% 6.206ms 0.15% 138.623ms 136.709us 1014
aten::native_layer_norm 0.14% 128.666ms 0.15% 135.340ms 133.471us 1014
aten::as_strided 0.14% 126.610ms 0.14% 126.610ms 0.483us 262275
aten::view 0.14% 123.483ms 0.14% 123.483ms 4.514us 27354
autograd::engine::evaluate_function: CatBackward0 0.03% 23.972ms 0.12% 112.859ms 30.144us 3744
aten::t 0.06% 54.690ms 0.12% 111.482ms 7.705us 14469
aten::narrow 0.04% 40.580ms 0.11% 104.052ms 2.847us 36543
aten::contiguous 0.01% 9.726ms 0.11% 98.550ms 194.379us 507
aten::select 0.09% 80.117ms 0.10% 92.236ms 2.792us 33033
aten::permute 0.09% 80.631ms 0.10% 91.323ms 4.051us 22542
aten::_reshape_alias 0.10% 90.283ms 0.10% 90.283ms 2.130us 42393
aten::index 0.10% 87.542ms 0.10% 88.640ms 94.701us 936
CatBackward0 0.02% 20.855ms 0.10% 88.612ms 23.668us 3744
autograd::engine::evaluate_function: UnsqueezeBackwa... 0.01% 12.348ms 0.10% 87.514ms 15.583us 5616
aten::empty_like 0.04% 36.967ms 0.10% 86.533ms 5.407us 16004
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 90.823s
@mingfeima I see, I can try to profile it on our CI machine (Cascade Lake) and get back to you.
@mingfeima the profiler will affect the e2e runs. In general, did you repro the regression on your side?
I can't reproduce the regression on my side. hf_BigBird is slightly faster on my local machine (https://github.com/pytorch/pytorch/commit/4e9b969baa6 improves aten::native_layer_norm, but the improvement on the whole model is very limited since layer norm takes a very small ratio of the runtime). mnasnet1_0, timm_regnet, and timm_resnest have the same performance on my local machine.
I tried to reproduce it manually on the CI machine and couldn't reproduce it either. It looks like the problem is in our automated workflow. I apologize; I should have verified this manually earlier.
However, I noticed the latency of test_train[hf_BigBird-cpu-eager] is pretty strange: it seems to keep increasing across repeated runs. I will post the full details of the run latencies here.
For 20220810:
"stats": {
"min": 8.480180480051786,
"max": 19.620859859045595,
"mean": 14.22161482241936,
"stddev": 4.422994823395799,
"rounds": 20,
"median": 13.850522811990231,
"iqr": 9.133345060283318,
"q1": 9.863212924683467,
"q3": 18.996557984966785,
"iqr_outliers": 0,
"stddev_outliers": 11,
"outliers": "11;0",
"ld15iqr": 8.480180480051786,
"hd15iqr": 19.620859859045595,
"ops": 0.07031550302034417,
"total": 284.4322964483872,
"data": [
8.480180480051786,
9.12454328685999,
8.714868865907192,
8.672726313117892,
9.508089357055724,
10.21833649231121,
10.349592973012477,
11.55185868917033,
12.454412408173084,
13.284071160946041,
14.416974463034421,
15.882741200271994,
17.603693272918463,
18.11480759177357,
18.724478934891522,
19.268637035042048,
19.56283976789564,
19.42375202383846,
19.620859859045595,
19.45483227306977
],
"iterations": 1
}
For 20220811:
"stats": {
"min": 9.182652074843645,
"max": 22.705447105690837,
"mean": 15.798760233493521,
"stddev": 5.058727678149553,
"rounds": 20,
"median": 15.183048404054716,
"iqr": 10.548157218843699,
"q1": 10.637420593993738,
"q3": 21.185577812837437,
"iqr_outliers": 0,
"stddev_outliers": 10,
"outliers": "10;0",
"ld15iqr": 9.182652074843645,
"hd15iqr": 22.705447105690837,
"ops": 0.06329610584759622,
"total": 315.97520466987044,
"data": [
9.182652074843645,
9.617483580019325,
9.248191575985402,
9.63914623297751,
10.135544024873525,
11.139297163113952,
12.870241076219827,
13.423423586878926,
13.576850049663335,
14.508523595985025,
15.857573212124407,
17.20031792204827,
18.578686751890928,
19.82361737312749,
20.65858831629157,
21.899249809328467,
21.712567309383303,
22.075798138044775,
22.122005771379918,
22.705447105690837
],
"iterations": 1
}
Although the 20220811 median latency is larger than 20220810's, the stddev of this test is just terrible: the latency keeps increasing across repeated runs, which makes the median latency meaningless. Looking into why this is happening...
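A minimal sketch of that drift check: compare the mean of the first and second half of the per-round latencies; a large gap points to the run getting slower over time rather than noise around a stable mean (data is copied from the 20220811 stats above):

# Sketch only; `data` is the 20220811 "data" list from the stats above.
from statistics import mean

data = [
    9.182652074843645, 9.617483580019325, 9.248191575985402, 9.63914623297751,
    10.135544024873525, 11.139297163113952, 12.870241076219827, 13.423423586878926,
    13.576850049663335, 14.508523595985025, 15.857573212124407, 17.20031792204827,
    18.578686751890928, 19.82361737312749, 20.65858831629157, 21.899249809328467,
    21.712567309383303, 22.075798138044775, 22.122005771379918, 22.705447105690837,
]

first_half, second_half = data[:10], data[10:]
print(f"first half mean:  {mean(first_half):.2f} s")   # ~11.3 s
print(f"second half mean: {mean(second_half):.2f} s")  # ~20.3 s, a clear upward drift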
@xuzhao9 Hi, any update on this issue?
One more thing to clarify: the performance improvement from https://github.com/pytorch/pytorch/pull/84404 cannot be properly reflected by the models listed in this issue, since LayerNorm is only used in BigBird and takes a very small portion of the model (0.24%). If you look at LayerNorm alone, though, the time is reduced from 223.789ms to 135.340ms.
Do you have other models from TB that have LayerNorm as a perf hotspot?
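A minimal sketch for screening candidate models (assuming a torch.nn.Module loaded through the TorchBench harness or elsewhere): count the LayerNorm modules to gauge whether LN can plausibly be a hotspot; the TransformerEncoder below is just an illustrative LN-heavy example, not a TorchBench model:

# Sketch only: count LayerNorm modules in a model to gauge whether LN can matter for e2e perf.
import torch

def count_layer_norms(model: torch.nn.Module) -> int:
    return sum(1 for m in model.modules() if isinstance(m, torch.nn.LayerNorm))

# Illustrative LN-heavy example: a plain transformer encoder.
example = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=6
)
print(count_layer_norms(example))  # 12: two LayerNorms per encoder layer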
@caoe, could you please gather end-to-end performance numbers to show the layer norm improvement?
@mingfeima I looked into this and it seems it is just because the test is flaky; it will take some time to fix, but I believe it is safe to treat this signal as a false positive.
In my impression, LN is often a small layer that doesn't dominate the overall perf, but we are happy to accept new models that can show the LN improvement.
Let's close this issue since this is a false signal. We can create another issue to address the performance test coverage for LayerNorm.