V2 Performance Signal Detected by TorchBench CI on '1.13.0.dev20220811+cu113'
TorchBench CI has detected a performance signal.
Base PyTorch version: 1.13.0.dev20220810+cu113
Base PyTorch commit: e1007950484aa1df4a2f87c9c14b514ffd7736a5
Affected PyTorch version: 1.13.0.dev20220811+cu113
Affected PyTorch commit: 3aeb5e4ff9d56ecd680401cfa3f23e97a279efbe
Affected Tests:
- test_train[hf_BigBird-cpu-eager]: +11.08978%
- test_eval[mnasnet1_0-cpu-jit]: -19.72266%
- test_train[timm_regnet-cpu-eager]: -7.86692%
- test_train[timm_resnest-cpu-eager]: +7.16990%
cc @xuzhao9
Workflow: https://github.com/pytorch/benchmark/actions/runs/2841567257
Result:
{
  "start": "e1007950484aa1df4a2f87c9c14b514ffd7736a5",
  "end": "3aeb5e4ff9d56ecd680401cfa3f23e97a279efbe",
  "threshold": 7,
  "timeout": 120,
  "torchbench_branch": "v2.0",
  "result": [
    {
      "commit1": "017ecb782d2",
      "commit1_time": "2022-08-10 21:50:13 +0000",
      "commit1_digest": {
        "test_train[hf_BigBird-cpu-eager]": 14.196933696744964
      },
      "commit2": "4e9b969baa6",
      "commit2_time": "2022-08-10 21:58:00 +0000",
      "commit2_digest": {
        "test_train[hf_BigBird-cpu-eager]": 15.590026811161078
      }
    }
  ]
}
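For reference, a rough sketch (not the actual TorchBench CI code) of how a relative delta can be computed from the bisection digests above; the headline percentages in the report come from the nightly comparison, so they will not match this number exactly:

# Sketch only, not the CI implementation.
def percent_change(before_s: float, after_s: float) -> float:
    """Relative latency change in percent; positive means the test got slower."""
    return (after_s - before_s) / before_s * 100.0

before = 14.196933696744964  # commit1_digest, test_train[hf_BigBird-cpu-eager]
after = 15.590026811161078   # commit2_digest, same test

print(f"test_train[hf_BigBird-cpu-eager]: {percent_change(before, after):+.2f}%")
# roughly +9.8%, above the 7% threshold in the config above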
@malfet @mingfeima Looks like 4e9b969baa6 slows down the hf_BigBird test by 10%.
Sounds good, let's revert
Reverted. Let's continue the discussion here about how this might have happened, as a pretty thorough performance analysis was done in https://github.com/pytorch/pytorch/pull/81849
Has torchbench detected any cases where performance increased as a result of this change being landed?
No, it didn't. The other signal tests (such as test_eval[mnasnet1_0-cpu-jit]) are likely noise, as the bisector couldn't find a commit that reproduces the performance change.
Generally, the CPU tests are noisier than they were before (e.g., before May 2022). We should also double-check whether this commit really causes the regression (to rule out the possibility of noise).
After the revert (https://github.com/pytorch/pytorch/commit/799620178cf), we observed a performance speedup (https://github.com/pytorch/benchmark/issues/1103), and we have cross-validated in https://github.com/pytorch/benchmark/issues/1100, so I believe this is a true signal, not noise.
@xuzhao9 I will have the regression fixed. Which command should I use if I want to run a standalone test, for example hf_BigBird-cpu-eager?
@mingfeima can you please provide some feedback on what caused the regression and how it is being fixed? You should be able to run the test with something like:
$ python3 run.py -t train -m eager hf_BigBird -d cpu
@malfet, I am still working on this and I will post a detailed explanation once the issue is fixed.
@mingfeima any progress on this issue?
@xuzhao9 sorry, I haven't fixed this issue yet; I am still experimenting.
@xuzhao9 do you have profiler logs with and without https://github.com/pytorch/pytorch/commit/4e9b969baa6 on python3 run.py -t train -m eager hf_BigBird -d cpu? If you can get a profiler log on your side, it would be very helpful for debugging this issue.
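For anyone collecting such a log, here is a minimal sketch using torch.profiler; it is not the TorchBench harness itself, and run_one_train_iteration is a hypothetical stand-in for the hf_BigBird train step on your checkout:

# Sketch only, not the TorchBench harness: wrap one training iteration in
# torch.profiler to produce a table like the ones later in this thread.
from torch.profiler import profile, ProfilerActivity

def run_one_train_iteration():
    # hypothetical placeholder: call the hf_BigBird train step from your TorchBench checkout here
    pass

with profile(activities=[ProfilerActivity.CPU]) as prof:
    run_one_train_iteration()

# Sort by total CPU time so ops such as aten::native_layer_norm are easy to find.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))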
Test results on my local machine indicate that on hf_BigBird, the time spent in aten::native_layer_norm has dropped from 223.789ms to 135.340ms. But layer_norm takes a very small proportion of the model (0.24%), so the overall end-to-end time is roughly the same.
For the rest of the models (mnasnet1_0, timm_regnet, timm_resnest), the performance with and without https://github.com/pytorch/pytorch/commit/4e9b969baa6 is the same on my local machine.
Currently, https://github.com/pytorch/pytorch/commit/4e9b969baa6 only affects LayerNorm and GroupNorm (these are the only places where RowwiseMoments is used). Neither LayerNorm nor GroupNorm is used in mnasnet1_0, timm_regnet, or timm_resnest.
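To isolate the kernel in question, a minimal standalone sketch like the following can time LayerNorm forward+backward directly; the shapes are illustrative and not taken from hf_BigBird:

# Sketch only: standalone LayerNorm microbenchmark, not part of TorchBench.
import time
import torch

torch.manual_seed(0)
ln = torch.nn.LayerNorm(768)
x = torch.randn(8, 512, 768, requires_grad=True)

def step():
    # forward + backward through LayerNorm only
    ln(x).sum().backward()
    x.grad = None
    ln.zero_grad()

for _ in range(5):  # warmup
    step()

iters = 50
start = time.perf_counter()
for _ in range(iters):
    step()
print(f"LayerNorm fwd+bwd: {(time.perf_counter() - start) / iters * 1e3:.3f} ms per iteration")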
The following is the profiler log on my local machine (CLX Xeon, single socket, 20 cores), without https://github.com/pytorch/pytorch/commit/4e9b969baa6:
Running train method from hf_BigBird on cpu in eager mode with input batch size 2.
STAGE:2022-09-01 09:50:44 25435:25435 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
CPU Total Wall Time: 7392.780 milliseconds
STAGE:2022-09-01 09:52:23 25435:25435 ActivityProfilerController.cpp:300] Completed Stage: Collection
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
autograd::engine::evaluate_function: SliceBackward0 0.00% 3.370ms 24.25% 22.242s 461.415us 48204
autograd::engine::evaluate_function: AddmmBackward0 0.08% 76.393ms 23.02% 21.117s 7.317ms 2886
...
(folded many lines here)
...
autograd::engine::evaluate_function: NativeLayerNorm... 0.02% 15.435ms 0.29% 263.723ms 260.082us 1014
NativeLayerNormBackward0 0.01% 10.187ms 0.27% 248.288ms 244.860us 1014
aten::native_layer_norm_backward 0.23% 214.712ms 0.26% 238.101ms 234.814us 1014
aten::empty 0.26% 237.880ms 0.26% 237.880ms 2.598us 91579
aten::layer_norm 0.01% 6.399ms 0.25% 227.613ms 224.470us 1014
aten::native_layer_norm 0.24% 217.006ms 0.24% 223.789ms 220.699us 1014
aten::as_strided 0.21% 192.382ms 0.21% 192.382ms 0.734us 262275
EmbeddingBackward0 0.00% 3.386ms 0.19% 178.831ms 1.528ms 117
aten::embedding_backward 0.00% 779.000us 0.19% 178.425ms 1.525ms 117
aten::embedding_dense_backward 0.04% 38.450ms 0.19% 178.196ms 1.523ms 117
aten::bernoulli_ 0.19% 170.831ms 0.19% 172.094ms 176.507us 975
aten::transpose 0.12% 114.097ms 0.18% 168.601ms 4.684us 35997
aten::t 0.07% 61.678ms 0.14% 127.668ms 8.824us 14469
autograd::engine::evaluate_function: CatBackward0 0.03% 25.352ms 0.14% 125.612ms 33.550us 3744
aten::narrow 0.05% 50.234ms 0.13% 121.935ms 3.337us 36543
aten::view 0.12% 110.738ms 0.12% 110.738ms 4.097us 27028
aten::select 0.10% 88.305ms 0.12% 109.244ms 3.307us 33033
aten::permute 0.09% 82.949ms 0.11% 101.152ms 4.487us 22542
aten::_reshape_alias 0.11% 100.959ms 0.11% 100.959ms 2.382us 42393
aten::empty_like 0.05% 43.774ms 0.11% 100.298ms 6.267us 16004
CatBackward0 0.02% 22.431ms 0.11% 100.158ms 26.752us 3744
aten::contiguous 0.01% 13.312ms 0.11% 100.116ms 197.467us 507
autograd::engine::evaluate_function: UnsqueezeBackwa... 0.01% 12.487ms 0.10% 90.196ms 16.061us 5616
autograd::engine::evaluate_function: PermuteBackward... 0.02% 22.503ms 0.10% 90.011ms 8.362us 10764
aten::index 0.09% 86.951ms 0.10% 88.402ms 94.447us 936
autograd::engine::evaluate_function: AsStridedBackwa... 0.00% 1.592ms 0.09% 81.400ms 173.932us 468
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 91.727s
With https://github.com/pytorch/pytorch/commit/4e9b969baa6:
Running train method from hf_BigBird on cpu in eager mode with input batch size 2.
STAGE:2022-09-01 10:22:27 25743:25743 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
CPU Total Wall Time: 7304.638 milliseconds
STAGE:2022-09-01 10:24:11 25743:25743 ActivityProfilerController.cpp:300] Completed Stage: Collection
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
autograd::engine::evaluate_function: SliceBackward0 0.05% 49.660ms 24.00% 21.797s 452.175us 48204
autograd::engine::evaluate_function: AddmmBackward0 0.07% 68.063ms 23.30% 21.166s 7.334ms 2886
...
(folded many lines here)
...
autograd::engine::evaluate_function: NativeLayerNorm... 0.01% 12.356ms 0.28% 254.865ms 251.346us 1014
NativeLayerNormBackward0 0.01% 9.463ms 0.27% 241.531ms 238.196us 1014
aten::native_layer_norm_backward 0.23% 210.711ms 0.26% 233.046ms 229.828us 1014
ViewBackward1 0.03% 30.548ms 0.25% 225.043ms 22.808us 9867
aten::empty 0.22% 195.624ms 0.22% 195.624ms 2.136us 91579
EmbeddingBackward0 -0.00% -2059.000us 0.20% 181.678ms 1.553ms 117
aten::embedding_backward 0.01% 5.557ms 0.20% 181.244ms 1.549ms 117
aten::embedding_dense_backward 0.05% 41.770ms 0.20% 180.983ms 1.547ms 117
aten::bernoulli_ 0.19% 169.857ms 0.19% 171.101ms 175.488us 975
aten::transpose 0.12% 109.056ms 0.16% 149.681ms 4.158us 35997
aten::layer_norm 0.01% 6.206ms 0.15% 138.623ms 136.709us 1014
aten::native_layer_norm 0.14% 128.666ms 0.15% 135.340ms 133.471us 1014
aten::as_strided 0.14% 126.610ms 0.14% 126.610ms 0.483us 262275
aten::view 0.14% 123.483ms 0.14% 123.483ms 4.514us 27354
autograd::engine::evaluate_function: CatBackward0 0.03% 23.972ms 0.12% 112.859ms 30.144us 3744
aten::t 0.06% 54.690ms 0.12% 111.482ms 7.705us 14469
aten::narrow 0.04% 40.580ms 0.11% 104.052ms 2.847us 36543
aten::contiguous 0.01% 9.726ms 0.11% 98.550ms 194.379us 507
aten::select 0.09% 80.117ms 0.10% 92.236ms 2.792us 33033
aten::permute 0.09% 80.631ms 0.10% 91.323ms 4.051us 22542
aten::_reshape_alias 0.10% 90.283ms 0.10% 90.283ms 2.130us 42393
aten::index 0.10% 87.542ms 0.10% 88.640ms 94.701us 936
CatBackward0 0.02% 20.855ms 0.10% 88.612ms 23.668us 3744
autograd::engine::evaluate_function: UnsqueezeBackwa... 0.01% 12.348ms 0.10% 87.514ms 15.583us 5616
aten::empty_like 0.04% 36.967ms 0.10% 86.533ms 5.407us 16004
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 90.823s
@mingfeima I see, I can try to profile it on our CI machine (Cascade Lake) and get back to you.
@mingfeima the profiler will affect the e2e runs. In general, did you repro the regression on your side?
I can't reproduce the regression on my side. hf_BigBird is slightly faster on my local machine (https://github.com/pytorch/pytorch/commit/4e9b969baa6 improves aten::native_layer_norm, but the improvement on the whole model is very limited since layer norm takes a very small ratio of the runtime). mnasnet1_0, timm_regnet, and timm_resnest have the same performance on my local machine.
I tried to reproduce it manually on the CI machine and couldn't reproduce it either. It looks like the problem is in our automated workflow. I apologize; I should have verified this manually earlier.
However, I noticed the latency of test_train[hf_BigBird-cpu-eager] is pretty strange: it seems to keep increasing across repeated runs. I will post the full details of the run latencies here.
For 20220810:
"stats": {
"min": 8.480180480051786,
"max": 19.620859859045595,
"mean": 14.22161482241936,
"stddev": 4.422994823395799,
"rounds": 20,
"median": 13.850522811990231,
"iqr": 9.133345060283318,
"q1": 9.863212924683467,
"q3": 18.996557984966785,
"iqr_outliers": 0,
"stddev_outliers": 11,
"outliers": "11;0",
"ld15iqr": 8.480180480051786,
"hd15iqr": 19.620859859045595,
"ops": 0.07031550302034417,
"total": 284.4322964483872,
"data": [
8.480180480051786,
9.12454328685999,
8.714868865907192,
8.672726313117892,
9.508089357055724,
10.21833649231121,
10.349592973012477,
11.55185868917033,
12.454412408173084,
13.284071160946041,
14.416974463034421,
15.882741200271994,
17.603693272918463,
18.11480759177357,
18.724478934891522,
19.268637035042048,
19.56283976789564,
19.42375202383846,
19.620859859045595,
19.45483227306977
],
"iterations": 1
}
For 20220811:
"stats": {
"min": 9.182652074843645,
"max": 22.705447105690837,
"mean": 15.798760233493521,
"stddev": 5.058727678149553,
"rounds": 20,
"median": 15.183048404054716,
"iqr": 10.548157218843699,
"q1": 10.637420593993738,
"q3": 21.185577812837437,
"iqr_outliers": 0,
"stddev_outliers": 10,
"outliers": "10;0",
"ld15iqr": 9.182652074843645,
"hd15iqr": 22.705447105690837,
"ops": 0.06329610584759622,
"total": 315.97520466987044,
"data": [
9.182652074843645,
9.617483580019325,
9.248191575985402,
9.63914623297751,
10.135544024873525,
11.139297163113952,
12.870241076219827,
13.423423586878926,
13.576850049663335,
14.508523595985025,
15.857573212124407,
17.20031792204827,
18.578686751890928,
19.82361737312749,
20.65858831629157,
21.899249809328467,
21.712567309383303,
22.075798138044775,
22.122005771379918,
22.705447105690837
],
"iterations": 1
}
Although the 20220811 median latency is larger than 20220810's, the stddev of this test is just terrible: the latency keeps increasing across repeated runs, which makes the median latency meaningless. Looking into why this is happening...
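A minimal sketch of that drift check: compare the mean of the first and second half of the per-round latencies; a large gap points to the run getting slower over time rather than noise around a stable mean (data is copied from the 20220811 stats above):

# Sketch only; `data` is the 20220811 "data" list from the stats above.
from statistics import mean

data = [
    9.182652074843645, 9.617483580019325, 9.248191575985402, 9.63914623297751,
    10.135544024873525, 11.139297163113952, 12.870241076219827, 13.423423586878926,
    13.576850049663335, 14.508523595985025, 15.857573212124407, 17.20031792204827,
    18.578686751890928, 19.82361737312749, 20.65858831629157, 21.899249809328467,
    21.712567309383303, 22.075798138044775, 22.122005771379918, 22.705447105690837,
]

first_half, second_half = data[:10], data[10:]
print(f"first half mean:  {mean(first_half):.2f} s")   # ~11.3 s
print(f"second half mean: {mean(second_half):.2f} s")  # ~20.3 s, a clear upward drift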
@xuzhao9 Hi, any update on this issue?
One more thing to clarify: the performance improvement from https://github.com/pytorch/pytorch/pull/84404 cannot be properly reflected by the models listed in this issue, since LayerNorm is only used in BigBird and takes a very small portion of the model (0.24%). If you look at LayerNorm alone, though, the time is reduced from 223.789ms to 135.340ms.
Do you have other models from TB that have LayerNorm as a perf hotspot?
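A minimal sketch for screening candidate models (assuming a torch.nn.Module loaded through the TorchBench harness or elsewhere): count the LayerNorm modules to gauge whether LN can plausibly be a hotspot; the TransformerEncoder below is just an illustrative LN-heavy example, not a TorchBench model:

# Sketch only: count LayerNorm modules in a model to gauge whether LN can matter for e2e perf.
import torch

def count_layer_norms(model: torch.nn.Module) -> int:
    return sum(1 for m in model.modules() if isinstance(m, torch.nn.LayerNorm))

# Illustrative LN-heavy example: a plain transformer encoder.
example = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=6
)
print(count_layer_norms(example))  # 12: two LayerNorms per encoder layer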
@caoe, could you please gather end-to-end performance numbers to show the layer norm improvement?
@mingfeima I looked into this and it seems it is just because the test is flaky; it will take some time to fix, but I believe it is safe to treat this signal as a false positive.
In my impression, LN is often a small layer that doesn't dominate the overall perf, but we are happy to accept new models that can show the LN improvement.
Let's close this issue since this is a false signal. We can create another issue to address the performance test coverage for LayerNorm.