
Slow testing builds on CI

csciguy8 opened this issue 2 years ago • 15 comments

Our github actions (CI jobs) complete very slowly, but only sometimes.

The build times don't seem predictable and have too wide a range of outcomes to accept as normal.

Packaging test example (TestPackage50 / test):
- 30 mins 38 secs - a 'reasonable' time (link)
- 1 hr 53 mins - an excessively long time (link)

Unit test example (TestWindows52 / test):
- 11 mins 19 secs - a 'reasonable' time (link)
- 44 mins 28 secs - an excessively long run (link)

csciguy8 commented Aug 21 '23

Another instance of a very slow build, 2 hr 6 min. (Windows50) https://github.com/CesiumGS/cesium-unreal/actions/runs/5969272697/job/16194821201

csciguy8 commented Aug 25 '23

Turns out we can access metrics for instances that have already terminated, via CloudWatch. So let's look at that two hour build that @csciguy8 linked above.
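
For reference, here's roughly how those numbers can be pulled for a terminated instance. This is just a sketch using the AWS CLI from PowerShell; the instance ID and time window below are placeholders, not the real ones from that build.

```powershell
# Sum of EBSReadBytes per 5-minute interval for a (placeholder) instance and time range.
aws cloudwatch get-metric-statistics `
  --namespace AWS/EC2 `
  --metric-name EBSReadBytes `
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 `
  --start-time 2023-08-25T19:00:00Z `
  --end-time 2023-08-25T21:15:00Z `
  --period 300 `
  --statistics Sum
```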

CPU usage looks fine, peaking at 56% over those two hours (chart points indicate the Maximum over each 5-minute sampling interval): [chart]

NetworkIn and NetworkOut look pretty much like I'd expect. There's a peak of IN bytes at the start for less than 10 minutes, presumably corresponding to the git checkout and install of nasm and dotnetcore-sdk. And OUT bytes at the end to upload the built package to GitHub Actions. In between there is basically nothing. In this chart each datapoint is the sum of all the traffic over that interval: [chart]

Now here's a more interesting one: EBSWriteBytes and EBSReadBytes. Datapoints are the sum of bytes over the interval again: [chart]

So yeah, we're reading a ton of data from EBS. All told, we're reading 52 gigabytes from EBS during the build, and writing 8 GB. I guess that's consistent with reading large parts of Unreal Engine from EBS? Maybe?

I'm not quite sure why we're writing anything to EBS, though. That's probably worth investigating.
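
One quick way to start that investigation would be to check which drive letters actually sit on the EBS root volume versus the local NVMe instance store, so we know which paths the build is writing where. A rough sketch, to be run on the instance itself:

```powershell
# Map each drive letter to its underlying disk so EBS vs. local-SSD writes can be attributed.
Get-Partition | Where-Object { $_.DriveLetter -match '[A-Za-z]' } | ForEach-Object {
    $disk = Get-Disk -Number $_.DiskNumber
    [pscustomobject]@{
        Drive  = $_.DriveLetter
        Disk   = $disk.FriendlyName   # should distinguish the EBS volume from the NVMe instance store
        Bus    = $disk.BusType
        SizeGB = [math]::Round($disk.Size / 1GB, 1)
    }
}
```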

DiskReadBytes and DiskWriteBytes are a lot flatter. Again these are sums. These refer, I think, to the local SSD on which we actually do the build. It seems we are using the local SSD, but not continuously. [chart]

So in all, there are large stretches of time in this build where the CPU and local SSD are underutilized, and the only thing that looks anything like "pegged" is EBS.

kring commented Aug 29 '23

Let's compare to a similar build that completed much more quickly (37 min): https://github.com/CesiumGS/cesium-unreal/actions/runs/6005660664/job/16288789411

First of all, the instance type (c5d.2xlarge) and AMI used for these two builds were identical.

EBSReadBytes and EBSWriteBytes are really different: higher write, much lower read. Total EBSReadBytes is only 3.5 GB over the entire build! Total EBSWriteBytes is still about the same, at about 8 GB. [chart]

So... why in the world is the slower build reading over 10x as much data from EBS?!

CPUUtilization doesn't look different in an interesting way: [chart]

SSD use reflects a more compressed build but doesn't look drastically different: [chart]

kring commented Aug 29 '23

Let's look at EBSReadOps and EBSWriteOps. Here's the slow build: [chart]

And here's the fast build: [chart]

kring commented Aug 29 '23

Thanks for all the investigation @kring

So... why in the world is the slower build reading over 10x as much data from EBS?! All told, we're reading 52 gigabytes from EBS during the build, and writing 8 GB

Feels like something silly is happening here, like a background process building a DB index or something, maybe even originating from the Windows OS itself? A fresh install of UE 5.0 by itself is only 35 GB on disk.
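
If it is a background process, one way to catch it in the act would be to sample the per-process I/O counters while a slow build is running and see who's actually issuing the reads. A hypothetical spot-check (note this counter includes file, network, and device I/O, not only disk):

```powershell
# Sample per-process I/O read rates over ~15 seconds and list the top readers.
(Get-Counter '\Process(*)\IO Read Bytes/sec' -SampleInterval 5 -MaxSamples 3).CounterSamples |
    Sort-Object CookedValue -Descending |
    Select-Object -First 10 InstanceName, @{ n = 'MBps'; e = { [math]::Round($_.CookedValue / 1MB, 2) } }
```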

I'm not quite sure why we're writing anything to EBS, though. That's probably worth investigating.

Agreed. I know that Unreal writes user data during builds, but last I looked it was nowhere near 8 GB (around 200 MB on my machine), under C:/Users/ec2-user/AppData/Local/UnrealEngine.

The Unreal install also has these folders that may contain newly written files during a build, but on my machine their size isn't significantly different from a fresh install (a quick size check is sketched after these paths)...

C:\Program Files\Epic Games\UE_5.0\Engine\DerivedDataCache
C:\Program Files\Epic Games\UE_5.0\Engine\Intermediate
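
For what it's worth, here's a rough way to spot-check those sizes on the build machine (paths exactly as listed above):

```powershell
# Measure how big the Unreal write locations are on the build machine.
'C:\Program Files\Epic Games\UE_5.0\Engine\DerivedDataCache',
'C:\Program Files\Epic Games\UE_5.0\Engine\Intermediate',
'C:\Users\ec2-user\AppData\Local\UnrealEngine' | ForEach-Object {
    $bytes = (Get-ChildItem $_ -Recurse -File -ErrorAction SilentlyContinue |
              Measure-Object Length -Sum).Sum
    '{0,-60} {1,8:N2} GB' -f $_, ($bytes / 1GB)
}
```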

csciguy8 commented Aug 29 '23

Another interesting example where the workflow seems to have gone to lunch (3 hours+) with two different test job instances, in two different places.

https://github.com/CesiumGS/cesium-unreal/actions/runs/6225004383

csciguy8 commented Sep 18 '23

Here's the GitHub Actions runner log from the one that's stuck at "Download plugin artifact": broken-runner.txt

kring commented Sep 18 '23

And the other one: other-broken-runner.txt

kring commented Sep 18 '23

The first of these shows that MsMpEng has really high total CPU usage, which is weird because the real-time scanner is supposed to be disabled.

PS C:\actions-runner\_diag> ps

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    341      17    44080      53772      26.95   3712   0 amazon-cloudwatch-agent
    189      13    20064      17704       1.30   4300   0 amazon-ssm-agent
     86       6     3204       5012       0.03   2104   0 cmd
    143      10     6520      13900       0.14    560   0 conhost
    115       8     5364      11000       3.34    776   0 conhost
    135       9     6488      12540       0.23   1564   0 conhost
    142      10     6524      13856       0.33   2380   0 conhost
     82       7     6172      10764       0.05   3632   0 conhost
    541      20     2144       6548       1.22    548   0 csrss
    164      10     1816       6000       0.17    652   1 csrss
    632      27    18828      44104       0.61   1076   1 dwm
     39       6     1520       4524       0.03    972   1 fontdrvhost
     39       6     1352       4224       0.05    980   0 fontdrvhost
      0       0       60          8                 0   0 Idle
    435      26    11268      45844       0.44   4064   1 LogonUI
   1027      22     5648      17416      74.86    816   0 lsass
    259      13     3036      11092       0.05   5008   0 msdtc
    647     128   257300     208516   1,439.23   1644   0 MsMpEng
    176      35    63960      19836       9.78   2728   0 node
    739      31    94856     112436       1.73   3048   0 powershell
      0       8      328      85072       0.38    156   0 Registry
    587      51    48468      16940       2.17   3964   0 Runner.Listener
    644      55    54552      17068       4.31   3764   0 Runner.Worker
    467      11     4264       9168       0.78    796   0 services
     57       4     1080       1272       0.08    432   0 smss
    452      22     5788      17584       0.14   2212   0 spoolsv
    138      10     1848       8424       0.02   2600   0 sqlwriter
    408      19    26712      31108      13.98   4912   0 ssm-agent-worker
    359      17    21636      23876       3.36   3752   0 ssm-session-worker
    103       9    17412       6968       0.02   4008   0 start-amazon-cloudwatch-agent
    657      15     5932      12292       2.11    496   0 svchost
    259      13     3748      11796       0.31    500   0 svchost
    261      11     2196       9208       0.13    604   0 svchost
    526      19     4620      13060       0.16    844   0 svchost
    206      11     2280      11832       0.03    896   0 svchost
    741      15     4768      14104       0.56    936   0 svchost
    185      11     1808       8548       0.05   1088   0 svchost
    254      14     3156      14944       0.11   1104   0 svchost
    339      14    11636      16672       0.77   1216   0 svchost
    111       8     1220       5616       0.00   1224   0 svchost
    115      14     2916       7424       0.09   1312   0 svchost
    224      10     2108       8000       0.14   1352   0 svchost
    294      20     8448      15916       0.05   1372   0 svchost
    161      10     1660       7912       0.06   1388   0 svchost
    236      13     2856      18380       1.27   1396   0 svchost
    113       7     1176       5984       0.02   1404   0 svchost
    148       9     1724       7984       0.02   1412   0 svchost
    121       8     1232       5840       0.03   1504   0 svchost
    378      14     3896      12108       0.06   1580   0 svchost
    376      18     4740      15812       0.42   1604   0 svchost
    271      16     3732       9924       1.30   1628   0 svchost
    277      11     1912       8968       0.02   1636   0 svchost
    122       8     1296       6288       0.00   1668   0 svchost
    217      13     2132      10032       0.13   1676   0 svchost
    134       9     1532       7008       0.02   1704   0 svchost
    387      31     7056      16016       0.22   1728   0 svchost
    173      11     1744       8448       0.06   1836   0 svchost
    123       8     3788      10968       0.08   1848   0 svchost
    393      13     2756      10564       0.16   1932   0 svchost
    159       9     1716       7556       0.30   2028   0 svchost
    118       7     1224       6152       0.00   2088   0 svchost
    199      15     6668      11756       0.02   2220   0 svchost
    400      16     9572      21092       1.97   2292   0 svchost
    140       8     1592       7680       0.02   2300   0 svchost
    364      15     2644      10900       0.05   2308   0 svchost
    136       9     1492       6884       0.03   2316   0 svchost
    127       8     1224       5836       0.00   2324   0 svchost
    119       8     1400      11136       0.02   2348   0 svchost
    213      11     2648      10892       0.05   2356   0 svchost
    212      12     1672       7744       0.11   2364   0 svchost
    277      27     3900      14512      53.52   2372   0 svchost
    235      13     2912      12348       0.02   2732   0 svchost
    129       8     1300       6680       0.02   2760   0 svchost
    137       8     1316       6492       0.00   3008   0 svchost
    212      14     1960       8600       0.03   3036   0 svchost
    232      14     2652      12152       0.09   3064   0 svchost
    116       8     1348       7316       0.00   3512   0 svchost
    307      17    13584      20208       4.98   4332   0 svchost
    235      12     3704      11060       0.03   4580   0 svchost
   1572       0       44        128      47.34      4   0 System
    155      11     1368       7196       0.06    676   0 wininit
    229      12     2508      15924       0.14    752   1 winlogon
     97       8     3200       7580       2.05   4340   0 winpty-agent
    172      10     2256       9328       0.02   2836   0 WmiPrvSE

The second doesn't have that same pattern:

PS C:\actions-runner\_diag> ps

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    335      17    44132      54484      24.45   3796   0 amazon-cloudwatch-agent
    185      13    20068      17712       1.23   1800   0 amazon-ssm-agent
     86       6     3200       5036       0.00   2088   0 cmd
     81       7     6164      10680       0.16   3080   0 conhost
    145      10     6560      13924       0.13   3648   0 conhost
    142      10     6540      13896       0.03   4564   0 conhost
    115       8     5184      10712       1.05   4668   0 conhost
    135       9     6492      12752       0.16   4996   0 conhost
    661      20     2408       6764       2.30    540   0 csrss
    164      10     1820       5992       0.16    612   1 csrss
    632      27    18840      44084       1.30   1084   1 dwm
     39       6     1368       4236       0.00    924   0 fontdrvhost
     39       6     1508       4520       0.02    932   1 fontdrvhost
      0       0       60          8                 0   0 Idle
    436      26    10992      45524       0.39   4076   1 LogonUI
   1033      22     5740      17504       1.53    776   0 lsass
    259      13     3052      11132       0.05   3708   0 msdtc
    593      92   206320     112904      27.88    796   0 MsMpEng
    752      30    93300     106524       1.05    972   0 powershell
      0       8      296      84356       0.48    156   0 Registry
    741      51    45020      16280       2.14   1404   0 Runner.Listener
    530      55    58648      16356       3.08   4272   0 Runner.Worker
    463      12     4396       9276       0.91    756   0 services
     57       4     1092       1284       0.09    428   0 smss
    452      21     5748      17548       0.14   2168   0 spoolsv
    138      10     1860       8432       0.02   2840   0 sqlwriter
    402      19    27776      31852      14.20   3880   0 ssm-agent-worker
    359      18    24220      25412       1.27   4576   0 ssm-session-worker
    103       9    17176       6920       0.05   4724   0 start-amazon-cloudwatch-agent
    261      12     2180       9208       0.09    484   0 svchost
    527      19     4664      13068       0.19    868   0 svchost
    736      15     4856      14172       0.38    896   0 svchost
    651      16     6020      12312       2.11   1020   0 svchost
    254      14     3156      14908       0.09   1060   0 svchost
    184      11     1808       8552       0.03   1092   0 svchost
    163       9     1784       7568       0.23   1152   0 svchost
    212      14     1968       8628       0.02   1188   0 svchost
    111       8     1240       5632       0.02   1204   0 svchost
    336      14    11248      16312       0.81   1244   0 svchost
    275      20     8708      16040       0.02   1276   0 svchost
    115      14     2924       7460       0.05   1304   0 svchost
    224      10     2088       8008       0.14   1332   0 svchost
    161      10     1692       7944       0.06   1364   0 svchost
    236      13     2800      18336       1.27   1408   0 svchost
    113       7     1180       6000       0.02   1416   0 svchost
    148       9     1708       7976       0.03   1424   0 svchost
    376      14     3900      12188       0.05   1516   0 svchost
    121       8     1240       5852       0.00   1560   0 svchost
    206      11     2252      11824       0.00   1572   0 svchost
    376      18     4788      15816       0.27   1596   0 svchost
    268      15     3772       9936       1.59   1620   0 svchost
    277      11     1908       9056       0.05   1628   0 svchost
    217      13     2156      10060       0.13   1672   0 svchost
    134       9     1556       6908       0.02   1704   0 svchost
    388      31     7052      16008       0.19   1732   0 svchost
    383      13     2708      10552       0.11   1880   0 svchost
    173      11     1752       8368       0.09   1912   0 svchost
    122       8     1256       6252       0.03   1972   0 svchost
    118       7     1236       6168       0.17   2068   0 svchost
    235      12     3692      11064       0.09   2208   0 svchost
    136       9     1540       6932       0.05   2260   0 svchost
    362      15     2700      10940       0.03   2268   0 svchost
    137       8     1536       7632       0.02   2284   0 svchost
    127       8     1240       5844       0.00   2292   0 svchost
    272      27     3864      14488       0.27   2308   0 svchost
    396      16     9032      20572       1.84   2316   0 svchost
    212      12     1676       7748       0.06   2324   0 svchost
    119       8     1368      11108       0.03   2332   0 svchost
    213      12     2712      10928       0.06   2340   0 svchost
    304      17    13104      19116       5.33   2412   0 svchost
    232      14     2652      12164       0.05   2696   0 svchost
    235      13     2980      12416       0.08   2704   0 svchost
    129       8     1312       6696       0.02   2888   0 svchost
    137       8     1320       6496       0.00   3056   0 svchost
    116       8     1316       7304       0.03   3532   0 svchost
    123       8     3768      10916       0.08   3996   0 svchost
    118       8     1476       7152       0.02   4056   0 svchost
   1539       0       40        140      11.77      4   0 System
    155      11     1372       7240       0.06    636   0 wininit
    229      12     2504      15916       0.05    688   1 winlogon
     97       8     2512       6756       0.58   2908   0 winpty-agent

Both VMs are up and working just fine as far as I can tell. So I have no idea why the build has frozen.

kring commented Sep 18 '23

Looking at the Actions log, it seems like the plugin artifact download finished fine, and then GitHub Actions decided to take a break and not do anything more. The machine and the service were seemingly still working, though, because they responded to the "cancel" hours later:

2023-09-18T19:11:27.3955128Z Artifact CesiumForUnreal-52-windows-v1.30.1-257-g6c159a0c was downloaded to d:\cesium-unreal\cesium-unreal\download
2023-09-18T19:11:27.3967735Z Artifact download has finished successfully
2023-09-19T01:02:02.0594284Z ##[error]The operation was canceled.
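
For next time this happens while the machine is still reachable, the runner's own diagnostic logs (the _diag folder seen in the process listings above) are probably the quickest way to see what the worker thinks it's doing. A sketch, just following whichever log was written to most recently:

```powershell
# Follow the most recently written GitHub Actions runner diagnostic log.
Get-ChildItem C:\actions-runner\_diag\*.log |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 1 |
    Get-Content -Tail 50 -Wait
```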

In fact, the VM is still up and working fine now (I removed the ghr:Application tag from the EC2 instance so that it wouldn't be shut down automatically).

@csciguy8 I'll give you the necessary access so you can dig into this yourself if you want.

kring commented Sep 19 '23

Did a quick search for MsMpEng, and it seems to plague more than just us.

There are a variety of suggested fixes for reining it in further, ranging from excluding its own directory from scans, to turning Windows Defender off entirely via the Local Group Policy Editor, to just scheduling scans for some other time. Not sure which solution is best for these VMs...
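
Most of the "schedule it some other time" suggestions boil down to a couple of Set-MpPreference tweaks, something like the following (the day, time, and CPU cap here are arbitrary examples, not a recommendation):

```powershell
# Push the scheduled scan to a quiet time and cap how much CPU a scan is allowed to use.
Set-MpPreference -ScanScheduleDay Sunday -ScanScheduleTime 02:00:00
Set-MpPreference -ScanAvgCPULoadFactor 20
```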

csciguy8 commented Sep 19 '23

I ran Get-MpPreference toward the end of a long-running Unity build that was showing very high CPU usage by MsMpEng.exe. Here's the report: mppreferences.txt

Nothing jumps out at me as problematic, but I haven't analyzed it in detail. DisableRealtimeMonitoring is true, as are DisableCatchupFullScan and DisableCatchupQuickScan. So I really have no idea what MsMpEng.exe is doing with all that CPU time.
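
(For anyone repeating this check, the relevant flags can be pulled directly instead of reading the whole report:)

```powershell
# Show just the Defender settings discussed here.
Get-MpPreference |
    Select-Object DisableRealtimeMonitoring, DisableCatchupFullScan, DisableCatchupQuickScan,
                  ScanScheduleDay, ScanScheduleTime, ScanAvgCPULoadFactor
```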

kring commented Sep 20 '23

I want to try this out next time I see this problem while the build is still running: https://learn.microsoft.com/en-us/microsoft-365/security/defender-endpoint/tune-performance-defender-antivirus
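
If I'm reading that page right, the workflow is roughly: record Defender activity while the build is misbehaving, then report on what it spent its time scanning. A sketch (cmdlet names are from that doc, the .etl path is arbitrary, and it needs a reasonably recent Defender platform version):

```powershell
# Record Defender antivirus activity while the slow build is in progress...
New-MpPerformanceRecording -RecordTo C:\temp\defender-trace.etl
# ...stop the recording when prompted, then summarize what was scanned and by whom.
Get-MpPerformanceReport -Path C:\temp\defender-trace.etl -TopScans 20 -TopFiles 10 -TopProcesses 10
```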

kring commented Sep 20 '23

So I really have no idea what MsMpEng.exe is doing with all that CPU time.

Maybe scanning newly generated files from the build? An easy tweak might be to configure Defender to exclude the Unreal install folders, intermediate folders, and any artifact output folders.
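
Presumably something along these lines; the exact paths would need to match the CI image, so the ones below are just the folders mentioned earlier in the thread plus the workspace drive from the artifact log:

```powershell
# Exclude the Unreal install, Unreal's per-user data, and the build workspace from Defender scanning.
Add-MpPreference -ExclusionPath 'C:\Program Files\Epic Games\UE_5.0'
Add-MpPreference -ExclusionPath 'C:\Users\ec2-user\AppData\Local\UnrealEngine'
Add-MpPreference -ExclusionPath 'D:\cesium-unreal'   # workspace / artifact download location
```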

csciguy8 commented Sep 20 '23

With real-time monitoring disabled, it shouldn't be scanning newly-generated files like that. But obviously there's some flaw in my understanding of how this all works, so maybe an explicit exclude is worth a try.

kring commented Sep 20 '23

Closing this.

I haven't seen this behavior in a while. My guess is that the switch to GitHub hosted runners fixed it.

csciguy8 commented Jul 03 '24