cesium-unreal
Slow testing builds on CI
Our GitHub Actions CI jobs complete very slowly, but only sometimes.
The build times don't seem predictable and have too wide a range of outcomes to accept as normal.
Packaging tests example:
TestPackage50 / test
30 mins 38 secs - A 'reasonable' time, link
1 hr 53 mins - Excessively long time, link
Unit tests example:
TestWindows52 / test
11 mins 19 secs - A 'reasonable' time, link
44 mins 28 secs - Excessively long run, link
Another instance of a very slow build, 2 hr 6 min. (Windows50) https://github.com/CesiumGS/cesium-unreal/actions/runs/5969272697/job/16194821201
Turns out we can access metrics for instances that have already terminated, via CloudWatch. So let's look at that two hour build that @csciguy8 linked above.
CPU usage looks fine, peaking at 56% over those two hours (chart points indicate the Maximum over each 5 minute sampling interval):
NetworkIn and NetworkOut look pretty much like I'd expect. There's a peak of IN bytes at the start for less than 10 minutes, presumably corresponding to the git checkout and the install of nasm and dotnetcore-sdk, and a peak of OUT bytes at the end to upload the built package to GitHub Actions. In between there is basically nothing. In this chart each datapoint is the sum of all the traffic over that interval:
Now here's a more interesting one. EBSWriteBytes and EBSReadBytes. Datapoints are the sum of bytes over the interval again:
So yeah, we're reading a ton of data from EBS. All told, we're reading 52 gigabytes from EBS during the build, and writing 8 GB. I guess that's consistent with reading large parts of Unreal Engine from EBS? Maybe?
I'm not quite sure why we're writing anything to EBS, though. That's probably worth investigating.
DiskReadBytes and DiskWriteBytes are a lot flatter. Again these are sums. These refer - I think - to the local SSD on which we actually do the build. It seems we are using the local SSD, but not continuously.
So in all, there are large stretches of time in this build where the CPU and local SSD are underutilized, and the only thing that looks anything like "pegged" is EBS.
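For reference, here's a rough sketch of how these per-instance numbers can be pulled from CloudWatch after the instance has terminated, using the AWS CLI from PowerShell. The instance ID and time window below are placeholders, not the actual runner's:

# Sum of EBSReadBytes in 5-minute buckets for a (placeholder) instance and time window.
# Swap the metric name for EBSWriteBytes, NetworkIn, etc., or use --statistics Maximum for CPUUtilization.
aws cloudwatch get-metric-statistics `
    --namespace AWS/EC2 `
    --metric-name EBSReadBytes `
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 `
    --start-time 2023-08-24T14:00:00Z `
    --end-time 2023-08-24T16:30:00Z `
    --period 300 `
    --statistics Sum `
    --output table

The same call with EBSReadOps and EBSWriteOps gives the ops charts further down.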
Let's compare to a similar build that completed much more quickly (37 min): https://github.com/CesiumGS/cesium-unreal/actions/runs/6005660664/job/16288789411
First of all, the instance type (c5d.2xlarge) and AMI used for these two builds were identical.
EBSReadBytes and EBSWriteBytes are really different: higher write, much lower read. Total EBSReadBytes is only 3.5 GB over the entire build! Total EBSWriteBytes is about the same, still around 8 GB.
So... why in the world is the slower build reading over 10x as much data from EBS?!
CPUUtilization doesn't look different in an interesting way:
SSD use reflects a more compressed build but doesn't look drastically different:
Let's look at EBSReadOps and EBSWriteOps. Here's the slow build:
And here's the fast build:
Thanks for all the investigation, @kring.
So... why in the world is the slower build reading over 10x as much data from EBS?! All told, we're reading 52 gigabytes from EBS during the build, and writing 8 GB
Feels like something silly is happening here, like a background process building a DB index or something, maybe even originating from the Windows OS? A fresh install of UE 5.0 alone is 35 GB on disk.
I'm not quite sure why we're writing anything to EBS, though. That's probably worth investigating.
Agreed. I know that Unreal writes user data during builds (to the folder below), but last I looked it was nowhere near 8 GB (about 200 MB on my machine).
C:/Users/ec2-user/AppData/Local/UnrealEngine
The Unreal install also has these folders that may contain newly written files during a build, but on my machine their size isn't significantly different from a fresh install (a quick size check is sketched after the paths below):
C:\Program Files\Epic Games\UE_5.0\Engine\DerivedDataCache
C:\Program Files\Epic Games\UE_5.0\Engine\Intermediate
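For anyone who wants to check this on the runner, here's a minimal PowerShell sketch for totaling those folders. The paths assume the default locations listed above; $env:LOCALAPPDATA resolves to C:\Users\ec2-user\AppData\Local for the runner user:

# Rough on-disk size, in GB, of the folders that might be accumulating build output.
$folders = @(
    "$env:LOCALAPPDATA\UnrealEngine",
    "C:\Program Files\Epic Games\UE_5.0\Engine\DerivedDataCache",
    "C:\Program Files\Epic Games\UE_5.0\Engine\Intermediate"
)
foreach ($folder in $folders) {
    $bytes = (Get-ChildItem $folder -Recurse -File -ErrorAction SilentlyContinue |
        Measure-Object -Property Length -Sum).Sum
    "{0,-65} {1,8:N2} GB" -f $folder, ($bytes / 1GB)
}

Running it before and after a slow build would tell us whether those 8 GB of EBS writes are landing in any of these folders.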
Here's another interesting example where the workflow seems to have gone to lunch (3+ hours), with two different test job instances stuck in two different places.
https://github.com/CesiumGS/cesium-unreal/actions/runs/6225004383
Here's the GitHub Actions runner log from the one that's stuck at "Download plugin artifact": broken-runner.txt
And the other one: other-broken-runner.txt
The first of these shows that MsMpEng has really high total CPU usage, which is weird because the real-time scanner is supposed to be disabled.
PS C:\actions-runner\_diag> ps
Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
341 17 44080 53772 26.95 3712 0 amazon-cloudwatch-agent
189 13 20064 17704 1.30 4300 0 amazon-ssm-agent
86 6 3204 5012 0.03 2104 0 cmd
143 10 6520 13900 0.14 560 0 conhost
115 8 5364 11000 3.34 776 0 conhost
135 9 6488 12540 0.23 1564 0 conhost
142 10 6524 13856 0.33 2380 0 conhost
82 7 6172 10764 0.05 3632 0 conhost
541 20 2144 6548 1.22 548 0 csrss
164 10 1816 6000 0.17 652 1 csrss
632 27 18828 44104 0.61 1076 1 dwm
39 6 1520 4524 0.03 972 1 fontdrvhost
39 6 1352 4224 0.05 980 0 fontdrvhost
0 0 60 8 0 0 Idle
435 26 11268 45844 0.44 4064 1 LogonUI
1027 22 5648 17416 74.86 816 0 lsass
259 13 3036 11092 0.05 5008 0 msdtc
647 128 257300 208516 1,439.23 1644 0 MsMpEng
176 35 63960 19836 9.78 2728 0 node
739 31 94856 112436 1.73 3048 0 powershell
0 8 328 85072 0.38 156 0 Registry
587 51 48468 16940 2.17 3964 0 Runner.Listener
644 55 54552 17068 4.31 3764 0 Runner.Worker
467 11 4264 9168 0.78 796 0 services
57 4 1080 1272 0.08 432 0 smss
452 22 5788 17584 0.14 2212 0 spoolsv
138 10 1848 8424 0.02 2600 0 sqlwriter
408 19 26712 31108 13.98 4912 0 ssm-agent-worker
359 17 21636 23876 3.36 3752 0 ssm-session-worker
103 9 17412 6968 0.02 4008 0 start-amazon-cloudwatch-agent
657 15 5932 12292 2.11 496 0 svchost
259 13 3748 11796 0.31 500 0 svchost
261 11 2196 9208 0.13 604 0 svchost
526 19 4620 13060 0.16 844 0 svchost
206 11 2280 11832 0.03 896 0 svchost
741 15 4768 14104 0.56 936 0 svchost
185 11 1808 8548 0.05 1088 0 svchost
254 14 3156 14944 0.11 1104 0 svchost
339 14 11636 16672 0.77 1216 0 svchost
111 8 1220 5616 0.00 1224 0 svchost
115 14 2916 7424 0.09 1312 0 svchost
224 10 2108 8000 0.14 1352 0 svchost
294 20 8448 15916 0.05 1372 0 svchost
161 10 1660 7912 0.06 1388 0 svchost
236 13 2856 18380 1.27 1396 0 svchost
113 7 1176 5984 0.02 1404 0 svchost
148 9 1724 7984 0.02 1412 0 svchost
121 8 1232 5840 0.03 1504 0 svchost
378 14 3896 12108 0.06 1580 0 svchost
376 18 4740 15812 0.42 1604 0 svchost
271 16 3732 9924 1.30 1628 0 svchost
277 11 1912 8968 0.02 1636 0 svchost
122 8 1296 6288 0.00 1668 0 svchost
217 13 2132 10032 0.13 1676 0 svchost
134 9 1532 7008 0.02 1704 0 svchost
387 31 7056 16016 0.22 1728 0 svchost
173 11 1744 8448 0.06 1836 0 svchost
123 8 3788 10968 0.08 1848 0 svchost
393 13 2756 10564 0.16 1932 0 svchost
159 9 1716 7556 0.30 2028 0 svchost
118 7 1224 6152 0.00 2088 0 svchost
199 15 6668 11756 0.02 2220 0 svchost
400 16 9572 21092 1.97 2292 0 svchost
140 8 1592 7680 0.02 2300 0 svchost
364 15 2644 10900 0.05 2308 0 svchost
136 9 1492 6884 0.03 2316 0 svchost
127 8 1224 5836 0.00 2324 0 svchost
119 8 1400 11136 0.02 2348 0 svchost
213 11 2648 10892 0.05 2356 0 svchost
212 12 1672 7744 0.11 2364 0 svchost
277 27 3900 14512 53.52 2372 0 svchost
235 13 2912 12348 0.02 2732 0 svchost
129 8 1300 6680 0.02 2760 0 svchost
137 8 1316 6492 0.00 3008 0 svchost
212 14 1960 8600 0.03 3036 0 svchost
232 14 2652 12152 0.09 3064 0 svchost
116 8 1348 7316 0.00 3512 0 svchost
307 17 13584 20208 4.98 4332 0 svchost
235 12 3704 11060 0.03 4580 0 svchost
1572 0 44 128 47.34 4 0 System
155 11 1368 7196 0.06 676 0 wininit
229 12 2508 15924 0.14 752 1 winlogon
97 8 3200 7580 2.05 4340 0 winpty-agent
172 10 2256 9328 0.02 2836 0 WmiPrvSE
The second doesn't have that same pattern:
PS C:\actions-runner\_diag> ps
Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
335 17 44132 54484 24.45 3796 0 amazon-cloudwatch-agent
185 13 20068 17712 1.23 1800 0 amazon-ssm-agent
86 6 3200 5036 0.00 2088 0 cmd
81 7 6164 10680 0.16 3080 0 conhost
145 10 6560 13924 0.13 3648 0 conhost
142 10 6540 13896 0.03 4564 0 conhost
115 8 5184 10712 1.05 4668 0 conhost
135 9 6492 12752 0.16 4996 0 conhost
661 20 2408 6764 2.30 540 0 csrss
164 10 1820 5992 0.16 612 1 csrss
632 27 18840 44084 1.30 1084 1 dwm
39 6 1368 4236 0.00 924 0 fontdrvhost
39 6 1508 4520 0.02 932 1 fontdrvhost
0 0 60 8 0 0 Idle
436 26 10992 45524 0.39 4076 1 LogonUI
1033 22 5740 17504 1.53 776 0 lsass
259 13 3052 11132 0.05 3708 0 msdtc
593 92 206320 112904 27.88 796 0 MsMpEng
752 30 93300 106524 1.05 972 0 powershell
0 8 296 84356 0.48 156 0 Registry
741 51 45020 16280 2.14 1404 0 Runner.Listener
530 55 58648 16356 3.08 4272 0 Runner.Worker
463 12 4396 9276 0.91 756 0 services
57 4 1092 1284 0.09 428 0 smss
452 21 5748 17548 0.14 2168 0 spoolsv
138 10 1860 8432 0.02 2840 0 sqlwriter
402 19 27776 31852 14.20 3880 0 ssm-agent-worker
359 18 24220 25412 1.27 4576 0 ssm-session-worker
103 9 17176 6920 0.05 4724 0 start-amazon-cloudwatch-agent
261 12 2180 9208 0.09 484 0 svchost
527 19 4664 13068 0.19 868 0 svchost
736 15 4856 14172 0.38 896 0 svchost
651 16 6020 12312 2.11 1020 0 svchost
254 14 3156 14908 0.09 1060 0 svchost
184 11 1808 8552 0.03 1092 0 svchost
163 9 1784 7568 0.23 1152 0 svchost
212 14 1968 8628 0.02 1188 0 svchost
111 8 1240 5632 0.02 1204 0 svchost
336 14 11248 16312 0.81 1244 0 svchost
275 20 8708 16040 0.02 1276 0 svchost
115 14 2924 7460 0.05 1304 0 svchost
224 10 2088 8008 0.14 1332 0 svchost
161 10 1692 7944 0.06 1364 0 svchost
236 13 2800 18336 1.27 1408 0 svchost
113 7 1180 6000 0.02 1416 0 svchost
148 9 1708 7976 0.03 1424 0 svchost
376 14 3900 12188 0.05 1516 0 svchost
121 8 1240 5852 0.00 1560 0 svchost
206 11 2252 11824 0.00 1572 0 svchost
376 18 4788 15816 0.27 1596 0 svchost
268 15 3772 9936 1.59 1620 0 svchost
277 11 1908 9056 0.05 1628 0 svchost
217 13 2156 10060 0.13 1672 0 svchost
134 9 1556 6908 0.02 1704 0 svchost
388 31 7052 16008 0.19 1732 0 svchost
383 13 2708 10552 0.11 1880 0 svchost
173 11 1752 8368 0.09 1912 0 svchost
122 8 1256 6252 0.03 1972 0 svchost
118 7 1236 6168 0.17 2068 0 svchost
235 12 3692 11064 0.09 2208 0 svchost
136 9 1540 6932 0.05 2260 0 svchost
362 15 2700 10940 0.03 2268 0 svchost
137 8 1536 7632 0.02 2284 0 svchost
127 8 1240 5844 0.00 2292 0 svchost
272 27 3864 14488 0.27 2308 0 svchost
396 16 9032 20572 1.84 2316 0 svchost
212 12 1676 7748 0.06 2324 0 svchost
119 8 1368 11108 0.03 2332 0 svchost
213 12 2712 10928 0.06 2340 0 svchost
304 17 13104 19116 5.33 2412 0 svchost
232 14 2652 12164 0.05 2696 0 svchost
235 13 2980 12416 0.08 2704 0 svchost
129 8 1312 6696 0.02 2888 0 svchost
137 8 1320 6496 0.00 3056 0 svchost
116 8 1316 7304 0.03 3532 0 svchost
123 8 3768 10916 0.08 3996 0 svchost
118 8 1476 7152 0.02 4056 0 svchost
1539 0 40 140 11.77 4 0 System
155 11 1372 7240 0.06 636 0 wininit
229 12 2504 15916 0.05 688 1 winlogon
97 8 2512 6756 0.58 2908 0 winpty-agent
Both VMs are up and working just fine as far as I can tell. So I have no idea why the build has frozen.
Looking at the Actions log, it seems like the plugin artifact download finished fine, and then GitHub Actions decided to take a break and not do anything more. The machine and the service were seemingly still working, though, because they responded to the "cancel" hours later:
2023-09-18T19:11:27.3955128Z Artifact CesiumForUnreal-52-windows-v1.30.1-257-g6c159a0c was downloaded to d:\cesium-unreal\cesium-unreal\download
2023-09-18T19:11:27.3967735Z Artifact download has finished successfully
2023-09-19T01:02:02.0594284Z ##[error]The operation was canceled.
In fact, the VM is still up and working fine now (I removed the ghr:Application tag from the EC2 instance so that it wouldn't be shut down automatically).
@csciguy8 I'll give you the necessary access so you can dig into this yourself if you want.
Did a quick search for MsMpEng, and it seems to plague more than just us.
There are a variety of suggested solutions for disabling it further, from excluding its own directory from scans, to turning Windows Defender off via the Local Group Policy Editor, to just scheduling scans for some other time. Not sure which solution is best for these VMs...
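For what it's worth, the scheduling/catch-up angle can be poked at without Group Policy, via the Defender PowerShell cmdlets. A sketch, untested on these runner AMIs:

# Check what's currently configured, then push scheduled and catch-up scans out of the way.
# Run from an elevated PowerShell session on the runner.
Get-MpPreference | Select-Object DisableRealtimeMonitoring, ScanScheduleDay, ScanScheduleTime

Set-MpPreference -DisableRealtimeMonitoring $true   # supposedly already true on these images
Set-MpPreference -ScanScheduleDay 8                 # 8 = never run the scheduled scan
Set-MpPreference -DisableCatchupFullScan $true
Set-MpPreference -DisableCatchupQuickScan $true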
I ran Get-MpPreference toward the end of a long-running Unity build that was showing very high CPU usage by MsMpEng.exe. Here's the report:
mppreferences.txt
Nothing jumps out at me as problematic, but I haven't analyzed it in detail. DisableRealtimeMonitoring is true. As are DisableCatchupFullScan and DisableCatchupQuickScan. So I really have no idea what MsMpEng.exe is doing with all that CPU time.
I want to try this out next time I see this problem while the build is still running: https://learn.microsoft.com/en-us/microsoft-365/security/defender-endpoint/tune-performance-defender-antivirus
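That page mostly comes down to the Defender performance analyzer cmdlets. Assuming the Defender platform on the AMI is new enough to include them, the idea would be something like this (the .etl path is just a placeholder):

# Start recording what Defender scans while the build is misbehaving;
# the cmdlet waits, and you stop/save the trace when the slow stretch is over.
New-MpPerformanceRecording -RecordTo C:\Temp\defender-build.etl

# Then summarize the files, extensions, processes, and scans Defender spent the most time on.
Get-MpPerformanceReport -Path C:\Temp\defender-build.etl -TopFiles 10 -TopExtensions 10 -TopProcesses 10 -TopScans 10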
So I really have no idea what MsMpEng.exe is doing with all that CPU time.
Maybe scanning newly generated files from the build? An easy tweak might be to configure Defender to exclude the Unreal install folders, intermediate folders, and any artifact output folders.
With real-time monitoring disabled, it shouldn't be scanning newly-generated files like that. But obviously there's some flaw in my understanding of how this all works, so maybe an explicit exclude is worth a try.
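If we do try it, here's a minimal sketch of what the exclusions might look like. The paths are the ones mentioned earlier in this thread plus the D: checkout drive from the Actions log, and UnrealBuildTool.exe is just a guess at a useful process exclusion:

# Hypothetical exclusion list: the whole Unreal install (covers DerivedDataCache and
# Intermediate), the per-user Unreal data, and the CI working directory on D:.
$paths = @(
    "C:\Program Files\Epic Games\UE_5.0",
    "C:\Users\ec2-user\AppData\Local\UnrealEngine",
    "D:\cesium-unreal"
)
foreach ($path in $paths) {
    Add-MpPreference -ExclusionPath $path
}

# Optionally skip scanning the build tooling itself.
Add-MpPreference -ExclusionProcess "UnrealBuildTool.exe"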
Closing this.
I haven't seen this behavior in a while. My guess is that the switch to GitHub hosted runners fixed it.