[wasm] AOT tests *build* timing out on Linux

Open radical opened this issue 1 year ago • 11 comments

AOT, and sometimes EAT (EnableAggressiveTrimming), builds have been timing out on Linux. The first failing rolling AOT build was 33446fb1. Note that the corresponding EAT build did not fail. The first EAT failure was for 22ba7d60.

The last successful build was 4c500699.

The changes responsible should be in 4c500699...33446fb1, or at the outside, in 4c500699...22ba7d60.
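To enumerate the candidate commits in that range locally, something like the following could be used. This is only a minimal sketch: it assumes a local dotnet/runtime checkout with `git` on the PATH, and the function names are hypothetical helpers, not part of any repo tooling. It uses the two-dot form `good..bad`, which lists commits reachable from the failing commit but not from the last good one.

```python
import subprocess


def range_log_command(good: str, bad: str) -> list[str]:
    """Build the `git log` invocation listing commits after `good` up to `bad`."""
    return ["git", "log", "--oneline", f"{good}..{bad}"]


def candidate_commits(repo_dir: str, good: str, bad: str) -> list[str]:
    """Run the command inside `repo_dir` and return one line per candidate commit."""
    result = subprocess.run(
        range_log_command(good, bad),
        cwd=repo_dir,          # assumed: path to a local checkout
        capture_output=True,
        text=True,
        check=True,            # raise if git fails (e.g. unknown SHA)
    )
    return result.stdout.splitlines()


# Example call (SHAs taken from the issue text; "." assumes the current
# directory is the checkout):
#   for line in candidate_commits(".", "4c500699", "33446fb1"):
#       print(line)
```

The same helper can be called with 22ba7d60 as the second SHA to cover the wider range.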

Known Issue Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Report

Summary

24-Hour Hit Count | 7-Day Hit Count | 1-Month Count
0 | 0 | 0

radical, Jan 16 '24 20:01

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: radical
Assignees: -
Labels: arch-wasm, blocking-clean-ci, area-Build-mono

Milestone: -

ghost, Jan 16 '24 20:01

https://dev.azure.com/dnceng-public/public/_build/results?buildId=538124&view=logs&j=294c8fbc-17f0-5954-a99b-5617e6d3116c&t=0ddd52f3-02e2-5b0f-59e3-5c911ea97724&l=1

lewing, Jan 24 '24 19:01

@vitek-karas this appears to be a linker hang. It is difficult to see in action because AzDO clears the data when it cancels the task.

lewing, Jan 24 '24 19:01

Tagging subscribers to this area: @agocke, @sbomer, @vitek-karas
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: radical
Assignees: -
Labels: arch-wasm, blocking-clean-ci, untriaged, area-Build-mono, area-Tools-ILLink

Milestone: -

ghost, Jan 24 '24 19:01

> @vitek-karas this appears to be a linker hang, it is difficult to see in action because AZDO clears the data when it cancels the task

This is the last log from the LibraryTests_AOT line before it gets cleared by AzDO: [image]

matouskozak, Jan 24 '24 19:01

As a workaround, would it be possible to stop building tests in parallel in these lanes?

vargaz, Jan 24 '24 20:01

We're going to try https://github.com/dotnet/runtime/pull/97491 to work around this

lewing, Jan 25 '24 17:01

Sorry for the delay. I tried to repro this, but it's hard to tell. Locally (on a DevBox in WSL) the full AOT build took 18+ minutes, but it failed when packing the tests (no idea why), so it never really finished. It did run all of the trimming, though, and I didn't see any specific trimming step take a really long time. What we would probably need is to try this on bits from before it started to fail and from after. It's possible something made trimming slower in general and the overall build just crosses some threshold.

@matouskozak would you be able to try to get this data?

vitek-karas, Jan 30 '24 09:01
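The before/after comparison requested above could be organized along these lines. This is a sketch under stated assumptions: the per-step trimming durations are assumed to have already been extracted from the two builds' logs into name-to-seconds mappings (the extraction is not shown), and the function names and the 1.5x threshold are hypothetical choices for illustration.

```python
def slowdowns(before: dict[str, float], after: dict[str, float], min_ratio: float = 1.5):
    """Return (name, before_s, after_s, ratio) for steps at least
    `min_ratio` times slower in the later build, worst first."""
    rows = []
    for name, old in before.items():
        new = after.get(name)
        if new is None or old <= 0:
            continue  # step missing from the later build, or no usable baseline
        ratio = new / old
        if ratio >= min_ratio:
            rows.append((name, old, new, ratio))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


def total_change(before: dict[str, float], after: dict[str, float]):
    """Total seconds (before, after) over the steps present in both builds."""
    shared = before.keys() & after.keys()
    return sum(before[n] for n in shared), sum(after[n] for n in shared)
```

A few large entries from `slowdowns` would point at a specific step that regressed, while an empty list combined with a noticeably larger total from `total_change` would fit the "everything got a bit slower and the build crossed a threshold" explanation.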

> We're going to try #97491 to work around this

@lewing looks like https://github.com/dotnet/runtime/pull/97491 fixed the EAT lines, but the Linux LibraryTests_AOT line is still crashing (https://dev.azure.com/dnceng-public/public/_build/results?buildId=546947&view=logs&j=58dc7ccb-0414-5dd3-62a5-bf2e63258b7c&t=4105ec49-25d1-5748-9e28-e40bff74a16b)

matouskozak, Feb 01 '24 10:02

> > We're going to try #97491 to work around this
>
> @lewing looks like #97491 fixed the EAT lines, but the linux LibraryTests_AOT line is still crashing (https://dev.azure.com/dnceng-public/public/_build/results?buildId=546947&view=logs&j=58dc7ccb-0414-5dd3-62a5-bf2e63258b7c&t=4105ec49-25d1-5748-9e28-e40bff74a16b)

@matouskozak I guess we could disable the parallel build there too? Do we know if this is just excessive slowness or something else? The EAT lanes just trim without doing AOT, so I wouldn't expect them to time out.

lewing, Feb 05 '24 17:02

@rmarinho you mentioned you were seeing something like this in the Preview 2 builds you were testing. Can you try again and let us know if setting the CPUs to 1 avoids your issue as well?

lewing, Feb 12 '24 19:02