runtime icon indicating copy to clipboard operation
runtime copied to clipboard

Tracking issue for CI build timeouts

Open jkotas opened this issue 3 years ago • 13 comments

 {
    "ErrorMessage" : "##[error]The operation was canceled.",
    "BuildRetry": false
 }

See https://github.com/dotnet/arcade/issues/11072 for more context.

Report

Build Definition Step Name Console log Pull Request
534201 dotnet/runtime Send tests to Helix (Windows) Log dotnet/runtime#97244
2358103 dotnet-runtime Build product Log
534171 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#96386
2358024 dotnet-runtime Build workload artifacts Log
2358017 dotnet-runtime Send job to Helix (Windows) Log
534048 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#97072
533692 dotnet/runtime Send to Helix Log dotnet/runtime#95758
2357687 dotnet-runtime Build product Log
533488 dotnet/runtime Build Tests Log dotnet/runtime#96678
533432 dotnet/runtime Build native test components Log dotnet/runtime#97077
533401 dotnet/runtime Build product Log dotnet/runtime#97212
533135 dotnet/runtime Build Log dotnet/runtime#97075
533061 dotnet/runtime Build Log dotnet/runtime#97202
533081 dotnet/runtime Build product Log dotnet/runtime#97182
533018 dotnet/runtime Build product Log dotnet/runtime#96807
532656 dotnet/runtime Build product Log dotnet/runtime#97185
532798 dotnet/runtime LLVM AOT compile CoreCLR tests Log dotnet/runtime#97189
532750 dotnet/runtime Build native test components Log dotnet/runtime#97129
532726 dotnet/runtime Send to Helix Log dotnet/runtime#97075
532677 dotnet/runtime Build native test components Log dotnet/runtime#97090
532618 dotnet/runtime Build Log dotnet/runtime#96826
2357085 dotnet-runtime Build product Log
532529 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#97114
532532 dotnet/runtime Build product Log dotnet/runtime#95232
532437 dotnet/runtime Generate AOT offsets Log dotnet/runtime#97159
532325 dotnet/runtime Send to Helix Log dotnet/runtime#95758
531991 dotnet/runtime Build native test components Log dotnet/runtime#97087
531999 dotnet/runtime Run Trimming Tests Log dotnet/runtime#96807
2356757 dotnet-runtime Generate AOT offsets Log
2356610 dotnet-runtime Send job to Helix (Unix) Log
531705 dotnet/runtime Build CoreCLR Runtime Log
531697 dotnet/runtime Build product Log
531606 dotnet/runtime Build native test components Log
531515 dotnet/runtime LLVM AOT compile CoreCLR tests Log dotnet/runtime#97076
2356425 dotnet-runtime Send job to Helix (Unix) Log
531484 dotnet/runtime Build product Log dotnet/runtime#97114
531459 dotnet/runtime Prepare TestHost with runtime CoreCLR Log dotnet/runtime#97090
2356407 dotnet-runtime Build product Log
531361 dotnet/runtime Send tests to Helix (Windows) Log dotnet/runtime#97130
2356226 dotnet-runtime Send job to Helix (Unix) Log
531205 dotnet/runtime Send to Helix Log dotnet/runtime#95758
2356184 dotnet-runtime Send job to Helix (Windows) Log
2356223 dotnet-runtime Build product Log
531058 dotnet/runtime Build Tests Log dotnet/runtime#97118
530819 dotnet/runtime Build product Log dotnet/runtime#97104
530821 dotnet/runtime Build product Log dotnet/runtime#97104
530579 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#88167
530528 dotnet/runtime Restore and Build Product Log dotnet/runtime#90505
530465 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#97052
530438 dotnet/runtime Build product Log dotnet/runtime#97011
2355601 dotnet-runtime Send job to Helix (Unix) Log
530184 dotnet/runtime Build product Log dotnet/runtime#96763
530027 dotnet/runtime Build product Log
530017 dotnet/runtime Build product Log
529925 dotnet/runtime Build product Log dotnet/runtime#97079
2355481 dotnet-runtime Send job to Helix (Unix) Log
529815 dotnet/runtime Build CoreCLR Runtime Log
2355307 dotnet-runtime Download IntermediateArtifacts Log
529747 dotnet/runtime LLVM AOT compile CoreCLR tests Log dotnet/runtime#97076
529697 dotnet/runtime Build Tests Log dotnet/runtime#95980
529637 dotnet/runtime Send tests to Helix (Unix) Log
529699 dotnet/runtime Build product Log dotnet/runtime#95980
2355305 dotnet-runtime Send job to Helix (Windows) Log
2355308 dotnet-runtime Send job to Helix (Windows) Log
2355312 dotnet-runtime Send job to Helix (Windows) Log
529666 dotnet/runtime Send to Helix Log
529626 dotnet/runtime Build native test components Log
2355315 dotnet-runtime Build product Log
2355311 dotnet-runtime Build CoreCLR Runtime Log
529634 dotnet/runtime Build product Log
2355267 dotnet-runtime Send job to Helix (Unix) Log
529597 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#97064
2355245 dotnet-runtime Send job to Helix (Windows) Log
529434 dotnet/runtime Send tests to Helix (Unix) Log
529450 dotnet/runtime Build Tests Log dotnet/runtime#97053
529452 dotnet/runtime Build product Log dotnet/runtime#97053
529381 dotnet/runtime Build product Log dotnet/runtime#96650
529303 dotnet/runtime Build product Log dotnet/runtime#96808
529225 dotnet/runtime Send to Helix Log
529237 dotnet/runtime Build product Log dotnet/runtime#97047
529020 dotnet/runtime Send tests to Helix (Unix) Log
2354830 dotnet-runtime Send job to Helix (Windows) Log
529070 dotnet/runtime Send to Helix Log
529009 dotnet/runtime Build native test components Log
2354813 dotnet-runtime Build product Log
2354675 dotnet-runtime Generate AOT offsets Log
528584 dotnet/runtime Build CoreCLR Runtime Log dotnet/runtime#96676
528546 dotnet/runtime Build product Log dotnet/runtime#88167
528325 dotnet/runtime Send tests to Helix (Unix) Log
528411 dotnet/runtime Build product Log dotnet/runtime#97023
2354374 dotnet-runtime Send job to Helix (Windows) Log
2354377 dotnet-runtime Restore and Build Product Log
528321 dotnet/runtime Build CoreCLR Runtime Log
528320 dotnet/runtime Build CoreCLR Runtime Log
528099 dotnet/runtime Build native test components Log dotnet/runtime#96960
528107 dotnet/runtime Build product Log dotnet/runtime#97012
2354095 dotnet-runtime Build product Log
527910 dotnet/runtime Send tests to Helix (Windows) Log dotnet/runtime#96650
527349 dotnet/runtime Build native test components Log dotnet/runtime#96982
527282 dotnet/runtime Build product Log
Displaying 100 of 842 results

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
3 106 842

Known issue validation

Build: :mag_right: Result validation: :warning: Provided build not found. Provide a valid build in the "Build: :mag_right:" line. Validation performed at: 11/30/2023 9:39:02 PM UTC

jkotas avatar Sep 30 '22 16:09 jkotas

Tagging subscribers to this area: @dotnet/runtime-infrastructure See info in area-owners.md if you want to be subscribed.

Issue Details
 {
    "ErrorMessage" : "The operation was canceled.",
    "BuildRetry": false
 }

See https://github.com/dotnet/arcade/issues/11072 for more context.

Author: jkotas
Assignees: -
Labels:

area-Infrastructure, Known Build Error

Milestone: -

ghost avatar Sep 30 '22 16:09 ghost

#82476 can be ignored, it's intentionally running tests in a slower mode to flush out failures

kg avatar Feb 23 '23 02:02 kg

I have noticed that the osx-x64 Release NativeAOT timeouted in my PR #85366, but from its log, it is obvious that it has completed running (https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/252395/logs/1833). The leg took few seconds over two hours. Maybe we could bump the timeout just a bit.

janvorli avatar Apr 27 '23 12:04 janvorli

I have noticed that the osx-x64 Release NativeAOT timeouted in my PR https://github.com/dotnet/runtime/pull/85366, but from its log, it is obvious that it has completed running (https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/252395/logs/1833)

The log shows that the leg was waiting for 1+ hour for a Helix machine. This happens when we are sending more work to a Helix queue than what it has capacity for. Increasing the timeouts is not going to fix it.

jkotas avatar Apr 27 '23 14:04 jkotas

Ah, I can see it now, I've missed that.

janvorli avatar Apr 27 '23 14:04 janvorli

Looking at this, our queue depth is not looking great for OSX 12.

image

Should we be trying to spill some work to 13 and 11?

cc: @dotnet/runtime-infrastructure @dotnet/dnceng

hoyosjs avatar Apr 27 '23 22:04 hoyosjs

Hmm, These are the largest work producers:

image

With PRs like https://github.com/dotnet/runtime/pull/85389 https://github.com/dotnet/runtime/pull/85343 being a ton of the work.

hoyosjs avatar Apr 27 '23 22:04 hoyosjs

image

The runtime time distribution seems to be consistent over the last couple weeks. We are 263043/307320 workitems sent over the last 3 days to the queue. That being said. It looks like in terms of workitem compute time, ~it looks like SDK is sending very long lived workitems~ what makes SDK take so much compute time is the different branches.

hoyosjs avatar Apr 27 '23 23:04 hoyosjs

The following two CI legs have been failing due to timeouts for a at least two weeks:

  • Libraries Test Run release mono linux x64 Debug
  • Libraries Test Run release coreclr linux x64 Debug

@vcsjones is facing it in https://github.com/dotnet/runtime/pull/87099 as well.

Is anyone investigating this issue?

adamsitnik avatar Jun 15 '23 14:06 adamsitnik

Seeing the same legs time out on as @adamsitnik, too (Libraries Test Run release {mono,coreclr} linux x64 Debug)

I see something like this in both of the console logs:

  Sending Job to (Centos.8.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:centos-stream8-helix...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/f0271bc6-76fb-445b-a33f-c02c71caa5fb/workitems?api-version=2019-06-17
  Sending Job to (AlmaLinux.8.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:almalinux-8-helix-amd64...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/c66c9c1c-8014-44d0-a94d-9aa1697c740d/workitems?api-version=2019-06-17
  Sending Job to (Debian.11.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:debian-11-helix-amd64...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/16e4db17-152c-4b75-af4a-d6e4e924cb10/workitems?api-version=2019-06-17
  Sending Job to Ubuntu.1804.Amd64.Open...
  Sent Helix Job; see work items at https://helix.dot.net/api/jobs/a7884197-1f40-4f66-bcdd-ae3b24af1100/workitems?api-version=2019-06-17
  Waiting for completion of job a7884197-1f40-4f66-bcdd-ae3b24af1100 on Ubuntu.1804.Amd64.Open (Details: https://helix.dot.net/api/jobs/a7884197-1f40-4f66-bcdd-ae3b24af1100/details?api-version=2019-06-17 )
  Waiting for completion of job c66c9c1c-8014-44d0-a94d-9aa1697c740d on (AlmaLinux.8.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:almalinux-8-helix-amd64 (Details: https://helix.dot.net/api/jobs/c66c9c1c-8014-44d0-a94d-9aa1697c740d/details?api-version=2019-06-17 )
  Waiting for completion of job f0271bc6-76fb-445b-a33f-c02c71caa5fb on (Centos.8.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:centos-stream8-helix (Details: https://helix.dot.net/api/jobs/f0271bc6-76fb-445b-a33f-c02c71caa5fb/details?api-version=2019-06-17 )
  Waiting for completion of job 16e4db17-152c-4b75-af4a-d6e4e924cb10 on (Debian.11.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:debian-11-helix-amd64 (Details: https://helix.dot.net/api/jobs/16e4db17-152c-4b75-af4a-d6e4e924cb10/details?api-version=2019-06-17 )
  Job f0271bc6-76fb-445b-a33f-c02c71caa5fb on (Centos.8.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:centos-stream8-helix is completed with 243 finished work items.
  Job 16e4db17-152c-4b75-af4a-d6e4e924cb10 on (Debian.11.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:debian-11-helix-amd64 is completed with 243 finished work items.
  Job a7884197-1f40-4f66-bcdd-ae3b24af1100 on Ubuntu.1804.Amd64.Open is completed with 243 finished work items.
##[error]The operation was canceled.

On both legs, it looks like it's the (AlmaLinux.8.Amd64.Open) job that didn't finish. I wonder if it is the same distro on other runs, too.

lambdageek avatar Jun 16 '23 16:06 lambdageek

There's another issue about the recent AlmaLinux timeouts https://github.com/dotnet/runtime/issues/87667 - the AlmaLinux leg has been removed by https://github.com/dotnet/runtime/pull/87668

lambdageek avatar Jun 16 '23 19:06 lambdageek

This issue seems to match also generic test timeout failures, such as

System.Net.Http.HttpRequestException : Requesting HTTP version 3.0 with version policy RequestVersionExact while unable to establish HTTP/3 connection.
---- System.OperationCanceledException : The operation was canceled.

is that the intent here?

rzikm avatar Nov 30 '23 14:11 rzikm

is that the intent here?

It is not the indent. I have tweaked the matching string.

jkotas avatar Nov 30 '23 21:11 jkotas

Removing Known Build Error label because it is marking Build Analysis green. We want developers either to wait until infra timeout issue is resolved and rerun their tests or to use their judgement to use the escape path to merge.

JulieLeeMSFT avatar Mar 20 '24 18:03 JulieLeeMSFT

Known Build Error label is useful to get stats on the infra issues such as timeout. On the other hand, it allows PRs to get merged whle tests are not fully run. So, it is something we need to think about how we should handle such cases.

JulieLeeMSFT avatar Mar 20 '24 19:03 JulieLeeMSFT

Known Build Error label is useful to get stats on the infra issues such as timeout

This was the sole purpose of this tracking issue. If we do not have it marked as Known Build Error, it can be closed.

jkotas avatar Mar 20 '24 21:03 jkotas