Infrastructure - Status/Health

Open jashook opened this issue 6 years ago • 35 comments

Overview

Please use these queries to discover issues

Blocking CI

| Status | Issue | Build Count |
| --- | --- | --- |
| :fire: | xunit.extensibility.execution 2.4.2-pre.22 - ` primary signature's s ... | 0 |
| :fire: | System.Net.Quic.Functional.Tests - Helix timeout | N/A |
| :fire: | Mono llvmaot failing for some HardwareIntrinsics tests | N/A |
| :fire: | Segmentation fault (SIGSEGV) on LinuxBionic | N/A |
| :fire: | [release/7.0][wasm] Multiple JIT.* runtime test failures on rolling ... | N/A |
| :fire: | Native crash in JIT.Regression for `mono llvmfullaot Pri0 Runtime Te ... | N/A |
| :fire: | System.Data.OleDb.Test - timeout and hangs | N/A |
| :fire: | dotnet build returns exit code 1 even when the build passes | N/A |
| :fire: | [wasm] Test run crashing with exiting due to exception: 613560632 fr ... | N/A |
| :fire: | [wasm][aot] Uncaught errors in System.Runtime.Tests | N/A |
| :fire: | Mono crashes test runs | N/A |
| :fire: | mono minijit runtime tests failing with `coreclr_initialize failed - ... | N/A |

Blocking CI Optional

| Status | Issue | Build Count |
| --- | --- | --- |
| :fire: | Test failure CoreMangLib\system\multicastdelegate\MulticastDelegate ... | N/A |
| :fire: | Test failure System.Transactions.Tests.OleTxTests.GetDtcTransaction | N/A |
| :fire: | Test failure profiler\gc\gcbasic\gcbasic.cmd | N/A |
| :fire: | Test failure readytorun\coreroot_determinism\coreroot_determinism\c ... | N/A |
| :fire: | Test failure JIT/Regression/VS-ia64-JIT/V2.0-RTM/b539509/b539509/b5395 ... | N/A |
| :fire: | [iOS] XslCompiledTransform tests failing on devices | N/A |
| :fire: | Test failure readytorun/coreroot_determinism/coreroot_determinism/core ... | N/A |
| :fire: | Test failure System.Xml.Tests.SubtreeReaderTest.RunTests(testCase: 105 ... | N/A |
| :fire: | Test failure reflection/Tier1Collectible/Tier1Collectible/Tier1Collect ... | N/A |
| :fire: | Test failure System.IO.Tests.FileInfo_SymbolicLinks.CreateSymbolicLink ... | N/A |
| :fire: | [wasm] dotnet-runtime-perf: wasm-opt failing with `unexpected expr ... | N/A |
| :fire: | ILASM roundtrip failure - src/tests/readytorun/tests/mainv2 | N/A |
| :fire: | Test failure JIT\jit64\opt\cse\HugeField1\HugeField1.cmd | N/A |
| :fire: | Test failure readytorun\tests\mainv1\mainv1.cmd | N/A |
| :fire: | Test failure file_io.GetSystemTimeAsFileTime.test1 | N/A |
| :fire: | Test failure System.IO.Pipes.Tests.AnonymousPipeTest_CrossProcess.Serv ... | N/A |
| :fire: | Test failure System.Text.Json.Serialization.Tests.StreamTests_Sync.Han ... | N/A |
| :fire: | Test failure JIT/Methodical/eh/interactions/gcincatch_ro/gcincatch_ro.cmd | N/A |
| :fire: | Harden CoreclrTestLib against child-proc exited | N/A |
| :fire: | Test failure System.Threading.Channels.Tests.SyncMultiReaderUnboundedC ... | N/A |
| :fire: | Test failure JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd | N/A |

Blocking Outerloop

| Status | Issue | Build Count |
| --- | --- | --- |
| :fire: | Test failure Loader\classloader\DictionaryExpansion\DictionaryExpan ... | N/A |
| :fire: | Test failure JIT\Regression\VS-ia64-JIT\V1.2-M02\b28158\b28158_64 ... | N/A |
| :fire: | Test failure JIT\jit64\opt\cse\hugeexpr1\hugeexpr1.cmd | N/A |
| :fire: | Test failure System.Security.Cryptography.X509Certificates.Tests.Revoc ... | N/A |
| :fire: | Test failure System.Diagnostics.Tests.ProcessStartInfoTests.StartInfo_ ... | N/A |

Goals

  1. A minimum 95% passing rate for the runtime pipeline

Resources

  1. runtime pipeline analytics

jashook avatar Oct 16 '19 17:10 jashook

/cc @dotnet/coreclr-infra

jashook avatar Oct 16 '19 17:10 jashook

/cc @dotnet/jit-contrib

jashook avatar Oct 16 '19 17:10 jashook

https://github.com/dotnet/coreclr/issues/26057 Failed to resolve SDK 'Microsoft.DotNet.Helix.Sdk'

jkotas avatar Oct 20 '19 13:10 jkotas

dotnet/coreclr#27453 Test Infrastructure Failure: Access to the path ... is denied

jkotas avatar Oct 26 '19 00:10 jkotas

Summary of the week of 21-Oct-2019

Problems (cross out signifies the problem is fixed)

  1. ~~OSX build machines failing to take work~~
  2. Failed to resolve SDK 'Microsoft.DotNet.Helix.Sdk' dotnet/coreclr#26057
  3. ~~Linux arm musl build fails~~
  4. ~~Clang 5.0 throws a stack trace occasionally when building arm or arm64 targets~~

jashook avatar Oct 28 '19 17:10 jashook

@jashook does this issue need a new owner while you are on vacation?

sandreenko avatar Dec 09 '19 19:12 sandreenko

I assume the ownership is now shared.

/cc @trylek @ViktorHofer @jkoritzinsky @dagood @jaredpar

jashook avatar Dec 11 '19 12:12 jashook

The Libraries Build Windows_NT x86 Release leg is failing 100% of the time. See https://github.com/dotnet/runtime/pull/967

jkotas avatar Dec 17 '19 03:12 jkotas

Interesting. To me it looks like a code issue rather than an infra hiccup though:

Fatal error. 0xC0000005
   at DynamicClass.WriteTypeWithDateTimeOffsetTypePropertyToJson(System.Runtime.Serialization.XmlWriterDelegator, System.Object, System.Runtime.Serialization.Json.XmlObjectSerializerWriteContextComplexJson, System.Runtime.Serialization.ClassDataContract, System.Xml.XmlDictionaryString[])
   at System.Runtime.Serialization.Json.JsonClassDataContract.WriteJsonValueCore(System.Runtime.Serialization.XmlWriterDelegator, System.Object, System.Runtime.Serialization.Json.XmlObjectSerializerWriteContextComplexJson, System.RuntimeTypeHandle)
   at System.Runtime.Serialization.Json.JsonDataContract.WriteJsonValue(System.Runtime.Serialization.XmlWriterDelegator, System.Object, System.Runtime.Serialization.Json.XmlObjectSerializerWriteContextComplexJson, System.RuntimeTypeHandle)
   at System.Runtime.Serialization.Json.DataContractJsonSerializerImpl.WriteJsonValue(System.Runtime.Serialization.Json.JsonDataContract, System.Runtime.Serialization.XmlWriterDelegator, System.Object, 

Perhaps some traditional shenanigans regarding time zone settings on the test machines?

trylek avatar Dec 17 '19 09:12 trylek

cc @ahsonkhan @steveharter

ViktorHofer avatar Dec 17 '19 11:12 ViktorHofer

That looks like it came from my dotnet/runtime#737, I'm pretty sure I know what the problem is. The odd thing is that the CI was green when the PR was merged and it's not very clear why.

mikedn avatar Dec 17 '19 12:12 mikedn

Looks like the same CI (Libraries Build Windows_NT x86 Release) failed in my PR dotnet/runtime#842

https://helix.dot.net/api/2019-06-17/jobs/068da4f0-9282-4e18-a1de-c2baaecf32b0/workitems/System.Runtime.Serialization.Json.Tests/console

Fatal error. 0xC0000005
   at DynamicClass.WriteTypeWithDateTimeOffsetTypePropertyToJson(System.Runtime.Serialization.XmlWriterDelegator, System.Object, System.Runtime.Serialization.Json.XmlObjectSerializerWriteContextComplexJson, System.Runtime.Serialization.ClassDataContract, System.Xml.XmlDictionaryString[])
   at System.Runtime.Serialization.Json.JsonClassDataContract.WriteJsonValueCore(System.Runtime.Serialization.XmlWriterDelegator, System.Object, System.Runtime.Serialization.Json.XmlObjectSerializerWriteContextComplexJson, System.RuntimeTypeHandle)
   at System.Runtime.Serialization.Json.JsonDataContract.WriteJsonValue(System.Runtime.Serialization.XmlWriterDelegator, System.Object, System.Runtime.Serialization.Json.XmlObjectSerializerWriteContextComplexJson, System.RuntimeTypeHandle)

henrikse55 avatar Dec 17 '19 13:12 henrikse55

> That looks like it came from my dotnet/runtime#737, I'm pretty sure I know what the problem is. The odd thing is that the CI was green when the PR was merged and it's not very clear why.

Your PR was green because of the current state we’re in.

  1. In order to achieve building live live, we needed to disable running the libraries tests on coreclr PRs.
  2. The only pipeline that runs libraries tests is runtime-libraries which is conditioned to only run when the change includes a change to src/libraries/*. Since your change only touched coreclr, it didn’t run in your PR.

I’m working on fixing this by moving to a single pipeline that always runs, so that libraries tests run whenever coreclr or libraries are touched.
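To illustrate why a coreclr-only change skipped the libraries tests, here is a minimal Python sketch of path-based triggering. It is not the actual pipeline definition, which is not shown in this thread: the function name, the `src/coreclr/*` pattern, and the example file path are all assumptions made for illustration. Only the `src/libraries/*` condition comes from the explanation above.

```python
from fnmatch import fnmatchcase

# Pattern taken from the description above: the libraries pipeline only
# triggers on changes under src/libraries/.
LIBRARIES_ONLY = ["src/libraries/*"]

# Hypothetical filter for the planned single pipeline (an assumption, not
# the real configuration): trigger on coreclr or libraries changes.
CORECLR_OR_LIBRARIES = ["src/coreclr/*", "src/libraries/*"]


def should_run_libraries_tests(changed_files, include_patterns):
    """Return True if any changed file matches any include pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in include_patterns
    )


# Hypothetical file path standing in for a coreclr-only change.
coreclr_only_change = ["src/coreclr/src/jit/lower.cpp"]

print(should_run_libraries_tests(coreclr_only_change, LIBRARIES_ONLY))
print(should_run_libraries_tests(coreclr_only_change, CORECLR_OR_LIBRARIES))
```

With the libraries-only filter the coreclr change does not trigger the tests, which is how a regression in libraries tests could merge with a green CI. With the broader filter it would.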

safern avatar Dec 17 '19 14:12 safern

CoreCLR Test Run Windows_NT arm legs are failing 100% for all PRs currently (https://github.com/dotnet/runtime/issues/1097)

jkotas avatar Dec 22 '19 13:12 jkotas

dotnet/runtime#129 Non-deterministic failure hit by CoreCLR tests: Assert failure(PID 2664 [0x00000a68], Thread: 3612 [0x0e1c]): pMethodDesc->GetCallCounter()->IsCallCountingEnabled(pMethodDesc)

jkotas avatar Jan 15 '20 04:01 jkotas

Updated issue for failure in CoreCLR Pri0 Test Run Windows_NT x64 checked tests timing out:

cmdLine:C:\h\w\A995095C\w\B9DA09DC\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd Timed Out
      Test Harness Exitcode is : -100
      To run the test:
      > set CORE_ROOT=C:\h\w\A995095C\p
      > C:\h\w\A995095C\w\B9DA09DC\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd
      Expected: True
      Actual:   False

cc @josalem, who is working on fixing it.

safern avatar Jan 16 '20 21:01 safern

https://github.com/dotnet/runtime/issues/2209: Unable to pull image mcr.microsoft.com/...

jkotas avatar Jan 26 '20 18:01 jkotas

A lot of PRs are failing with: Unhandled exception. System.IO.FileLoadException: Could not load file or assembly 'System.Runtime.CompilerServices.Unsafe, Version=5.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. The located assembly's manifest definition does not match the assembly reference. (0x80131040 (FUSION_E_REF_DEF_MISMATCH))\nFile name: 'System.Runtime.CompilerServices.Unsafe, Version=5.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'

https://github.com/dotnet/runtime/pull/2344 is reverting the change that introduced the problem.

jkotas avatar Jan 29 '20 16:01 jkotas

I just saw an issue related to helix in one of my PRs:

https://github.com/dotnet/core-eng/issues/8694

Adding to the description.

safern avatar Jan 30 '20 01:01 safern

It is not convenient to keep updating this issue with all intermittent test failures hit by the CI. I have started marking issues that are intermittently causing CI failures with the blocking-clean-ci label. This label did exist in the repo, but it was not used for a while - time to start using it again.

Query: https://github.com/dotnet/runtime/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Ablocking-clean-ci+
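For anyone who wants to build the same search link from a script, a small sketch using only the Python standard library follows. The function name and its default arguments are illustrative assumptions. The search terms (`is:issue`, `is:open`, `label:blocking-clean-ci`) are taken from the query link above.

```python
from urllib.parse import urlencode


def blocking_ci_query_url(repo="dotnet/runtime", label="blocking-clean-ci"):
    """Build a GitHub issue-search URL for open issues carrying the label."""
    query = f"is:issue is:open label:{label}"
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return f"https://github.com/{repo}/issues?" + urlencode({"q": query})


print(blocking_ci_query_url())
```

The resulting link carries the same `q` parameter as the query above, without the `utf8` parameter.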

jkotas avatar Feb 04 '20 08:02 jkotas

> It is not convenient to keep updating this issue with all intermittent test failures hit by the CI

Agree. Going forward I would prefer this be more of a status page for the repository. A place to visit to quickly check if you're running into a known issue and link to a place to find more information.

> I have started marking issues that are intermittently causing CI failures with the blocking-clean-ci label. This label did exist in the repo, but it was not used for a while - time to start using it again.

+1

jaredpar avatar Feb 04 '20 16:02 jaredpar

Link dotnet/runtime#32835

sandreenko avatar Feb 25 '20 23:02 sandreenko

Unpinning for a bit

danmoseley avatar Jun 23 '20 16:06 danmoseley

@danmosemsft why? This is pinned so that devs can find active infra issues easily.

jaredpar avatar Jun 23 '20 16:06 jaredpar

Because we can only pin 3 issues and I added a new one. Which do you want to drop? :)

danmoseley avatar Jun 23 '20 17:06 danmoseley

FWIW, I find the permanently pinned issues distracting. I am actively forcing myself to avoid clicking on the "x" button because it would unpin the issue for everybody. I have done it several times by accident. Muscle memory: you see "x" next to a thing that you do not want to see anymore, so you automatically click it to make it go away. I wish GitHub allowed me to hide the pinned issues that I have seen a hundred times already.

jkotas avatar Jun 23 '20 17:06 jkotas

I hear you - I don't know that we have any better way to communicate with the community, unless we add something to the top of the readme, which may not be noticed.

danmoseley avatar Jun 23 '20 17:06 danmoseley

I am wondering who is the target audience for the pinned issues. The pinned issues communicate the following currently:

  • We have 12 overarching themes that we are working on in .NET 5. Should this rather be mentioned next to the roadmap link in the readme?
  • We have 30 different sources of CI and official build flakiness
  • We are using 5.0 as the milestone for .NET 5.0 issues
  • We are going to rename the test directory in two weeks

Are these the most important things we want everybody in the community to know?

jkotas avatar Jun 23 '20 18:06 jkotas

😁 Maybe the stickies should only be for announcements (sticks around for a week or two only) and anything else should be linked from the readme. My guess is that nobody reads the readme after they've read it once, of course.

danmoseley avatar Jun 23 '20 19:06 danmoseley

Looks like the outerloop job is not in good shape right now (pipeline): all Linux_musl x64/arm64 send tests are failing with:

2020-08-22T21:30:49.2246346Z   Uploading payloads for Job on (Alpine.312.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:alpine-3.12-helix-20200602002622-e06dc59...
2020-08-22T21:30:49.2279177Z /__w/1/s/.packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20407.3/tools/Microsoft.DotNet.Helix.Sdk.MonoQueue.targets(47,5): error : Correlation Payload '/__w/1/s/artifacts/tests/coreclr/Linux_musl.x64.Checked/Tests/Core_Root/' not found. [/__w/1/s/src/coreclr/tests/helixpublishwitharcade.proj]
2020-08-22T21:30:49.2404592Z ##[error].packages/microsoft.dotnet.helix.sdk/5.0.0-beta.20407.3/tools/Microsoft.DotNet.Helix.Sdk.MonoQueue.targets(47,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) Correlation Payload '/__w/1/s/artifacts/tests/coreclr/Linux_musl.x64.Checked/Tests/Core_Root/' not found.

I have not found a separate issue about that. Could somebody take a look?

sandreenko avatar Aug 23 '20 07:08 sandreenko