Dead lettering tests

Open carlossanlop opened this issue 1 year ago • 11 comments

Build Information

Build: https://dev.azure.com/dnceng-public/public/public%20Team/_build/results?buildId=654910
Build error leg or test failing: browser-wasm windows Release LibraryTests_Smoke_AOT

Error Message

{
  "ErrorMessage" : "If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.",
  "BuildRetry" : false,
  "ExcludeConsoleLog" : false
}
  • PR: https://github.com/dotnet/runtime/pull/101498
  • Queue: net9.0-browser-Release-wasm-Mono_Release-WasmTestOnV8
  • Job result: https://dev.azure.com/dnceng-public/public/public%20Team/_build/results?buildId=654910&view=logs&j=0eae1e18-ea37-5071-1d48-9384b3d4e672&t=7c50f1ca-0ae5-5055-6835-fe157eafc276&l=59
  • Log file: https://helix.dot.net/api/2019-06-17/jobs/528aca7b-21fa-4ecf-8f47-4fa3e533bc93/workitems/WasmTestOnV8-System.Runtime.Tests/console
  • Output:
If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.

What this means:

- All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).

Common causes:

- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.

For follow up:

- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or trigging one intentionally).
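
The first follow-up step (checking whether a queue is still enabled) can be scripted. What follows is only a minimal sketch: the URL is the one quoted in the message above, but the field names `QueueId` and `IsAvailable` are assumptions about the shape of the response, and `fetch_queues`, `find_queue`, and `is_queue_enabled` are hypothetical helper names. Check the real payload before relying on it.

```python
import json
import urllib.request
from typing import Any, Iterable, Optional

# URL taken from the Helix dead-letter message.
QUEUE_INFO_URL = "https://helix.dot.net/api/info/queues?api-version=2019-06-17"


def fetch_queues(url: str = QUEUE_INFO_URL) -> list:
    """Download the queue metadata (assumed to be a JSON list of objects)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def find_queue(queues: Iterable[dict], queue_id: str) -> Optional[dict]:
    """Return the metadata entry for queue_id, or None if it is not listed."""
    wanted = queue_id.lower()
    for entry in queues:
        # "QueueId" is an assumed field name; adjust it to the real payload.
        if str(entry.get("QueueId", "")).lower() == wanted:
            return entry
    return None


def is_queue_enabled(queues: Iterable[dict], queue_id: str) -> bool:
    """Treat a queue as enabled only if it is listed and flagged available."""
    entry = find_queue(queues, queue_id)
    # "IsAvailable" is likewise an assumption about the response shape.
    return bool(entry and entry.get("IsAvailable", False))
```

For example, `is_queue_enabled(fetch_queues(), "windows.amd64.server2022.open.rt")` would check the queue named in the console log further down. A queue that is missing from the list is reported as not enabled, which is the conservative reading.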

Report

Build Definition Test Pull Request
673269 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#102139
673132 dotnet/runtime System.IO.Compression.Tests.WorkItemExecution dotnet/runtime#102098
673038 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#96895
672871 dotnet/runtime System.Runtime.Tests.WorkItemExecution
672864 dotnet/runtime System.Runtime.Tests.WorkItemExecution
672884 dotnet/runtime System.Runtime.Tests.WorkItemExecution
672876 dotnet/runtime System.Runtime.Tests.WorkItemExecution
672869 dotnet/runtime System.IO.FileSystem.Net5Compat.Tests.WorkItemExecution
672870 dotnet/runtime System.IO.Hashing.Tests.WorkItemExecution
672883 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
672880 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
672882 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
672865 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
672779 dotnet/runtime chrome-DebuggerTests.DateTimeTestsEFIGS.WorkItemExecution dotnet/runtime#102119
672783 dotnet/runtime Microsoft.Extensions.Logging.Tests.WorkItemExecution dotnet/runtime#102059
672788 dotnet/runtime System.Threading.Tests.WorkItemExecution dotnet/runtime#102126
672661 dotnet/runtime Regressions.WorkItemExecution dotnet/runtime#102115
672767 dotnet/runtime WasmTestOnChrome-ST-System.Linq.Expressions.Tests.WorkItemExecution dotnet/runtime#102091
672758 dotnet/runtime System.Data.DataSetExtensions.Tests.WorkItemExecution dotnet/runtime#102103
672743 dotnet/runtime System.Globalization.Extensions.Tests.WorkItemExecution dotnet/runtime#102117
672730 dotnet/runtime NoWorkload-ST-Wasm.Build.Tests.WorkItemExecution
672676 dotnet/runtime System.IO.Compression.ZipFile.Tests.WorkItemExecution dotnet/runtime#100823
672709 dotnet/runtime WasmTestOnChrome-ST-System.Data.Common.Tests.WorkItemExecution dotnet/runtime#102084
672712 dotnet/runtime tracing.WorkItemExecution dotnet/runtime#102101
672669 dotnet/runtime System.IO.Compression.Brotli.Tests.WorkItemExecution dotnet/runtime#101461
672696 dotnet/runtime Interop.WorkItemExecution dotnet/runtime#102098
672686 dotnet/runtime System.ComponentModel.TypeConverter.Tests.WorkItemExecution dotnet/runtime#101701
672618 dotnet/runtime System.IO.Compression.Tests.WorkItemExecution dotnet/runtime#101020
672574 dotnet/runtime System.IO.Compression.Brotli.Tests.WorkItemExecution dotnet/runtime#101717
672553 dotnet/runtime System.IO.Compression.ZipFile.Tests.WorkItemExecution dotnet/runtime#101977
672558 dotnet/runtime System.IO.Compression.ZipFile.Tests.WorkItemExecution dotnet/runtime#101512
672537 dotnet/runtime System.Diagnostics.DiagnosticSource.Switches.Tests.WorkItemExecution dotnet/runtime#101975
672483 dotnet/runtime System.ComponentModel.Annotations.Tests.WorkItemExecution dotnet/runtime#96895
672478 dotnet/runtime Managed.WorkItemExecution dotnet/runtime#102101
672498 dotnet/runtime Microsoft.Bcl.TimeProvider.Tests.WorkItemExecution
672490 dotnet/runtime System.Threading.Channels.Tests.WorkItemExecution
672505 dotnet/runtime System.Text.Encoding.CodePages.Tests.WorkItemExecution
672488 dotnet/runtime Microsoft.Extensions.Configuration.Tests.WorkItemExecution
672486 dotnet/runtime System.DirectoryServices.Tests.WorkItemExecution
672506 dotnet/runtime Microsoft.Extensions.Configuration.CommandLine.Tests.WorkItemExecution
672487 dotnet/runtime Microsoft.Win32.SystemEvents.Tests.WorkItemExecution
672492 dotnet/runtime System.Numerics.Tests.ToStringTest.ToString_ValidLargeFormat
672494 dotnet/runtime System.Net.NetworkInformation.Tests.PingTest.SendPingToExternalHostWithLowTtlTest
672496 dotnet/runtime System.Tests.DoubleTests.ToString_ValidLargeFormat
672449 dotnet/runtime WasmTestOnChrome-MT-System.Diagnostics.Tracing.Tests.WorkItemExecution
672443 dotnet/runtime WasmTestOnV8-ST-System.Net.Primitives.UnitTests.Tests.WorkItemExecution dotnet/runtime#101982
671841 dotnet/runtime Microsoft.Extensions.Options.SourceGeneration.Unit.Tests.WorkItemExecution
671836 dotnet/runtime Regression_3.WorkItemExecution
671766 dotnet/runtime Microsoft.Extensions.Configuration.CommandLine.Tests.WorkItemExecution dotnet/runtime#101938
671762 dotnet/runtime System.Memory.Data.Tests.WorkItemExecution dotnet/runtime#101977
671701 dotnet/runtime WasmTestOnV8-System.Runtime.Tests.WorkItemExecution
671680 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#102085
671639 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#102081
671395 dotnet/runtime System.Runtime.Tests.WorkItemExecution
671392 dotnet/runtime System.Runtime.Tests.WorkItemExecution
671411 dotnet/runtime System.Runtime.Tests.WorkItemExecution
671416 dotnet/runtime System.IO.FileSystem.Net5Compat.Tests.WorkItemExecution
671390 dotnet/runtime System.Runtime.Tests.WorkItemExecution
671405 dotnet/runtime System.IO.Hashing.Tests.WorkItemExecution
671436 dotnet/runtime System.Runtime.Extensions.Tests.WorkItemExecution dotnet/runtime#101801
671401 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
671408 dotnet/runtime System.Dynamic.Runtime.Tests.WorkItemExecution
671402 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
671394 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
670824 dotnet/runtime System.Xml.Linq.Properties.Tests.WorkItemExecution dotnet/runtime#101580
665394 dotnet/runtime System.IO.Tests.WorkItemExecution dotnet/runtime#101822
670443 dotnet/runtime System.Runtime.Serialization.Json.Tests.WorkItemExecution dotnet/runtime#101801
670356 dotnet/runtime System.Runtime.Tests.WorkItemExecution
670362 dotnet/runtime System.Runtime.Tests.WorkItemExecution
670349 dotnet/runtime System.Runtime.Tests.WorkItemExecution
670346 dotnet/runtime System.Runtime.Tests.WorkItemExecution
670348 dotnet/runtime System.IO.Hashing.Tests.WorkItemExecution
670344 dotnet/runtime System.IO.FileSystem.Tests.WorkItemExecution
670366 dotnet/runtime System.Diagnostics.TraceSource.Tests.WorkItemExecution
670364 dotnet/runtime System.Diagnostics.Tracing.Tests.WorkItemExecution
670345 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
670357 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
670251 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#101681
670248 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#102039
669277 dotnet/runtime System.Runtime.Tests.WorkItemExecution
669292 dotnet/runtime System.Runtime.Tests.WorkItemExecution
669394 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#101717
669270 dotnet/runtime System.Runtime.Tests.WorkItemExecution
669269 dotnet/runtime System.Runtime.Tests.WorkItemExecution
669279 dotnet/runtime System.IO.Hashing.Tests.WorkItemExecution
669295 dotnet/runtime System.IO.FileSystem.Primitives.Tests.WorkItemExecution
669289 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
669284 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
669275 dotnet/runtime System.Formats.Cbor.Tests.WorkItemExecution
669281 dotnet/runtime System.Formats.Tar.Manual.Tests.WorkItemExecution
669067 dotnet/runtime System.Text.Encodings.Web.Tests.WorkItemExecution dotnet/runtime#101891
668830 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#102001
668694 dotnet/runtime System.Runtime.Tests.WorkItemExecution
668586 dotnet/runtime Microsoft.Extensions.Http.Tests.WorkItemExecution dotnet/runtime#101977
668424 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#101940
668063 dotnet/runtime System.Runtime.Tests.WorkItemExecution
668044 dotnet/runtime System.Runtime.Tests.WorkItemExecution
668061 dotnet/runtime System.Runtime.Tests.WorkItemExecution
668062 dotnet/runtime System.Runtime.Tests.WorkItemExecution
668065 dotnet/runtime System.IO.FileSystem.Net5Compat.Tests.WorkItemExecution
Displaying 100 of 479 results
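
To see which work items dominate a report like the one above, the rows can be tallied by test name. This is a rough sketch that assumes each row is whitespace-separated in the order shown (build id, definition, test name, optional pull request); `tally_tests` is a hypothetical helper, not part of any Build Analysis tooling.

```python
from collections import Counter
from typing import Iterable


def tally_tests(rows: Iterable[str]) -> Counter:
    """Count how often each test name appears in the report rows."""
    counts: Counter = Counter()
    for row in rows:
        parts = row.split()
        # Expected layout: build id, definition, test name, optional PR.
        # Skip headers and footers, whose first field is not a build number.
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        counts[parts[2]] += 1
    return counts
```

Calling `tally_tests(report_lines).most_common(5)` on the pasted report lists the five most frequent test names, which makes it easier to judge whether a single area dominates the failures.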

Summary

24-Hour Hit Count: 13
7-Day Hit Count: 125
1-Month Count: 479

carlossanlop avatar Apr 25 '24 02:04 carlossanlop

Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.

In the same PR's build where the above dead-letter failure was found, there's another run for a similar queue, but it is Release. It does not dead-letter immediately: it manages to print a couple of lines, then dies:

  • Build error leg: browser-wasm windows Release LibraryTests_Smoke_AOT
  • Queue: net9.0-browser-Release-wasm-Mono_Release-WasmTestOnChrome-
  • Job result: https://dev.azure.com/dnceng-public/public/_build/results?buildId=654910&view=logs&j=bf575327-8861-5da8-d33e-e12ac4086c09&t=c85eeb3c-de58-5b19-fbd6-7d6b5ac3e88c
  • Log file: https://helix.dot.net/api/2019-06-17/jobs/2736decc-83a1-4c32-9e0a-ef543e0d26f3/workitems/WasmTestOnChrome-System.Runtime.Tests/console
  • Output:
Console log: 'WasmTestOnChrome-System.Runtime.Tests' from job 2736decc-83a1-4c32-9e0a-ef543e0d26f3 (windows.amd64.server2022.open.rt) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:windowsservercore-ltsc2022-helix-webassembly on a000NF0
running %HELIX_CORRELATION_PAYLOAD%\scripts\be8b1ad5c1e9498d89709f26e508c549\execute.cmd in C:\h\w\B0E6099F\w\A1EF089A\e max 3600 seconds

^ It just dies after printing the second line.

I do not want to open a KnownBuildError issue for this specific failure as it would end up grouping anything. I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge on green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself: https://github.com/dotnet/runtime/pull/101498 @JulieLeeMSFT @hoyosjs @jkoritzinsky

carlossanlop avatar Apr 25 '24 02:04 carlossanlop

> I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge on green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself:

People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?

jkotas avatar Apr 25 '24 05:04 jkotas

> People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?

I was not aware of that. Thanks for sharing! I'll try it next time.

carlossanlop avatar Apr 25 '24 15:04 carlossanlop

What does dead-lettering mean in this context? What is the case where this fails?

agocke avatar Apr 25 '24 16:04 agocke

iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?

We often see queues fall over around branch time

> If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.
>
> What this means:
>
> - All attempts to retry execution of this work item were unable to complete.  This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
> - No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).
>
> Common causes:
>
> - Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
> - Unhealthy Helix Client machine(s)
> - Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
> - Azure issues (e.g. Service Bus is overloaded)
> - Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.
>
> For follow up:
>
> - Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
> - Check that all work item payloads are accessible using a browser.
> - If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
> - If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or trigging one intentionally).

lewing avatar Apr 25 '24 22:04 lewing

> iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?
>
> We often see queues fall over around branch time

Correct. @ilyas1974 is queue dead lettering manually driven or is there some automation involved?

steveisok avatar Apr 25 '24 22:04 steveisok

Tagging subscribers to this area: @dotnet/runtime-infrastructure See info in area-owners.md if you want to be subscribed.

For your dead lettering question, the answer is yes: it's both a manual and an automated process. We manually deadletter a queue so the changes are immediate. We then add the deadletter information to the Helix configuration, so it is persistent whenever we make changes to Helix.

ilyas1974 avatar Apr 26 '24 15:04 ilyas1974

Ok, so is the expectation that tests should be re-run when deadlettering happens? That, basically, the run was invalid?

agocke avatar Apr 26 '24 19:04 agocke

I've removed the wasm references in the labels and title bits because wasm is no longer dominating the failures in any way (with the exception of preview4 which has known problems that are fixed in main)

lewing avatar May 02 '24 18:05 lewing

Bypassing these tests doesn't seem appropriate. I'm closing the issue.

agocke avatar Jun 18 '24 03:06 agocke