runtime
Dead lettering tests
Build Information
Build: https://dev.azure.com/dnceng-public/public/public%20Team/_build/results?buildId=654910
Build error leg or test failing: browser-wasm windows Release LibraryTests_Smoke_AOT
Error Message
{
"ErrorMessage" : "If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.",
"BuildRetry" : false,
"ExcludeConsoleLog" : false
}
- PR: https://github.com/dotnet/runtime/pull/101498
- Queue: net9.0-browser-Release-wasm-Mono_Release-WasmTestOnV8
- Job result: https://dev.azure.com/dnceng-public/public/public%20Team/_build/results?buildId=654910&view=logs&j=0eae1e18-ea37-5071-1d48-9384b3d4e672&t=7c50f1ca-0ae5-5055-6835-fe157eafc276&l=59
- Log file: https://helix.dot.net/api/2019-06-17/jobs/528aca7b-21fa-4ecf-8f47-4fa3e533bc93/workitems/WasmTestOnV8-System.Runtime.Tests/console
- Output:
If you’re reading this, that means the Helix work item you’re trying to find the logs for has dead-lettered.
What this means:
- All attempts to retry execution of this work item were unable to complete. This can be both for infrastructure reasons (problems within Azure) or issues with the work item (for instance, causing a machine to reboot unexpectedly or killing the Helix client on the machine will force a retry).
- No further work will be done for this specific work item, and its exit code is set to an artificial -1 (since it did not complete, there is no real exit code).
Common causes:
- Disabled queue (end-of-life Helix queues are automatically forwarded to deadletter and will fail instantly)
- Unhealthy Helix Client machine(s)
- Queue has been backed up heavily by a large amount of work and was manually purged by the engineering team
- Azure issues (e.g. Service Bus is overloaded)
- Malformed payloads; if Helix cannot download and unzip all payloads successfully, work will retry until dead-lettered.
For follow up:
- Check if your Helix Queue is still enabled, either via the metadata you see by browsing to https://helix.dot.net/api/info/queues?api-version=2019-06-17 or recent emails from the .NET Engineering Infrastructure team.
- Check that all work item payloads are accessible using a browser.
- If you are sending to a non-disabled queue and find this error repeatedly occurring, please contact the dnceng team.
- If a single, specific work item dead letters and others do not, consider local debugging; it may be causing spontaneous reboot (or trigging one intentionally).
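For triage, a dead-lettered work item can be told apart from a real test failure by looking for the placeholder text above in the console log. The following is a minimal sketch, not part of Helix or Build Analysis; the function name `is_dead_lettered` and the choice of marker phrase are assumptions made for illustration.

```python
# Hypothetical helper: the marker phrase is taken from the placeholder
# message quoted above; nothing here is an official Helix API.
DEAD_LETTER_MARKER = "has dead-lettered"


def is_dead_lettered(console_text: str) -> bool:
    """Return True if the console log contains the dead-letter placeholder."""
    # Compare case-insensitively so minor formatting changes do not matter.
    return DEAD_LETTER_MARKER in console_text.lower()


if __name__ == "__main__":
    sample = (
        "If you’re reading this, that means the Helix work item you’re "
        "trying to find the logs for has dead-lettered."
    )
    print(is_dead_lettered(sample))
```

A log that starts normally and then stops, like the second failure described further down, would not match this check and needs to be looked at separately.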
Report
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 13 | 125 | 479 |
Tagging subscribers to 'arch-wasm': @lewing See info in area-owners.md if you want to be subscribed.
In the same PR's build where the above dead-letter failure was found, there's another run on a similar Release queue. It does not dead-letter immediately; it manages to print a couple of lines, then dies:
- Build error leg: browser-wasm windows Release LibraryTests_Smoke_AOT
- Queue: net9.0-browser-Release-wasm-Mono_Release-WasmTestOnChrome-
- Job result: https://dev.azure.com/dnceng-public/public/_build/results?buildId=654910&view=logs&j=bf575327-8861-5da8-d33e-e12ac4086c09&t=c85eeb3c-de58-5b19-fbd6-7d6b5ac3e88c
- Log file: https://helix.dot.net/api/2019-06-17/jobs/2736decc-83a1-4c32-9e0a-ef543e0d26f3/workitems/WasmTestOnChrome-System.Runtime.Tests/console
- Output:
Console log: 'WasmTestOnChrome-System.Runtime.Tests' from job 2736decc-83a1-4c32-9e0a-ef543e0d26f3 (windows.amd64.server2022.open.rt) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:windowsservercore-ltsc2022-helix-webassembly on a000NF0
running %HELIX_CORRELATION_PAYLOAD%\scripts\be8b1ad5c1e9498d89709f26e508c549\execute.cmd in C:\h\w\B0E6099F\w\A1EF089A\e max 3600 seconds
^ It just dies after printing the second line.
I do not want to open a KnownBuildError issue for this specific failure, as it would end up grouping unrelated failures together. I am concerned that people are going to get blocked on getting their PRs merged because they cannot bypass the merge-on-green restriction. For example, this PR is already blocked and I will only be able to merge it if I JIT elevate myself: https://github.com/dotnet/runtime/pull/101498 @JulieLeeMSFT @hoyosjs @jkoritzinsky
People should be able to use https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/failure-analysis.md#bypassing-build-analysis . Have you tried that before JIT elevating?
I was not aware of that. Thanks for sharing! I'll try it next time.
What does dead-lettering mean in this context? What is the case where this fails?
iirc Dead Lettering is usually infrastructure related, is that correct @steveisok ?
We often see queues fall over around branch time
Correct. @ilyas1974 is queue dead lettering manually driven or is there some automation involved?
Tagging subscribers to this area: @dotnet/runtime-infrastructure See info in area-owners.md if you want to be subscribed.
For your dead-lettering question, the answer is yes: it's both a manual and an automated process. We manually dead-letter a queue so the changes are immediate. We then add the dead-letter information to the Helix configuration, so it persists whenever we make changes to Helix.
OK, so is the expectation that tests should be re-run when dead-lettering happens? That, basically, the run was invalid?
I've removed the wasm references in the labels and title bits because wasm is no longer dominating the failures in any way (with the exception of preview4 which has known problems that are fixed in main)
Bypassing these tests doesn't seem appropriate. I'm closing the issue.