runtime
runtime copied to clipboard
Assert failure(PID 13812 [0x000035f4], Thread: 14128 [0x3730]): promoted_bytes (heap_number) == promoted
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=610011 Build error leg or test failing: GC\API\NoGCRegion\Callback_Svr\Callback_Svr.cmd Pull request: https://github.com/dotnet/runtime/pull/99992
Error Message
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "promoted_bytes (heap_number) == promoted",
"ErrorPattern": "",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
FYI @dotnet/gc
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=610011
Error message validated: [promoted_bytes (heap_number) == promoted]
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 3/20/2024 7:47:36 PM UTC
Report
| Build | Definition | Test | Pull Request |
|---|---|---|---|
| 636085 | dotnet/runtime | PayloadGroup0.WorkItemExecution | dotnet/runtime#100810 |
| 628430 | dotnet/runtime | PayloadGroup0.WorkItemExecution | dotnet/runtime#100575 |
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 0 | 2 |
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries See info in area-owners.md if you want to be subscribed.
@markples since we were chatting about incorrect promoted bytes reporting today. Perhaps this is related?
The other one occurs when a normal GC interferes with the bookkeeping of a BGC and leads to a BGC assertion. This isn't doing that, but perhaps the reverse could occur.
Adding @Maoni0 as well in case this is related to any recent change.
@dotnet/runtime-infrastructure there are many CI job failures for this, but none has a dump attached. It did collect a stack trace locally, so not sure why the dump has not been uploaded?
@mangod9 @markples Can you disable this test today so pipeline runs are clean, and in particular, so all our outerloop pipelines don't have broad failures this weekend?
#100167
This appears to be a compiler bug on x64 windows. dprintfs are nops in this build. The relevant asm is extracted (way) below.
C++ (manually preprocessed a bit):
inline
size_t gc_heap::get_promoted_bytes()
{
if (!survived_per_region)
{
dprintf (...);
return 0;
}
dprintf (...);
size_t promoted = 0;
for (size_t i = 0; i < region_count; i++)
{
if (survived_per_region[i] > 0)
{
heap_segment* region = get_region_at_index (i);
dprintf (...)
promoted += survived_per_region[i];
}
}
dprintf (...);
assert (promoted_bytes (heap_number) == promoted);
return promoted;
Full asm:
00007FFB1FBF51BF mov r8,qword ptr [SVR::gc_heap::region_count (07FFB202188A8h)]
00007FFB1FBF51C6 mov edi,ecx
00007FFB1FBF51C8 mov r12d,dword ptr [__isa_available (07FFB201FD958h)]
00007FFB1FBF51CF nop
{
promoted_bytes_global += g_heaps[i]->get_promoted_bytes();
00007FFB1FBF51D0 mov rax,qword ptr [SVR::gc_heap::g_heaps (07FFB20219198h)]
00007FFB1FBF51D7 mov rbx,rcx
00007FFB1FBF51DA mov r10,qword ptr [rdi+rax]
00007FFB1FBF51DE mov r9,qword ptr [r10+540h]
00007FFB1FBF51E5 test r9,r9
00007FFB1FBF51E8 je SVR::gc_heap::mark_phase+168Ch (07FFB1FBF52CCh)
00007FFB1FBF51EE mov rdx,rcx
00007FFB1FBF51F1 test r8,r8
00007FFB1FBF51F4 je SVR::gc_heap::mark_phase+1628h (07FFB1FBF5268h)
00007FFB1FBF51F6 cmp r8,4
00007FFB1FBF51FA jb SVR::gc_heap::mark_phase+1628h (07FFB1FBF5268h)
00007FFB1FBF51FC cmp r12d,6
00007FFB1FBF5200 jl SVR::gc_heap::mark_phase+1628h (07FFB1FBF5268h)
00007FFB1FBF5202 mov rax,r8
00007FFB1FBF5205 xorps xmm3,xmm3
00007FFB1FBF5208 and rax,0FFFFFFFFFFFFFFFCh
00007FFB1FBF520C xorps xmm4,xmm4
00007FFB1FBF520F xorps xmm2,xmm2
00007FFB1FBF5212 movdqu xmm0,xmmword ptr [r9+rdx*8]
00007FFB1FBF5218 vpminuq xmm1,xmm0,xmm4
00007FFB1FBF521E pcmpeqq xmm1,xmm0
00007FFB1FBF5223 andnps xmm1,xmm6
00007FFB1FBF5226 andnps xmm1,xmm0
00007FFB1FBF5229 movdqu xmm0,xmmword ptr [r9+rdx*8+10h]
00007FFB1FBF5230 add rdx,4
00007FFB1FBF5234 paddq xmm3,xmm1
00007FFB1FBF5238 vpminuq xmm1,xmm0,xmm4
00007FFB1FBF523E pcmpeqq xmm1,xmm0
00007FFB1FBF5243 andnps xmm1,xmm6
00007FFB1FBF5246 andnps xmm1,xmm0
00007FFB1FBF5249 paddq xmm2,xmm1
00007FFB1FBF524D cmp rdx,rax
00007FFB1FBF5250 jb SVR::gc_heap::mark_phase+15D2h (07FFB1FBF5212h)
00007FFB1FBF5252 paddq xmm2,xmm3
00007FFB1FBF5256 movdqa xmm0,xmm2
00007FFB1FBF525A psrldq xmm0,8
00007FFB1FBF525F paddq xmm2,xmm0
00007FFB1FBF5263 movq rbx,xmm2
00007FFB1FBF5268 cmp rdx,r8
00007FFB1FBF526B jae SVR::gc_heap::mark_phase+164Ah (07FFB1FBF528Ah)
00007FFB1FBF526D nop dword ptr [rax]
00007FFB1FBF5270 mov rax,qword ptr [r9+rdx*8]
00007FFB1FBF5274 test rax,rax
00007FFB1FBF5277 lea rcx,[rbx+rax]
00007FFB1FBF527B cmove rcx,rbx
00007FFB1FBF527F inc rdx
00007FFB1FBF5282 mov rbx,rcx
00007FFB1FBF5285 cmp rdx,r8
00007FFB1FBF5288 jb SVR::gc_heap::mark_phase+1630h (07FFB1FBF5270h)
00007FFB1FBF528A mov eax,dword ptr [r10+1598h]
00007FFB1FBF5291 shl eax,4
00007FFB1FBF5294 movsxd rcx,eax
00007FFB1FBF5297 mov rax,qword ptr [SVR::gc_heap::g_promoted (07FFB20218808h)]
00007FFB1FBF529E cmp qword ptr [rax+rcx*8],rbx
00007FFB1FBF52A2 je SVR::gc_heap::mark_phase+168Ah (07FFB1FBF52CAh)
00007FFB1FBF52A4 lea r8,[string "promoted_bytes (heap_number) ==@"... (07FFB2008B450h)]
00007FFB1FBF52AB mov edx,5EAAh
00007FFB1FBF52B0 lea rcx,[string "D:\\a\\_work\\1\\s\\src\\coreclr\\gc\\g@"... (07FFB20088190h)]
00007FFB1FBF52B7 call DbgAssertDialog (07FFB1FCEE980h)
Relevant asm:
00007FFB1FBF5205 xorps xmm3,xmm3
00007FFB1FBF5208 and rax,0FFFFFFFFFFFFFFFCh
00007FFB1FBF520C xorps xmm4,xmm4
00007FFB1FBF520F xorps xmm2,xmm2
// loop
00007FFB1FBF5212 movdqu xmm0,xmmword ptr [r9+rdx*8]
00007FFB1FBF5218 vpminuq xmm1,xmm0,xmm4
00007FFB1FBF521E pcmpeqq xmm1,xmm0
00007FFB1FBF5223 andnps xmm1,xmm6
00007FFB1FBF5226 andnps xmm1,xmm0
00007FFB1FBF5229 movdqu xmm0,xmmword ptr [r9+rdx*8+10h]
00007FFB1FBF5230 add rdx,4
00007FFB1FBF5234 paddq xmm3,xmm1
00007FFB1FBF5238 vpminuq xmm1,xmm0,xmm4
00007FFB1FBF523E pcmpeqq xmm1,xmm0
00007FFB1FBF5243 andnps xmm1,xmm6
00007FFB1FBF5246 andnps xmm1,xmm0
00007FFB1FBF5249 paddq xmm2,xmm1
00007FFB1FBF524D cmp rdx,rax
00007FFB1FBF5250 jb SVR::gc_heap::mark_phase+15D2h (07FFB1FBF5212h)
If I'm reading it correctly, the minimum is always going to be zero due to xmm4 being zero. The compare will generate all 1s if the array value is zero, all 0s otherwise. I haven't found the definition of xmm6 (but from other examples is likely to be all 1s) or what the double and-not is doing. Perhaps replacing both of those with andnps xmm1 xmm0 is the intention? (though also note that all of this is to prevent the addition of zero into the sum)
[edit: details in last paragraph]
adding a bit of detail why we suspected this was a compiler issue to begin with - we could see the 2 values were indeed the same if the calculation in the loop was done correctly. I looked at multiple repros and if I added up what's in survived_per_region for that heap for all regions (which would be the value of promoted, it was always the same as promoted_bytes for that heap. an example from the dump here -
0:006> ?? coreclr!SVR::gc_heap::g_heaps[1]->survived_per_region
unsigned int64 * 0x000001fe`43dac1f0
0:006> dq 01fe`43dac1f0 01fe`43dac1f0+2a*8 <-- coreclr!SVR::gc_heap::region_count is always 2a
000001fe`43dac1f0 00000000`00000000 00000000`00000000
000001fe`43dac200 00000000`00000000 00000000`00000000
000001fe`43dac210 00000000`00000000 00000000`00000000
000001fe`43dac220 00000000`00000000 00000000`00000000
000001fe`43dac230 00000000`00000000 00000000`00000000
000001fe`43dac240 00000000`00000000 00000000`00000000
000001fe`43dac250 00000000`00000000 00000000`00000000
000001fe`43dac260 00000000`00000000 00000000`00000000
000001fe`43dac270 00000000`00000000 00000000`00000000
000001fe`43dac280 00000000`00000000 00000000`00000000
000001fe`43dac290 00000000`00000000 00000000`00000450 <-- in this particular case this is the only non-zero value, in some cases there are multiple regions with non-zero values.
000001fe`43dac2a0 00000000`00000000 00000000`00000000
000001fe`43dac2b0 00000000`00000000 00000000`00000000
000001fe`43dac2c0 00000000`00000000 00000000`00000000
000001fe`43dac2d0 00000000`00000000 00000000`00000000
000001fe`43dac2e0 00000000`00000000 00000000`00000000
000001fe`43dac2f0 00000000`00000000 00000000`00000000
000001fe`43dac300 00000000`00000000 00000000`00000000
000001fe`43dac310 00000000`00000000 00000000`00000000
000001fe`43dac320 00000000`00000000 00000000`00000000
000001fe`43dac330 00000000`00000000 00000000`00000000
000001fe`43dac340 00000000`00000000
0:006> x coreclr!*g_promoted
00007ffe`b652efb8 coreclr!WKS::gc_heap::g_promoted = 0
00007ffe`b6528808 coreclr!SVR::gc_heap::g_promoted = 0x000001bd`aca8e860
0:006> dq 01bd`aca8e860+10*8 l1
000001bd`aca8e8e0 00000000`00000450
but in the debugger promoted is 0. when it does this compare
cmp qword ptr [rax+rcx*8],rbx
rbx which is promoted is 0, so the assert was triggered.
Failed in: runtime-coreclr jitstress 20240321.2
Failed tests:
coreclr windows x64 Checked disabler2r @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress2 @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress1_tiered @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress2_tiered @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitminopts @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked jitstress1 @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
coreclr windows x64 Checked tailcallstress @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
Error message:
Assert failure(PID 17192 [0x00004328], Thread: 17220 [0x4344]): promoted_bytes (heap_number) == promoted
CORECLR! SVR::gc_heap::mark_phase + 0x167C (0x00007ffe`2f264ebc)
CORECLR! SVR::gc_heap::gc1 + 0x240 (0x00007ffe`2f256c40)
CORECLR! SVR::gc_heap::garbage_collect + 0x99B (0x00007ffe`2f2569ab)
CORECLR! SVR::gc_heap::gc_thread_function + 0x504 (0x00007ffe`2f258354)
CORECLR! SVR::gc_heap::gc_thread_stub + 0x7E (0x00007ffe`2f2587fe)
CORECLR! <lambda_db5353710d144cc7c2d19ec9a78e4018>::<lambda_invoker_cdecl> + 0x74 (0x00007ffe`2ef1fe04)
KERNEL32! BaseThreadInitThunk + 0x14 (0x00007ffe`600e84d4)
NTDLL! RtlUserThreadStart + 0x21 (0x00007ffe`601e1791)
File: D:\a\_work\1\s\src\coreclr\gc\gc.cpp:24234
Image: C:\h\w\B78B099A\p\corerun.exe
Return code: 1
Raw output file: C:\h\w\B78B099A\w\B19C09A9\uploads\Reports\GC.API\NoGCRegion\Callback_Svr\Callback_Svr.output.txt
Raw output:
BEGIN EXECUTION
"C:\h\w\B78B099A\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true" Callback_Svr.dll
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test failed. Trying to see if dump file was created in C:\cores since 3/22/2024 6:04:51 AM
Processing C:\cores\corerun.exe.17192.dmp
Invoking: C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe -c "$<C:\h\w\B78B099A\t\tmpmuespb.tmp" -z "C:\cores\corerun.exe.17192.dmp"
stdout:
Microsoft (R) Windows Debugger Version 10.0.18362.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\cores\corerun.exe.17192.dmp]
User Mini Dump File with Full Memory: Only application data is available
************* Path validation summary **************
Response Time (ms) Location
OK C:\h\w\B78B099A\p\PDB
Symbol search path is: C:\h\w\B78B099A\p\PDB
Executable search path is:
Windows 10 Version 14393 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
10.0.14393.6343 (rs1_release.230913-1727)
Machine Name:
Debug session time: Fri Mar 22 06:04:56.000 2024 (UTC + 0:00)
System Uptime: 0 days 0:26:43.404
Process Uptime: 0 days 0:00:04.000
..............................
************* Symbol Loading Error Summary **************
Module name Error
ntdll The system cannot find the file specified
You can troubleshoot most symbol related issues by turning on symbol loading diagnostics (!sym noisy) and repeating the command that caused symbols to be loaded.
You should also verify that your symbol search path (.sympath) is correct.
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(4328.4344): Unknown exception - code c0000602 (first/second chance not available)
For analysis of this file, run !analyze -v
*** WARNING: Unable to verify checksum for coreclr.dll
KERNELBASE!RaiseFailFastException+0xae:
00007ffe`5d5d96ce 803d4d78120000 cmp byte ptr [KERNELBASE!local_unwind+0x7fe2 (00007ffe`5d700f22)],0 ds:00007ffe`5d700f22=00
0:006> cdb: Reading initial command '$<C:\h\w\B78B099A\t\tmpmuespb.tmp'
0:006> .load C:\Users\runner\.dotnet\sos\sos.dll
0:006> ~*k
0 Id: 4328.432c Suspend: 0 Teb: 000000de`0e18d000 Unfrozen
Child-SP RetAddr Call Site
000000de`0e37db48 00007ffe`5d536d1f ntdll!ZwWaitForSingleObject+0x14
000000de`0e37db50 00007ffe`2f27efc2 KERNELBASE!WaitForSingleObjectEx+0x8f
000000de`0e37dbf0 00007ffe`2f234929 coreclr!SVR::gc_heap::wait_for_gc_done+0x1b2
000000de`0e37dc50 00007ffe`2f27a921 coreclr!SVR::GCHeap::GarbageCollectGeneration+0x359
000000de`0e37dca0 00007ffe`2f233430 coreclr!SVR::gc_heap::trigger_gc_for_alloc+0x71
(Inline Function) --------`-------- coreclr!SVR::gc_heap::try_allocate_more_space+0xca
(Inline Function) --------`-------- coreclr!SVR:
Stack trace:
at GC_API._NoGCRegion_Callback_Svr_Callback_Svr_._NoGCRegion_Callback_Svr_Callback_Svr_cmd()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
More failures: runtime-coreclr crossgen2 20240321.1
R2R-CG2 windows x64 Checked no_tiered_compilation @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
R2R-CG2 windows x64 Checked @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
runtime-coreclr pgo 20240322.1
coreclr windows x64 Checked defaultpgo @ Windows.10.Amd64.Open
- GC\\API\\NoGCRegion\\Callback_Svr\\Callback_Svr.cmd
This is having fairly widespread impact, would appreciate it if we could get some kind of mitigation in place soon.
Did https://github.com/dotnet/runtime/pull/100167 not work?
Oh, it looks like https://github.com/dotnet/runtime/issues/100217 is the same symptom on a different test.
yeah we will need to workaround by disabling optimizations for this method in the interim. Very likely this was triggered by the compiler update from last week.
This affects servicing branches as well. Seen in a 8.0-staging PR:
https://github.com/dotnet/runtime/pull/100426
we plan to move to Release version of the toolset. PR for backport: https://github.com/dotnet/runtime/pull/100622
removing blocking-clean-ci as it has not failed in 30 days
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 0 | 0 |
fixed - see #100598 and #101995