
nativeaot/SmokeTests/Exceptions failing with `Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined()`

Open elinor-fung opened this issue 1 year ago • 3 comments

Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined(), file D:\a\_work\1\s\src\coreclr\gc\gc.cpp, line 6988

Return code:      1
Raw output file:      C:\h\w\B51009A0\w\B29A098A\uploads\Reports\nativeaot.SmokeTests\Exceptions\Exceptions\Exceptions.output.txt
Raw output:
BEGIN EXECUTION
call C:\h\w\B51009A0\p\nativeaottest.cmd C:\h\w\B51009A0\w\B29A098A\e\nativeaot\SmokeTests\Exceptions\Exceptions\ Exceptions.dll 
Exception caught!
Null reference exception in write barrier caught!
Null reference exception caught!
Test Stacktrace with exception on stack:
   at BringUpTest.FilterWithStackTrace(Exception) + 0x28
   at BringUpTest.Main() + 0x31c
   at System.Runtime.EH.FindFirstPassHandler(Object, UInt32, StackFrameIterator&, UInt32&, Byte*&) + 0x188
   at System.Runtime.EH.DispatchEx(StackFrameIterator&, EH.ExInfo&) + 0x161
   at System.Runtime.EH.RhThrowEx(Object, EH.ExInfo&) + 0x4b
   at BringUpTest.Main() + 0xaf

Exception caught via filter!
Expected: 100
Actual: 3
END EXECUTION - FAILED

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=715849
Build error leg or test failing: nativeaot\SmokeTests\Exceptions\Exceptions\Exceptions.cmd
Pull request: https://github.com/dotnet/runtime/pull/103821

Error Message

Fill the error message using the step-by-step known issues guidance.

{
  "ErrorMessage": "Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined()",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Report

Build    Definition       Test                                     Pull Request
722197   dotnet/runtime   nativeaot.SmokeTests.WorkItemExecution   dotnet/runtime#103801
715849   dotnet/runtime   nativeaot.SmokeTests.WorkItemExecution   dotnet/runtime#103821

Summary

24-Hour Hit Count   7-Day Hit Count   1-Month Count
0                   1                 2

elinor-fung avatar Jun 21 '24 20:06 elinor-fung

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Looks like a DATAS race condition. @dotnet/gc could you please take a look?

Note that the nativeaot\SmokeTests\Exceptions test is explicitly opted into server GC to get some coverage for server GC during the default CI run.

jkotas avatar Jun 22 '24 06:06 jkotas

Are there any dumps available? I can't seem to find them. Tried to repro locally to no avail. Seems like it's a low probability assertion failure (2 / month).

mrsharm avatar Jul 05 '24 19:07 mrsharm

Are there any dumps available? I can't seem to find them. Tried to repro locally to no avail. Seems like it's a low probability assertion failure (2 / month).

Yeah, it doesn't look like infra captured a dump for this.

There are 4 hits per month, but we don't have any dedicated server GC testing. This is the one and only test we run with server GC enabled. We rely on CoreCLR testing to catch GC bugs right now (even this test is not really testing server GC - it just verifies that setting the csproj property to enable server GC actually enables it).
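
For reference, the opt-in in a csproj generally looks like the snippet below. This is a sketch using the standard SDK property name; whether the Exceptions test project uses exactly this property is an assumption, since the thread only refers to "the csproj property to enable server GC".

<!-- Standard SDK property for opting a project into server GC (assumed here;
     the test's actual csproj is not quoted in this thread). -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>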

MichalStrehovsky avatar Jul 09 '24 10:07 MichalStrehovsky

@mrsharm @MichalStrehovsky are any dumps available for this, or is there a local repro?

mangod9 avatar Aug 09 '24 20:08 mangod9

I couldn't repro this locally, nor could I get to any dumps. My one guess (a long shot) is that this might be related to the other DATAS race condition we found via the Reliability Framework, where there is a race in GetHeap while change_heap_count is invoked, but without a dump it's difficult to validate.
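
To make the suspected interleaving concrete, here is a minimal, self-contained C++ sketch of that class of race. It is not the dotnet/runtime GC code; all names are invented for illustration. It only models a reader computing a heap index from a heap count that a concurrent resize keeps changing, which is the general shape of inconsistency the gc.cpp assertion (involving n_heaps, a heap_number, and the join state) is checking for.

// Illustrative sketch only -- NOT gc.cpp. Models a reader selecting a heap index
// from a heap count that a concurrent change (a la DATAS change_heap_count) is
// shrinking and growing at the same time.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<int> n_heaps{8};  // stand-in for the current heap count

// Reader: computes heap_number from the count it observed. If the count shrinks
// after the load, heap_number can end up >= the new count.
static void reader(int thread_id, std::atomic<bool>& stop, std::atomic<long>& violations)
{
    while (!stop.load(std::memory_order_relaxed))
    {
        int observed = n_heaps.load(std::memory_order_acquire);
        int heap_number = thread_id % observed;
        std::this_thread::yield();  // simulate work between choosing and using the heap
        if (heap_number >= n_heaps.load(std::memory_order_acquire))
            violations.fetch_add(1, std::memory_order_relaxed);
    }
}

// Resizer: stands in for a heap-count change shrinking and restoring the count.
static void resizer(std::atomic<bool>& stop)
{
    for (int i = 0; i < 20000; ++i)
    {
        n_heaps.store(2, std::memory_order_release);
        std::this_thread::yield();
        n_heaps.store(8, std::memory_order_release);
    }
    stop.store(true, std::memory_order_relaxed);
}

int main()
{
    std::atomic<bool> stop{false};
    std::atomic<long> violations{0};
    std::vector<std::thread> threads;
    threads.emplace_back(resizer, std::ref(stop));
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(reader, t + 4, std::ref(stop), std::ref(violations));
    for (auto& th : threads) th.join();
    // With enough iterations this typically reports a non-zero count, showing how
    // a stale heap count can yield an index inconsistent with the current n_heaps.
    std::printf("stale heap index observed %ld times\n", violations.load());
    return 0;
}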

mrsharm avatar Aug 09 '24 20:08 mrsharm

The Reliability Framework issue was fixed, correct? Looks like this issue reproed today.

mangod9 avatar Aug 09 '24 21:08 mangod9

The Reliability Framework issue was fixed, correct? Looks like this issue reproed today.

It wasn't - I think we were still working on a solution. CC: @Maoni0.

mrsharm avatar Aug 09 '24 21:08 mrsharm

ah ok. We can tag it as such then, and see if the repro stops after that is fixed.

mangod9 avatar Aug 09 '24 22:08 mangod9

I made a fix at https://github.com/dotnet/runtime/pull/106752.

Maoni0 avatar Aug 21 '24 09:08 Maoni0


@cshung, we should wait some time before confirming this issue has truly been fixed - I am observing that the bot is still picking up the same failures.

mrsharm avatar Aug 22 '24 18:08 mrsharm

@mrsharm, wouldn't the bot reopen it if it finds new failures? I was hoping to confirm the fix that way. The builds found by the bot seem to be either from 9.0 or from two days ago.

cshung avatar Aug 22 '24 18:08 cshung