Segmentation fault in System.Text.RegularExpressions.Tests
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=431489
Build error leg or test failing: Libraries Test Run release coreclr osx x64 Release
Pull request: N/A
Error Message
Fill the error message using step by step known issues guidance.
{
"ErrorMessage": "",
"ErrorPattern": "Segmentation fault.*System.Text.RegularExpressions.Tests",
"BuildRetry": false,
"ExcludeConsoleLog": false
}
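For reference, the `ErrorPattern` above is a regular expression that the known-issues tooling searches for in the build's console log. Below is a minimal sketch of that matching, assuming ordinary regex-search semantics (the exact engine the tooling uses isn't shown here), against a shortened version of the segfault line that appears in the Helix log further down:

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // The ErrorPattern from the known-issue JSON above.
    std::regex pattern("Segmentation fault.*System.Text.RegularExpressions.Tests");

    // A shortened stand-in for the console-log line from the Helix log below.
    std::string logLine =
        "./RunTests.sh: line 204: 48718 Segmentation fault: 11 "
        "... xunit.console.dll System.Text.RegularExpressions.Tests.dll ...";

    // regex_search matches anywhere in the line, so the pattern only needs
    // to cover the interesting middle portion.
    std::cout << (std::regex_search(logLine, pattern) ? "matched" : "no match")
              << std::endl;
    return 0;
}
```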
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=431489
Error message validated: Segmentation fault.*System.Text.RegularExpressions.Tests
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 10/9/2023 11:24:45 AM UTC
Report
Summary
24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
---|---|---|
3 | 14 | 68 |
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions. See info in area-owners.md if you want to be subscribed.
Looks similar to https://github.com/dotnet/runtime/issues/85046
Some detail from the logs; no dump:
Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.Text.RegularExpressions.Tests (found 329 of 357 test cases)
Starting: System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 4)
./RunTests.sh: line 204: 48718 Segmentation fault: 11 "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json --depsfile System.Text.RegularExpressions.Tests.deps.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/B0D30970/w/B6A909D4/e
----- end Mon Oct 9 04:57:38 EDT 2023 ----- exit code 139 ----------------------------------------------------------
exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.
ulimit -c value: 0
+ export _commandExitCode=139
+ _commandExitCode=139
+ /usr/local/bin/python3 /tmp/helix/working/B0D30970/p/reporter/run.py https://dev.azure.com/dnceng-public/ public 9545048 <Helix access token redacted>
2023-10-09T08:57:46.052Z INFO run.py run(48) main Beginning reading of test results.
2023-10-09T08:57:46.054Z INFO run.py __init__(42) read_results Searching '/private/tmp/helix/working/B0D30970/w/B6A909D4/e' for test results files
2023-10-09T08:57:46.056Z INFO run.py __init__(42) read_results Searching '/tmp/helix/working/B0D30970/w/B6A909D4/uploads' for test results files
2023-10-09T08:57:46.057Z WARNING run.py __init__(55) read_results No results file found in any of the following formats: xunit, junit, trx
2023-10-09T08:57:46.058Z INFO run.py packing_test_reporter(30) report_results Packing 0 test reports to '/tmp/helix/working/B0D30970/w/B6A909D4/e/__test_report.json'
2023-10-09T08:57:46.058Z INFO run.py packing_test_reporter(33) report_results Packed 1553 bytes
+ /usr/local/bin/python3 /tmp/helix/working/B0D30970/p/gen-debug-dump-docs.py -buildid 431489 -workitem System.Text.RegularExpressions.Tests -jobid 5472e3d3-beb1-49db-af99-5d3100d2a736 -outdir /tmp/helix/working/B0D30970/w/B6A909D4/uploads -templatedir /tmp/helix/working/B0D30970/p -dumpdir /cores -productver 9.0.0
Did not find dumps, skipping dump docs generation.
+ exit 139
['System.Text.RegularExpressions.Tests' END OF WORK ITEM LOG: Command exited with 139]
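Two details in the log above are worth decoding. An exit code of 139 is the shell's encoding of death-by-signal (128 + signal number, and SIGSEGV is 11), and `ulimit -c value: 0` means the core-size limit in that shell was zero, so no core dump could be written there. A minimal POSIX C++ sketch of both mechanisms, independent of the Helix scripts:

```cpp
#include <csignal>
#include <cstdio>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t child = fork();
    if (child == 0) {
        // Equivalent of `ulimit -c unlimited`: allow a core dump when this
        // process dies on a fatal signal. (On macOS cores typically land in
        // /cores, matching the -dumpdir in the log above; running this demo
        // may therefore actually write a core file.)
        rlimit core{RLIM_INFINITY, RLIM_INFINITY};
        setrlimit(RLIMIT_CORE, &core);

        raise(SIGSEGV); // simulate the crash
        _exit(0);       // not reached
    }

    int status = 0;
    waitpid(child, &status, 0);
    if (WIFSIGNALED(status)) {
        // A shell reports this as exit code 128 + signal number, which is
        // where the "exit code 139" in the Helix log comes from.
        std::printf("killed by signal %d -> shell exit code %d\n",
                    WTERMSIG(status), 128 + WTERMSIG(status));
    }
    return 0;
}
```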
Both this and https://github.com/dotnet/runtime/issues/85046 occurred on OSX.
Without a dump it'll be impossible to make progress on this. It's also very unlikely to be in regex itself, and much more likely to be an issue either in the span-related functionality regex sits on top of, or in codegen / the runtime.
I was able to capture a core dump on my local Mac using the Helix artifacts, but it is 8GB so uploading will take a while :)
Here's the coredump compressed with 7z: https://microsofteur-my.sharepoint.com/:u:/g/personal/alkpli_microsoft_com/Ed36-eUF0PZOm-1hEL6QVwMBNRoTbiyBIgX5sh9dY6WR6Q?e=fGpWUP
This was from the artifacts from Helix job 5472e3d3-beb1-49db-af99-5d3100d2a736.
`bt all` output from lldb: https://gist.github.com/akoeplinger/73cda3c6fa725e18d0f3fbc25c929ca6
Let me know if you need anything else.
> I was able to capture a core dump on my local Mac using the Helix artifacts

Cool. Is there a crashlog (.crash) file that you can run `lldb crashlog` against?
This looks suspect, however: the GC is in the middle of a mark phase triggered by an allocation (frames 27 down to 14), and the stack root being reported to `Promote` points at what looks like a bogus object:
thread #15
frame #0: 0x000000010eceff91 libcoreclr.dylib`WKS::gc_heap::mark_object_simple(unsigned char**) [inlined] WKS::mark_queue_t::queue_mark(this=<unavailable>, o="p\x9f\x8c\U0000001d\U00000001") at gc.cpp:26791:9 [opt]
frame #1: 0x000000010eceff8e libcoreclr.dylib`WKS::gc_heap::mark_object_simple(unsigned char**) [inlined] WKS::mark_queue_t::queue_mark(this=<unavailable>, o="p\x9f\x8c\U0000001d\U00000001", condemned_gen=-1) at gc.cpp:26829:16 [opt]
frame #2: 0x000000010eceff7c libcoreclr.dylib`WKS::gc_heap::mark_object_simple(po=<unavailable>) at gc.cpp:27476:17 [opt]
frame #3: 0x000000010ecf2b0b libcoreclr.dylib`WKS::GCHeap::Promote(ppObject=0x0000700006b773a8, sc=<unavailable>, flags=0) at gc.cpp:48915:5 [opt]
frame #4: 0x000000010ec6a7ae libcoreclr.dylib`GcInfoDecoder::ReportUntrackedSlots(GcSlotDecoder&, REGDISPLAY*, unsigned int, void (*)(void*, Object**, unsigned int), void*) [inlined] GcInfoDecoder::ReportSlotToGC(this=0x0000700006860708, slotDecoder=0x0000700006860310, slotIndex=173, pRD=0x0000700006860d90, reportScratchSlots=true, inputFlags=<unavailable>, pCallBack=(libcoreclr.dylib`GcEnumObject(void*, Object**, unsigned int) at gcenv.ee.common.cpp:147), hCallBack=0x00007000068634b0) at gcinfodecoder.h:0 [opt]
frame #5: 0x000000010ec6a6fc libcoreclr.dylib`GcInfoDecoder::ReportUntrackedSlots(this=0x0000700006860708, slotDecoder=0x0000700006860310, pRD=0x0000700006860d90, inputFlags=<unavailable>, pCallBack=(libcoreclr.dylib`GcEnumObject(void*, Object**, unsigned int) at gcenv.ee.common.cpp:147), hCallBack=0x00007000068634b0) at gcinfodecoder.cpp:1040:9 [opt]
frame #6: 0x000000010ec695e5 libcoreclr.dylib`GcInfoDecoder::EnumerateLiveSlots(this=<unavailable>, pRD=<unavailable>, reportScratchSlots=<unavailable>, inputFlags=<unavailable>, pCallBack=<unavailable>, hCallBack=<unavailable>) at gcinfodecoder.cpp:989:9 [opt]
frame #7: 0x000000010ea972cf libcoreclr.dylib`EECodeManager::EnumGcRefs(this=<unavailable>, pRD=0x0000700006860d90, pCodeInfo=0x0000700006860c10, flags=0, pCallBack=(libcoreclr.dylib`GcEnumObject(void*, Object**, unsigned int) at gcenv.ee.common.cpp:147), hCallBack=0x00007000068634b0, relOffsetOverride=4294967295) at eetwain.cpp:5336:24 [opt]
frame #8: 0x000000010eba8353 libcoreclr.dylib`GcStackCrawlCallBack(pCF=0x00007000068609e0, pData=0x00007000068634b0) at gcenv.ee.common.cpp:282:18 [opt]
frame #9: 0x000000010eb269f5 libcoreclr.dylib`Thread::MakeStackwalkerCallback(this=0x00007f92ae04b000, pCF=0x00007000068609e0, pCallback=(libcoreclr.dylib`GcStackCrawlCallBack(CrawlFrame*, void*) at gcenv.ee.common.cpp:200), pData=0x00007000068634b0) at stackwalk.cpp:847:27 [opt]
frame #10: 0x000000010eb26c4a libcoreclr.dylib`Thread::StackWalkFramesEx(this=0x00007f92ae04b000, pRD=0x0000700006860d90, pCallback=(libcoreclr.dylib`GcStackCrawlCallBack(CrawlFrame*, void*) at gcenv.ee.common.cpp:200), pData=0x00007000068634b0, flags=34048, pStartFrame=0x0000000000000000) at stackwalk.cpp:927:26 [opt]
frame #11: 0x000000010eb27084 libcoreclr.dylib`Thread::StackWalkFrames(this=0x00007f92ae04b000, pCallback=(libcoreclr.dylib`GcStackCrawlCallBack(CrawlFrame*, void*) at gcenv.ee.common.cpp:200), pData=0x00007000068634b0, flags=34048, pStartFrame=0x0000000000000000) at stackwalk.cpp:1010:12 [opt]
frame #12: 0x000000010eba5285 libcoreclr.dylib`ScanStackRoots(pThread=0x00007f92ae04b000, fn=(libcoreclr.dylib`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:48849), sc=0x0000700006863588) at gcenv.ee.cpp:204:18 [opt]
frame #13: 0x000000010eba5099 libcoreclr.dylib`GCToEEInterface::GcScanRoots(fn=(libcoreclr.dylib`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:48849), condemned=1, max_gen=2, sc=0x0000700006863588) at gcenv.ee.cpp:303:13 [opt]
frame #14: 0x000000010ece4c1a libcoreclr.dylib`WKS::gc_heap::mark_phase(condemned_gen_number=1) at gc.cpp:29358:9 [opt]
frame #15: 0x000000010ece1306 libcoreclr.dylib`WKS::gc_heap::gc1() at gc.cpp:22324:13 [opt]
frame #16: 0x000000010ececcad libcoreclr.dylib`WKS::gc_heap::garbage_collect(n=0) at gc.cpp:0:21 [opt]
frame #17: 0x000000010ecdbc75 libcoreclr.dylib`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:50393:9 [opt]
frame #18: 0x000000010ecdddf9 libcoreclr.dylib`WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, unsigned int, int) [inlined] WKS::gc_heap::trigger_gc_for_alloc(gen_number=0, gr=<unavailable>, msl=0x000000010ef2e548, loh_p=<unavailable>, take_state=<unavailable>) at gc.cpp:18920:14 [opt]
frame #19: 0x000000010ecdddf2 libcoreclr.dylib`WKS::gc_heap::try_allocate_more_space(acontext=0x00007f92ae824658, size=64, flags=2, gen_number=0) at gc.cpp:19058:34 [opt]
frame #20: 0x000000010ed08f50 libcoreclr.dylib`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) [inlined] WKS::gc_heap::allocate_more_space(acontext=0x00007f92ae824658, size=64, flags=2, alloc_generation_number=0) at gc.cpp:19558:18 [opt]
frame #21: 0x000000010ed08f35 libcoreclr.dylib`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) at gc.cpp:19589:19 [opt]
frame #22: 0x000000010ed08f1a libcoreclr.dylib`WKS::GCHeap::Alloc(this=<unavailable>, context=0x00007f92ae824658, size=64, flags=2) at gc.cpp:49327:34 [opt]
frame #23: 0x000000010eba8aa3 libcoreclr.dylib`Alloc(size=64, flags=GC_ALLOC_CONTAINS_REF) at gchelpers.cpp:227:48 [opt]
frame #24: 0x000000010eba9bf1 libcoreclr.dylib`AllocateObject(pMT=0x0000000110c63688, flags=GC_ALLOC_CONTAINS_REF) at gchelpers.cpp:1101:37 [opt]
frame #25: 0x000000010eaa7cc9 libcoreclr.dylib`FieldDesc::GetStubFieldInfo() [inlined] AllocateObject(pMT=<unavailable>) at gchelpers.h:68:12 [opt]
frame #26: 0x000000010eaa7cc2 libcoreclr.dylib`FieldDesc::GetStubFieldInfo(this=0x0000000110c08250) at field.cpp:803:49 [opt]
frame #27: 0x000000010ebc9689 libcoreclr.dylib`JIT_GetRuntimeFieldStub(field=0x0000000110c08250) at jithelpers.cpp:3635:43 [opt]
There's no .crash, but there is a .ips, which is basically the same thing but JSON-encoded: dotnet-2023-10-09-182901.ips.zip (I also added the rendered report).
Btw, the binary was built from commit d3a782e3bb6ad0c1cb590b41c3f03e733f7d0d61.
Interestingly, the .ips points to Thread 18 (== Thread 19 in lldb, since lldb uses 1-based indexing) as the thread that hit the SIGSEGV; that thread is in libclrjit.dylib`Compiler::fgCompactBlocks(BasicBlock*, BasicBlock*) [inlined] BasicBlock::isLoopAlign(this=0x00007f92afd7ffd0) const at block.h:614:44 [opt]
Here's the disassembly of the function https://gist.github.com/akoeplinger/621f3de8abf8dfd01f62c941d5d552fe
I wasn't able to get `lldb crashlog` to do anything useful, since I didn't find a way to load the .dwarf symbols (just loading them via add-dsym doesn't work, though that does work for the core dump).
Poking at the function a bit:
(lldb) t 19
* thread #19
frame #0: 0x00000001af72a8a1 libclrjit.dylib`Compiler::fgCompactBlocks(BasicBlock*, BasicBlock*) [inlined] BasicBlock::isLoopAlign(this=0x00007f92afd7ffd0) const at block.h:614:44 [opt]
611
612 bool isLoopAlign() const
613 {
-> 614 return ((bbFlags & BBF_LOOP_ALIGN) != 0);
615 }
616
617 void unmarkLoopAlign(Compiler* comp DEBUG_ARG(const char* reason));
(lldb) p bbNext
(BasicBlock *) 0x00007f92afd7ff68
(lldb) p bbPrev
(BasicBlock *) 0x00007f92af840001
(lldb) p bbJumpSwt
(BBswtDesc *) 0x00007f92afd7ff80
(lldb) p bbFlags
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) p bbNum
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
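For context on why `p bbFlags` fails while the pointer-valued fields above still print: `isLoopAlign` is just a bit test on a field loaded through `this`, so the fault means the `BasicBlock*` itself points at unreadable memory (note the implausible, misaligned `bbPrev` value of 0x00007f92af840001 above). A minimal sketch with simplified stand-in types; the real flag definitions live in the JIT's block.h and the bit value here is made up:

```cpp
#include <cstdint>

// Simplified stand-ins for the JIT's BasicBlock flags; the real values are
// defined in src/coreclr/jit/block.h and differ from these.
using BasicBlockFlags = std::uint64_t;
constexpr BasicBlockFlags BBF_LOOP_ALIGN = 1ULL << 20; // hypothetical bit

struct BasicBlock {
    BasicBlockFlags bbFlags;

    // Mirrors the accessor shown in the lldb session above: a plain bit
    // test, which only faults if `this` points at unmapped memory.
    bool isLoopAlign() const { return (bbFlags & BBF_LOOP_ALIGN) != 0; }
};

int main() {
    BasicBlock good{BBF_LOOP_ALIGN};
    bool aligned = good.isLoopAlign(); // fine: reads valid memory

    // A garbage pointer like the one in the crash: the call itself is just
    // a load of bbFlags, so it would SIGSEGV inside isLoopAlign, exactly
    // where the .ips report points.
    BasicBlock* bad = reinterpret_cast<BasicBlock*>(0x00007f92af840001);
    (void)bad;
    // bad->isLoopAlign(); // would SIGSEGV; left commented out

    return aligned ? 0 : 1;
}
```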
@akoeplinger should this be in the codegen area?
Happened to have a look at this and it does appear to only be failing on Mac. It's a bummer we aren't getting dumps there yet @hoyosjs @carlossanlop - this type of issue would really benefit from crash symbolization. I thought with the latest changes that should be working on Macs?
Agree with @danmoseley that this looks more like a codegen issue.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.
I'm actively investigating a product issue where dumps are not getting collected.
@kunalspathak, it seems loop-alignment related. PTAL. It is blocking clean CI.
Are we still seeing this issue? I don't think so.
https://dev.azure.com/dnceng-public/public/_build/results?buildId=598100&view=ms.vss-test-web.build-test-results-tab&runId=14495740&resultId=215086&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
@kunalspathak that's from yesterday; the dump sadly didn't get egressed. The method that failed was Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.Lexer.AddTrivia(Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.CSharpSyntaxNode, Microsoft.CodeAnalysis.Syntax.InternalSyntax.SyntaxListBuilder ByRef) at IL offset 0x1d.
@jakobbotsch @amanasifkhalid - can one of you please take a look, as you recently touched the loops/block layout code? This seems to be accessing a null BasicBlock, and we get a seg fault.
@amanasifkhalid, PTAL.
`fgCompactBlocks` (and the rest of the JIT's flowgraph code, for that matter) has undergone a lot of churn lately. I'm no longer seeing that method come up in the backtraces for recent failures. However, recent crash reports suggest the failure is due to a `System.Reflection.TargetInvocationException` (example); I suppose we seg fault while trying to handle it? In that particular run, thread 14 (which is thread 15 in lldb) crashed with the exception; I included a backtrace in the above gist. You'll see a couple of threads are in the JIT during the crash, but the disassemblies don't seem to have any obvious null dereferences. `BasicBlockVisit BasicBlock::VisitEHSuccs` looks a bit suspect in that we don't have any guards against dereferencing a null `BasicBlock*`, but if we were to pass a null block pointer to it, then we should've attempted to dereference that null pointer earlier in the call stack. Unless I'm missing something, the seg fault doesn't seem to be happening in the JIT.
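For illustration, the kind of guard discussed above would look something like the sketch below; the types and signature are hypothetical simplifications, not the actual `VisitEHSuccs` API from the JIT:

```cpp
#include <cstdio>

// Hypothetical simplified stand-ins; the real BasicBlock::VisitEHSuccs in
// the JIT takes a Compiler* and a functor and walks EH successor blocks.
struct BasicBlock {
    BasicBlock* next = nullptr;
};
enum class BasicBlockVisit { Continue, Abort };

template <typename TFunc>
BasicBlockVisit VisitEHSuccsGuarded(BasicBlock* block, TFunc func) {
    // The guard under discussion: bail out on a null block instead of
    // faulting on the first field access inside the walk. (A real fix
    // would more likely assert, since a null block here indicates a bug
    // further up the call stack, as noted above.)
    if (block == nullptr) {
        return BasicBlockVisit::Abort;
    }
    return func(block);
}

int main() {
    BasicBlock* none = nullptr;
    // With the guard, a null block aborts the walk instead of segfaulting.
    BasicBlockVisit result = VisitEHSuccsGuarded(none, [](BasicBlock*) {
        return BasicBlockVisit::Continue;
    });
    std::printf("visit result: %d\n", static_cast<int>(result));
    return 0;
}
```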
The two recent appearances are from preliminary runs in PRs that had issues. So I would probably hold off looking at the crash dumps (if any).
Since this hasn't hit recently, I'm going to unmark `blocking-clean-ci` for now and keep an eye on this. If it hits again, I'll revisit the crash dumps and (if necessary) re-triage this.
Hard to say; I think codegen is as good a guess as any. Are all the crashes on osx-x64?
It's worth noting that the most recent failure (#100658) was from an intermediate commit that hit other issues in CI, so maybe that was a false positive?
> Are all the crashes on osx-x64?

Not all of them. Some of them hit on Linux arm64.
@amanasifkhalid I apologize for deleting one of your comments. Reminder that internal helix logs should not be shared in GitHub comments.
@riarenas no worries, sorry about that.
This hasn't hit on a "functional" CI run in quite a while. Are we ok with closing this?
The most recent hit was on a draft PR with other failures. Since we haven't had a failure block CI recently, I think we ought to close this to avoid instilling a false sense of confidence in affected PRs.