Q: What does it mean when `verifyheap` reports heap errors?
I have a heap dump that verifyheap says contains errors:
Heap Segment Object Failure
2 7f6f2748d500 7f6fd2002fd0 InvalidMethodTable Object 7f6fd2002fd0 has an invalid method table 0
11,914 objects verified, 1 error.
This happened after I left a script running for around 20 hours. The script would create a hello-world ASP.NET Core application, run it and get a heap dump from it. After around 20 hours, the script produced a dump where verifyheap failed.
Is this a bug in the .NET runtime? (It's a self-built one through source-build, if it matters). Could it be that we captured the dump at some wrong point in the runtime execution? How would I go about finding the correct place to report this issue to get it fixed?
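For a script that checks each dump automatically, the `verifyheap` text can be parsed into failure rows and a summary. The following is a minimal sketch that assumes the output always has the shape shown above (a `Heap Segment Object Failure` header, one row per failure, and an `N objects verified, M error(s).` line). The names `Failure`, `VerifyResult`, and `parse_verifyheap` are hypothetical and are not part of SOS or dotnet-dump.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Failure:
    heap: int       # GC heap index
    segment: int    # segment address
    obj: int        # address of the object that failed verification
    kind: str       # failure kind, e.g. "InvalidMethodTable"
    message: str    # free-form description printed after the failure kind


@dataclass
class VerifyResult:
    verified: int = 0
    error_count: int = 0
    failures: list = field(default_factory=list)


# One failure row: decimal heap index, two hex addresses, failure kind, message.
_ROW = re.compile(r"^(\d+)\s+([0-9a-f]+)\s+([0-9a-f]+)\s+(\w+)\s+(.*)$", re.IGNORECASE)
# Summary line, e.g. "11,914 objects verified, 1 error."
_SUMMARY = re.compile(r"^([\d,]+) objects verified, ([\d,]+) errors?\.?$")


def parse_verifyheap(text):
    """Parse verifyheap output (format assumed from the sample in this thread)."""
    result = VerifyResult()
    for raw in text.splitlines():
        line = raw.strip()
        summary = _SUMMARY.match(line)
        if summary:
            result.verified = int(summary.group(1).replace(",", ""))
            result.error_count = int(summary.group(2).replace(",", ""))
            continue
        row = _ROW.match(line)
        if row:
            heap, segment, obj, kind, message = row.groups()
            result.failures.append(
                Failure(int(heap), int(segment, 16), int(obj, 16), kind, message)
            )
    return result
```

A script could then flag a dump for closer inspection whenever `error_count` is non-zero or `failures` is non-empty.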
This failure could be one of the following (most likely #1):
- Catching the GC at a point where the heap isn't consistent when the dump is created. SOS is supposed to display a message if the GC is in the middle of something. Maybe the newly rewritten verifyheap doesn't do that anymore (need to check with Lee). !sosstatus will check if the GC is in a valid state.
- Missing part of the heap. The heap is consistent, but the dump generator didn't capture the whole heap. This would be an unlikely bug in createdump.
- A bug in the verifyheap/dumpheap commands or the ClrMD support. They were recently rewritten and there could be a bug.
I am using dotnet dump:
$ dotnet dump --version
7.0.421201+e01dddacec76df94dbd83b15d8b55aa5a6bb3b9e
$ dotnet dump analyze coredump-2023-06-07.fails-to-verifyheap
Loading core dump: coredump-2023-06-07.fails-to-verifyheap ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> sosstatus
Target OS: LINUX Architecture: X64 ProcessId: 2255343 (0x2269EF)
Dump path: /home/omajid/dotnet-dump-archive/coredump-2023-06-07.fails-to-verifyheap
Current symbol store settings:
-> Directory: /home/omajid/dotnet-dump-archive
-> Cache: /home/omajid/.dotnet/symbolcache
-> Server: https://msdl.microsoft.com/download/symbols/ Timeout: 4 RetryCount: 3
GC memory usage for managed SOS components: 1,532,312 bytes
> verifyheap
Heap Segment Object Failure
2 7f6f2748d500 7f6fd2002fd0 InvalidMethodTable Object 7f6fd2002fd0 has an invalid method table 0
11,914 objects verified, 1 error.
Sorry, I gave you the wrong command to check the GC. It is eeversion.
> verifyheap
Heap Segment Object Failure
2 7f6f2748d500 7f6fd2002fd0 InvalidMethodTable Object 7f6fd2002fd0 has an invalid method table 0
11,914 objects verified, 1 error.
> eeversion
7.0.523.21201
7.0.523.21201 @Commit: 8042d61b17540e49e53569e3728d2faa1c596583
Server mode with 20 gc heaps
SOS Version: 7.0.421201
>
@omajid is it possible to share the dump and the koji build that reproduced this issue?
The dump is here: https://people.redhat.com/~omajid/coredump-2023-06-07.fails-to-verifyheap
The SDK is https://koji.fedoraproject.org/koji/buildinfo?buildID=2186314. If you are running Fedora 38, you can install it using just dnf install -y dotnet-sdk-7.0 right now.
For a more reproducible setup:
FROM registry.fedoraproject.org/fedora:38
RUN dnf install -yq koji \
&& koji download-build -a x86_64 dotnet7.0-7.0.105-1.fc38 \
&& dnf install -yq ./*rpm \
&& dotnet --info
hmm @omajid I might not be as familiar with dnf/koji. I tried getting the debuginfo packages. However, LLDB always gives me:
(lldb) target create "dotnet" --core "coredump-2023-06-07.fails-to-verifyheap"
warning: (x86_64) /usr/bin/dotnet unsupported DW_FORM values: 0x1f20 0x1f21
warning: (x86_64) /usr/lib64/dotnet/host/fxr/7.0.5/libhostfxr.so unsupported DW_FORM values: 0x1f20 0x1f21
warning: (x86_64) /usr/lib64/dotnet/shared/Microsoft.NETCore.App/7.0.5/libhostpolicy.so unsupported DW_FORM values: 0x1f20 0x1f21
warning: (x86_64) /usr/lib64/dotnet/shared/Microsoft.NETCore.App/7.0.5/libcoreclr.so unsupported DW_FORM values: 0x1f20 0x1f21
warning: (x86_64) /usr/lib64/dotnet/shared/Microsoft.NETCore.App/7.0.5/libcoreclrtraceptprovider.so unsupported DW_FORM values: 0x1f20 0x1f21
warning: (x86_64) /usr/lib64/dotnet/shared/Microsoft.NETCore.App/7.0.5/libclrjit.so unsupported DW_FORM values: 0x1f20 0x1f21
warning: (x86_64) /usr/lib64/dotnet/shared/Microsoft.NETCore.App/7.0.5/libSystem.Native.so unsupported DW_FORM values: 0x1f20 0x1f21
Under both LLDB and WinDbg I don't get much symbolic information on the GC types, so debugging this is hard. The symbols also have no line information in LLDB. Is this a known issue?
The only thing I can see is that the GC is waiting and I can't see the state, but it doesn't seem to be doing any planning. However, heap 2 has a 24-byte gap at the end of the only gen0 region [7f6fd2002fd0-7f6fd2002fe8], which is below the high alloc mark, but these 24 bytes are all 0's, and we always expect at least the free object method table. It's after the array of thread samples in the hill climber. cc: @dotnet/gc in case they have seen this.
Interestingly enough, the gchist_index is 0, bgc is not happening, and the gc state is free. So I am not sure how we got to a point where there's no MT there, yet the memory is considered alloc'd. Moving to the runtime to see if folks have any better ideas.
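To illustrate why a zeroed gap below the alloc mark trips a heap walk, here is a toy model of walking a region object by object. It is a sketch under stated assumptions, not how the GC or SOS is implemented: only the two gap addresses come from this thread. The method table values, the fixed per-type sizes, and the 48-byte object placed before the gap are invented for illustration.

```python
import struct

# Addresses taken from the thread; everything else below is hypothetical.
GAP_START = 0x7F6FD2002FD0
GAP_END = 0x7F6FD2002FE8  # end of the gap, still below the alloc mark

# Hypothetical method table addresses with fixed sizes (real sizes depend on the type).
FREE_OBJECT_MT = 0x1000
SAMPLE_ARRAY_MT = 0x2000
SIZES = {FREE_OBJECT_MT: 24, SAMPLE_ARRAY_MT: 48}

# Hypothetical start of the object that precedes the gap.
BASE = GAP_START - SIZES[SAMPLE_ARRAY_MT]


def build_region(gap_mt):
    """Return the bytes for [BASE, GAP_END): one array object, then the gap."""
    array_obj = struct.pack("<Q", SAMPLE_ARRAY_MT).ljust(SIZES[SAMPLE_ARRAY_MT], b"\x01")
    gap = struct.pack("<Q", gap_mt).ljust(GAP_END - GAP_START, b"\x00")
    return array_obj + gap


def walk_region(memory, base, allocated, sizes):
    """Walk objects from base up to allocated; return (objects, errors)."""
    objects, errors = [], []
    addr = base
    while addr < allocated:
        # The first pointer-sized word of every object is its method table.
        (mt,) = struct.unpack_from("<Q", memory, addr - base)
        if mt not in sizes:
            errors.append(f"Object {addr:x} has an invalid method table {mt:x}")
            break  # without a valid method table the size is unknown, so the walk stops
        objects.append((addr, mt))
        addr += sizes[mt]
    return objects, errors


# Gap filled with a free object: the walk reaches the alloc mark cleanly.
healthy_objects, healthy_errors = walk_region(build_region(FREE_OBJECT_MT), BASE, GAP_END, SIZES)

# Gap left as zeros, as in the dump: the walk finds method table 0 at GAP_START.
zeroed_objects, zeroed_errors = walk_region(build_region(0), BASE, GAP_END, SIZES)
```

In this model the walker can only step over the gap when a free-object method table tells it how large the gap is, which is why 24 zero bytes below the alloc mark are reported as an invalid object rather than skipped.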
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
| Author: | omajid |
|---|---|
| Assignees: | hoyosjs |
| Labels: | |
| Milestone: | - |
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
Adding the GC tag since the dump contains the memory showing that state. This is source-built 7.0.523.21201 @Commit: 8042d61b17540e49e53569e3728d2faa1c596583
Most likely some heap corruption. It might need running with DOTNET_HeapVerify to check whether it repros with that.
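A repro script could launch the app with that setting through the process environment. This is a minimal sketch: the value `1` is an assumption (the comment above names only the variable, not a level), and `heap_verify_env` and `run_with_heap_verify` are hypothetical helper names.

```python
import os
import subprocess


def heap_verify_env(base_env=None, level="1"):
    """Return a copy of the environment with DOTNET_HeapVerify set.

    The default level "1" is an assumption; pick whichever level the
    runtime team asks for.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_HeapVerify"] = level
    return env


def run_with_heap_verify(command):
    """Run command (a list of arguments) with heap verification requested."""
    return subprocess.run(command, env=heap_verify_env(), check=False)
```

For example, `run_with_heap_verify(["dotnet", "run"])` would start the app from the project directory with the variable set, leaving the parent shell's environment untouched.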
Some quick chat with maoni - looks like an SOS bug.
Regarding "Is this a known issue?": it's possible the debug symbols are broken. Do you have a small test case I can use to verify that the symbols are functional/usable?
I just used your dump with the fedora 38 image, lldb from dnf and the debuginfo packages I got from Koji.
This should be fixed with the latest version of SOS. Please let us know if there are any issues with the updated command!