solana icon indicating copy to clipboard operation
solana copied to clipboard

Consistent segfaults on 1.10.34

Open codemonkey6969 opened this issue 2 years ago • 7 comments

I replicated the issue 3 different times:

1st time:

Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809076] blockstore_22[18150]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb3f2373f8 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809080] blockstore_11[18089]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb46c74768 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809090] blockstore_5[18055]: segfault at 561791650aa0 ip 0000561791650aa0 sp 00007ecb4ae95318 error 15
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809096] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809106] in solana-validator[5617915bd000+165000]
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809107] in solana-validator[5617915bd000+165000]
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809118] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00
Aug 7 01:16:23 NM-PROD-RPC1 kernel: [25914.809127] Code: 00 00 40 82 bd 8f 17 56 00 00 90 53 be 8f 17 56 00 00 80 15 be 8f 17 56 00 00 a0 6b be 8f 17 56 00 00 70 40 be 8f 17 56 00 00 <08> f0 60 f2 86 7f 00 00 60 00 b3 90 17 56 00 00 00 00 00 00 00 00

Second time:

Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188157] show_signal_msg: 20 callbacks suppressed
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188159] sol-rpc-el[3075]: segfault at 55d85fbf7aa0 ip 000055d85fbf7aa0 sp 00007f20461e0768 error 15 in solana-validator[55d85fb64000+165000]
Aug 7 13:05:06 NM-PROD-RPC1 kernel: [37423.188166] Code: 00 00 40 f2 17 5e d8 55 00 00 90 c3 18 5e d8 55 00 00 80 85 18 5e d8 55 00 00 a0 db 18 5e d8 55 00 00 70 b0 18 5e d8 55 00 00 <08> a0 87 37 d8 7f 00 00 60 70 0d 5f d8 55 00 00 00 00 00 00 00 00

Third time:

Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988592] rocksdb:low[49990]: segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007f5d7d45d018 error 15 in solana-validator[560233235000+165000]
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.988599] Code: 00 00 40 02 85 31 02 56 00 00 90 d3 85 31 02 56 00 00 80 95 85 31 02 56 00 00 a0 eb 85 31 02 56 00 00 70 c0 85 31 02 56 00 00 <08> 50 87 09 5e 7f 00 00 60 80 7a 32 02 56 00 00 00 00 00 00 00 00
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.991253] solana-window-i[51047]: segfault at 5602332c8aa0 ip 00005602332c8aa0 sp 00007e9c50bf97d8 error 15 in solana-validator[560233235000+165000]
Aug 7 16:26:17 NM-PROD-RPC1 kernel: [49493.991259] Code: 00 00 40 02 85 31 02 56 00 00 90 d3 85 31 02 56 00 00 80 95 85 31 02 56 00 00 a0 eb 85 31 02 56 00 00 70 c0 85 31 02 56 00 00 <08> 50 87 09 5e 7f 00 00 60 80 7a 32 02 56 00 00 00 00 00 00 00 00

This occurs when a massive amount of RPC traffic occurs for an extended period of time. The system COMPLETELY halts. No chance to recover itself as it requires a full power reset on the associated server chassis.

codemonkey6969 avatar Aug 08 '22 04:08 codemonkey6969

@codemonkey6969 - I had asked in Discord but might have been lost, do you have any core dumps available? We definitely appreciate you reporting; however, the segfault logs lines unfortunately don't give us much to go off. Being able to poke around a core dump might be more illuminating.

One interesting note is that I had previously been a little suspicious of blockstore/rocksdb because of https://github.com/solana-labs/solana/issues/25941. The threads in this report that are segfaulting all have a handle to blockstore. More so, one of the segfaults under your Third time section block is a thread spun up by rocksdb.

@yhchiang-sol - FYI for visibility, and maybe you can comment if there might be anything of value in rocksdb logs to inspect.

steviez avatar Aug 08 '22 07:08 steviez

Just enabled apport to capture the core dump. It is only a matter of time before it happens again and can provide you with a core dump.

Re: #25941 that is super interesting. Definitely could see that as a potential issue. I will let you know as soon as I have a core dump to share! Thanks Steviez!

codemonkey6969 avatar Aug 08 '22 07:08 codemonkey6969

I will be traveling within 24 hours so might not have time to take a deeper look, but I will do my best!

Re: https://github.com/solana-labs/solana/issues/25941 that is super interesting. Definitely could see that as a potential issue. I will let you know as soon as I have a core dump to share! Thanks Steviez!

If I remember it correctly, the cause of https://github.com/solana-labs/solana/issues/25941 is due to the fact that rocksdb requires all child threads to be joined before rocksdb starts its shutdown, and https://github.com/solana-labs/solana/issues/25933 tries to fix the tvp and tpu hang but could shutdown rocksdb before joining all child-threads.

yhchiang-sol avatar Aug 08 '22 13:08 yhchiang-sol

Just enabled apport to capture the core dump. It is only a matter of time before it happens again and can provide you with a core dump.

@codemonkey6969 do you happen to have the rocksdb LOG file(s) that is associated with your previous validator crashes? The rocksdb LOG file is under the rocksdb directory inside your ledger directory. It's file name is called LOG. It might have some useful information.

yhchiang-sol avatar Aug 08 '22 13:08 yhchiang-sol

Just enabled apport to capture the core dump. It is only a matter of time before it happens again and can provide you with a core dump.

@codemonkey6969 do you happen to have the rocksdb LOG file(s) that is associated with your previous validator crashes? The rocksdb LOG file is under the rocksdb directory inside your ledger directory. It's file name is called LOG. It might have some useful information.

Here is the rocksdb log file: https://snapshots.nodemonkey.io/snapshots/LOG.txt

Any suggestions on enabling a coredump? I enabled apport but nothing ended up dumping. Thanks.

codemonkey6969 avatar Aug 08 '22 17:08 codemonkey6969

Any suggestions on enabling a coredump? I enabled apport but nothing ended up dumping. Thanks.

I personally use systemd-coredump, but I see no reason as to why apport shouldn't work. It sounds like you already know apport is running, so the other quick thing to check is:

$ ulimit --help
...
      -c	the maximum size of core files created
...

If this value is 0, no core dumps will be created; I have the value unlimited set on my dev machine. Otherwise, this SO post has a good walk through with things to check/try: https://stackoverflow.com/questions/48178535/no-core-dump-in-var-crash

steviez avatar Aug 08 '22 18:08 steviez

Cannot create a core dump. It is not triggering one. I want to reiterate how impactful this issue is to my RPC service. Every time it receives high traffic, the system completely halts. The service doesn't crash, the system simply cannot operate so I have to manually reboot it.

codemonkey6969 avatar Aug 10 '22 17:08 codemonkey6969

Here is the rocksdb log file: https://snapshots.nodemonkey.io/snapshots/LOG.txt

I checked the above rocksdb log file, but there isn't anything abnormal.

@codemonkey6969: can I know whether this is the one that is associated with the crash? The log file starts at 2022/08/08-10:37:18.391988 while your previous reported crashes are all on Aug 7th. In case you happen to see a crash again, can you upload the rocksdb LOG file before you restart the instance? Thank you!

yhchiang-sol avatar Aug 12 '22 22:08 yhchiang-sol

This is a consistent issue. It's super easy to replicate. I unfortunately cannot provide those logs prior to restarting the instance as I have to hard power cycle to even get the machine back online.

Here is the rocksdb log file: https://snapshots.nodemonkey.io/snapshots/LOG.txt

I checked the above rocksdb log file, but there isn't anything abnormal.

@codemonkey6969: can I know whether this is the one that is associated with the crash? The log file starts at 2022/08/08-10:37:18.391988 while your previous reported crashes are all on Aug 7th. In case you happen to see a crash again, can you upload the rocksdb LOG file before you restart the instance? Thank you!

codemonkey6969 avatar Aug 12 '22 22:08 codemonkey6969

Cannot create a core dump. It is not triggering one. I want to reiterate how impactful this issue is to my RPC service. Every time it receives high traffic, the system completely halts. The service doesn't crash, the system simply cannot operate so I have to manually reboot it.

cc @lijunwangs for RPC service-related issues and how to enable additional debug logs.

Btw, there's a recently merged PR that handles JsonRpcService failure more gracefully and thus will prevent rocksdb from running under an unstable situation (such as the validator is already panic). Please feel free to give it a try!

https://github.com/solana-labs/solana/pull/27075

yhchiang-sol avatar Aug 12 '22 22:08 yhchiang-sol

Cannot create a core dump. It is not triggering one. I want to reiterate how impactful this issue is to my RPC service. Every time it receives high traffic, the system completely halts. The service doesn't crash, the system simply cannot operate so I have to manually reboot it.

cc @lijunwangs for RPC service-related issues and how to enable additional debug logs.

Btw, there's a recently merged PR that handles JsonRpcService failure more gracefully and thus will prevent rocksdb from running under an unstable situation (such as the validator is already panic). Please feel free to give it a try!

#27075

Cannot use this PR within the 1.10.34 branch. It references undefined values.

codemonkey6969 avatar Aug 12 '22 23:08 codemonkey6969

Have you tried temporarily rename the ledger store directory and see if the issue reproduce? What is the full command to start the validator?

lijunwangs avatar Aug 13 '22 00:08 lijunwangs

Also can you paste the result of "dmesg"?

lijunwangs avatar Aug 13 '22 00:08 lijunwangs

Have you tried temporarily rename the ledger store directory and see if the issue reproduce? What is the full command to start the validator?

This is replicated across several different machines and different hardware. I have wiped out the ledger and started from scratch. The full script to start, I can provide via a DM on discord. Tim also has it.

codemonkey6969 avatar Aug 13 '22 00:08 codemonkey6969

Also can you paste the result of "dmesg"?

Attached output.txt

codemonkey6969 avatar Aug 13 '22 00:08 codemonkey6969

Also can you paste the result of "dmesg"?

Attached output.txt

Ahh dang; you did give us what he asked for; unfortunately, this is after the boot and thus doesn't show any useful information from the crash. I think you can use journalctl to see logs from previous boots (other feel free to correct me)

steviez avatar Aug 13 '22 01:08 steviez

Also can you paste the result of "dmesg"?

Attached output.txt

Ahh dang; you did give us what he asked for; unfortunately, this is after the boot and thus doesn't show any useful information from the crash. I think you can use journalctl to see logs from previous boots (other feel free to correct me)

Added in discord.

codemonkey6969 avatar Aug 15 '22 21:08 codemonkey6969

This resolved the issue. CC: @lijunwangs @yhchiang-sol

codemonkey6969 avatar Aug 26 '22 18:08 codemonkey6969

@codemonkey6969: Thanks for confirming the PR resolved the issue.

yhchiang-sol avatar Aug 26 '22 23:08 yhchiang-sol