status-mobile icon indicating copy to clipboard operation
status-mobile copied to clipboard

catch golang panic to avoid app crash

Open qfrank opened this issue 1 year ago • 16 comments

Feature Issue

Description

We should use the safeGo wrapper function instead of go func...() directly so that we can log some important information such as the invoking stack when a panic occurs to help locate the root cause quickly. This approach can also enhance the user experience instead of causing the app to crash. We should also use recover for each API call. Sometimes it would be hard to solve the crash problem if we couldn't get the accurate log.

func safeGo(f func(), panicHandler ...func(interface{})) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("Panic in goroutine: %v\n%s", r, debug.Stack())
                
                if len(panicHandler) > 0 {
                    panicHandler[0](r)
                }
            }
        }()
        f()
    }()
}

As discussed with @cammellos ,

mmh, not sure we should catch panics panics should basically almost never happen if we catch them, it would incentives dev to write bad code potentially maybe only enable SafeAPICall on prod

qfrank avatar Aug 26 '24 09:08 qfrank

WDYT? @status-im/status-go-guild cc @ilmotta

qfrank avatar Aug 26 '24 09:08 qfrank

The issue with this type of strategy, which is not even exclusive to Go, is that if we always recover panics, there can be cases where the runtime will be left in an inconsistent state and will dangerously continue to run. This is particularly problematic in a code base like status-go where mutations happen all the time. This can make things worse in some cases, giving users an even more unpredictable experience.

Purely on the UX side, I agree, ideally, users of any of software shouldn't face hard crashes. Few things (to me at least) are as frustrating as getting a crash in the middle of something important.

A catch-all panics solution used in development can be a bad incentive. I agree with @cammellos there.

ilmotta avatar Aug 26 '24 09:08 ilmotta

If recover is not used, how can the crash problems encountered by real users be located and solved? In the absence of logs, this will become very difficult. Real users cannot use android logcat and give feedback like testers. If it is an iOS user, this will become even more difficult because the crash information is very difficult to understand. 🤔 @ilmotta

qfrank avatar Aug 26 '24 09:08 qfrank

If recover is not used, how can the crash problems encountered by real users be located and solved?

I believe we could integrate Sentry for this.

igor-sirotin avatar Aug 26 '24 10:08 igor-sirotin

If recover is not used, how can the crash problems encountered by real users be located and solved? In the absence of logs, this will become very difficult. Real users cannot use android logcat and give feedback like testers. If it is an iOS user, this will become even more difficult because the crash information is very difficult to understand. 🤔 @ilmotta

I agree with all the problems you listed @qfrank. Without stack traces it becomes really difficult to diagnose certain problems. In development, I would keep it disabled by default and get hard crashes.

As @igor-sirotin said, something like Sentry would help, but it would be yet another centralization point in the product.

ilmotta avatar Aug 26 '24 10:08 ilmotta

something like Sentry would help, but it would be yet another centralization point in the product.

@ilmotta true, but we could have it enabled it only for devs, e.g. as part of telemetry.

igor-sirotin avatar Aug 26 '24 10:08 igor-sirotin

Can Sentry ensure that the production environment won't have crashes and quickly locate the root cause of crashes for the PR build? If so, I'd love to use it!

There can be cases where the runtime will be left in an inconsistent state and will dangerously continue to run

That's why I added panicHandler for use. It requires developers to handle the panic, whether there is a chance to fix it or just panic again and exit. We can default to panic again!

I meant to only enable catching/logging panics for the production environment by default and disable it in the development environment. We can enable it in some hidden ways, such as shaking the device or pressing a button a specific number of times. When we try to fix a crash, it's painful to start the development environment to reproduce as it's not fast and takes a long time to compile and build... cc @ilmotta @igor-sirotin

qfrank avatar Aug 27 '24 03:08 qfrank

if the issue is logging, what about we catch the panic, log it, and throw it again?

cammellos avatar Aug 27 '24 09:08 cammellos

if the issue is logging, what about we catch the panic, log it, and throw it again?

yeah, equal to We can default to panic again! 👍

qfrank avatar Aug 27 '24 09:08 qfrank

because we can't trust other devs to fix panic when it panic, so basically, the code would be like following:

func safeGo(f func()) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("Panic in goroutine: %v\n%s", r, debug.Stack())
                // panic again
                panic(...)
            }
        }()
        f()
    }()
}

qfrank avatar Aug 27 '24 09:08 qfrank

if the issue is logging, what about we catch the panic, log it, and throw it again?

But doesn't it log the panic by default?

igor-sirotin avatar Aug 27 '24 12:08 igor-sirotin

if the issue is logging, what about we catch the panic, log it, and throw it again?

But doesn't it log the panic by default?

no, point it out if you know it's somewhere :)

qfrank avatar Aug 27 '24 13:08 qfrank

no, point it out if you know it's somewhere :)

hmm, maybe it's only true for tests 🤔 https://github.com/status-im/status-go/issues/5759

igor-sirotin avatar Aug 27 '24 13:08 igor-sirotin

no, point it out if you know it's somewhere :)

hmm, maybe it's only true for tests 🤔 status-im/status-go#5759

@igor-sirotin the panics go to stdout, on mobile that's not easy to access, as on android there's a lot going to stdout, and they only keep the last n bytes, so it's easy to miss. I believe it's harder on ios, @qfrank knows.

cammellos avatar Aug 27 '24 13:08 cammellos

Can Sentry ensure that the production environment won't have crashes and quickly locate the root cause of crashes for the PR build?

@qfrank, Sentry is a tool that helps teams proactively solve problems even before users report them. Teams can set their own alarms and dashboards, which contrasts with our current approach of passively waiting for users to share their logs and explain issues to us. Sentry can help during development, but I think its main value is improved observability in prod. We know the vast majority of users won't ever share logs, they will just vanish 🏃🏼 There is a lot of good stuff from Sentry that traditional apps can use that go beyond logging stack traces.

But Sentry is a piece of centralization, and ideally if we use it, we should make it opt-in. But then we need to update the privacy policy and so on...


  1. Recovering from panics to improve traceability by logging stacktraces is an obvious win. The more info we can get from a failure, the better 👍🏼

  2. About ignoring/swallowing panics to remove crashes: it might be fine, but it would help if status-go didn't have explicit calls to panic. There are not many, but take this example (should be a normal error):

// services/wallet/history/balance.go
if toTimestamp < firstTimestamp {
    panic("toTimestamp < fromTimestamp")
}

Do we need to make an all-or-nothing decision about using safeGo everywhere or nowhere? Why can't we use it in problematic places, and possibly temporarily? I say that because some parts of status-go are well written and well tested and panics should be very rare and we should be able to catch them during dev/QA before they reach users. Other parts, like new wallet code, are not in great shape and could benefit from a more permissive way of handling panics.


Panics can happen outside goroutines and would still crash the app. Any reason why the issue only refers to panics within goroutines?

cc @Samyoul

ilmotta avatar Aug 27 '24 17:08 ilmotta

My idea is as follows: First, add a layer of wrapper for some APIs of mobile/status.go to enable log recording when panic occurs. Secondly, add a layer of wrapper for all go func(){...} to achieve log recording when panic occurs(it should be easy to find and replace, we should have commons to use safeGo to create go routines). As for the RPC call, the current code has already used recover and recorded logs. Maybe logGo would be a better name than safeGo

I tried redirect panic error to log file, but this seems impossible, and recover is the only option for us to get a chance to log after panic

qfrank avatar Aug 28 '24 04:08 qfrank

As was reading this thread I had the thought "What about recovering from a panic only for logging, then re-panic.", but this is the solution you came across anyway.

In my opinion https://github.com/status-im/status-mobile/issues/21126#issuecomment-2312000040 is a reasonable solution. It does make our go routines unconventional, but I think that building for mobile contexts is also unconventional.

I am also in favour of adding a recovery handler like in your original suggestion:

The only thing I'd change is to treat the variadic parameter as a slice of an unknown length, because the expected behaviour of a variadic parameter is a slice with length of zero or more.

func safeGo(f func(), panicHandlers ...func(interface{})) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("Panic in goroutine: %v\n%s", r, debug.Stack())
                
                for _, handler := range panicHandlers {
                    handler(r)
                }
                panic(r)
            }
        }()
        f()
    }()
}

Samyoul avatar Sep 03 '24 10:09 Samyoul