WalletWasabi icon indicating copy to clipboard operation
WalletWasabi copied to clipboard

Crash gracefully

Open turbolay opened this issue 2 years ago • 13 comments

Thanks @molnard for part of the code and @adamPetho for the help

This PR provides a way to crash the software when we reach an unrecoverable exception and display the crash reporter. The PR also uses this new function when the WalletFilterProcessor cannot work anymore

On Fluent.Desktop, the exception is logged twice when WalletFilterProcessor crashes, but I think it's ok because on the Daemon it will only be logged once.

turbolay avatar Feb 08 '24 14:02 turbolay

I tested with #12392 and can confirm that it works as expected EDIT: PR Is rebased, so you can simply test this by adding a line throw new Exception("test"); in WalletFilterProcessor.ExecuteAsync

turbolay avatar Feb 08 '24 17:02 turbolay

From what @yahiheb sent, the PR is working as expected, but it seems that the CrashReporter still doesn't work on master for windows?? It works on Mac now. Can you test again @adamPetho?

turbolay avatar Feb 09 '24 05:02 turbolay

Can you test again?

Yep, still no Crash Reporter on Win11. Only a hanging WW process, which does nothing.

adamPetho avatar Feb 09 '24 10:02 adamPetho

We have talked about this PR with Clement and this approach is really not good. Just to shortly say why:

  • Adding static is a red flag.
  • Child components depending on parent components is another red flag (here it's WalletFilterProcessor depending on TerminateService)

A better approach is IMO that one detects that WalletFilterProcessor's execution task failed and then this signal should be propagated and should lead to crash with crash reporter instance being shown.

It's IMO better to better to crash immeditely in case some unhandled exception was thrown. Otherwise, there is a risk that the graceful shutdown makes things worse.

This concept of checking of services for exceptions can be then used for all services without modifying the services themselves. That would be a win as well.

kiminuo avatar Feb 15 '24 09:02 kiminuo

It's IMO better to better to crash immeditely in case some unhandled exception was thrown. Otherwise, there is a risk that the graceful shutdown makes things worse.

Graceful shutdown means to me that I ask the components to stop. Crash immediately means to just stop the execution of the Threads / Tasks wherever they are. Is this what you mean by that? If disposals are not called and loops are not canceled it will lead to an unknown state of the underlying resources - I am sure we cannot cover all the cases to test if it is safe and won't cause issues at the next start of the application.

First, we need to decide on this!

Adding static is a red flag.

We have multiple options, see this and below. Add the static and make it work - is something preventing us from that? I don't see anything that would be broken by that, moreover, it could fix what is broken now. In general, it is a red flag but red flags do not mean it is an issue by default, just something we need to look very closely and consider wisely. One more perspective on this, is TerminateService a program-wise service that everyone can use? Like Logging or HostedServices. If it is like that we might consider having a similar structure here.

This concept of checking of services for exceptions can be then used for all services without modifying the services themselves.

I can imagine a solution with an interface that is added to all classes that can cause crashes and gather them in TerminateService. Whenever there is a signal it will start the crash.

molnard avatar Feb 15 '24 11:02 molnard

It's IMO better to better to crash immeditely in case some unhandled exception was thrown. Otherwise, there is a risk that the graceful shutdown makes things worse.

Graceful shutdown means to me that I ask the components to stop.

Graceful shutdown is typically: A program is running fine and at some point user decides to turn it off and at that point everything is stopped and disposed. The important part is that the program is running "fine" (no service is terribly broken)

Crash immediately means to just stop the execution of the Threads / Tasks wherever they are. Is this what you mean by that?

By crash I mean that the program should stop all its threads immediately and no cleanup should be done whatsoever.

So I'm saying "if WalletFilterProcessor (WFP) throws an exception that is unhandled or effectively unhandled (e.g. just logging is not enough to handle an exception) then we should crash and show the crash reporter, not ask for a graceful shutdown"[^1]. The reason why I think it makes good sense is: If there is an unhandled exception in the WFP (or any other background service really, WFP is a placeholder here), then there is a risk that during that graceful shutdown you make things worse (e.g. store corrupted data, or throw another exception at a different place that would obfuscate the original error, or some other service stops working because the other is dead, etc. etc).[^2]

However, to clarify it, it's important to say when one should not crash. And that is whenever we can fully recover from an exceptional state and we actually implemented that handling (like if there are connectivity issues we can recover, we simply try again, etc. So this is covered and hence no crash is needed.)

Why should we care? Because if the app crashes, then we can fix it. If we don't crash and we just log the error, then the likelyhood of the issue being fixed is small (the so called it somehow works so why bother principle)[^3].

[^1]: An analogy here is "if sudden big fire erupts in a house, the owner leaves immediately and calls firefighters. She skips the normal house-leaving behavior: 1) turn off lights, 2) turn off TV 3) lock the door, etc." [^2]: Note that if a (background) service throws an unhandled exception and stops working it means that we do not have robust enough implementation for that service and so we don't really know how bad the situation is. It might be actually OK or it might be catastrophic (an invalid program state). So it makes more sense to kill the program than to pretend that everything is fine. [^3]: There is also a precedent, when we worked on CoinsRegistry it had multiple quite severe bugs (for example, undo function did not work properly). It become obvious only after we added more strict checks and began to throw exceptions that these bugs could be fixed.

kiminuo avatar Feb 16 '24 10:02 kiminuo

  • Agreement archived: we need to stop/crash the program for sure. The issue here is unrecoverable. Simply just throwing an exception is not enough because it won't stop the program or visible, it is a background task.

  • If there is an unhandled exception in the WFP (or any other background service really, WFP is a placeholder here), then there is a risk that during that graceful shutdown you make things worse (e.g. store corrupted data

    I think the risk of corrupting data is much higher if we just kill the application. For me, this seems obvious. In this specific case I fail to see how it can go bad if we do graceful, but I see numerous issues if we just crash. You mentioned this multiple times "things can get worse", can you provide me a real example what can break in this specific case?

  • We made numerous efforts to get rid of individual exit points in the application. It was a good move and an improvement - we have to keep it this way while figuring out the solution.

molnard avatar Feb 16 '24 11:02 molnard

  • Agreement archived: we need to stop/crash the program for sure. The issue here is unrecoverable. Simply just throwing an exception is not enough because it won't stop the program or visible, it is a background task.

Yes, the exception needs to lead to that action in general. I proposed "propagation of the exception", you did that stuff with the TerminateService.

  • I think the risk of corrupting data is much higher if we just kill the application. For me, this seems obvious. In this specific case I fail to see how it can go bad if we do graceful, but I see numerous issues if we just crash.

Could you describe them?

You mentioned this multiple times "things can get worse", can you provide me a real example what can break in this specific case?

So my concern are scenarios like:

  1. A WW instance runs
  2. A component breaks (e.g. we crash on a blockchain reorg) -> we gracefully shut down and we store some state that is neither "before reorg" or "after reorg"
  3. User starts another instance of WW and her balance is invalid because we won't re-process some blocks.

Something like this.

Anyway, the question is a philosofical one, not that much practical[^1]. I'm saying "if a part of application crashes, attempt to terminate it immediately, not to make things worse" (a pessimistic approach). You are saying "if a part of application crashes, then it's still safe to terminate gracefully" (that's an optimistic approach). Note that there are still scenarios when you don't have it under your control -- when a machine battery dies.

I don't believe that there is an approach that would work always. It feels like all approaches are compomises.

edit: This is what I basically talk about https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html#guidelines-for-error-handling.

[^1]: Whatever example I come up with, you can say "it will be ok" (and be right) but then the code changes and nobody is checking how services can crash, what the relationships are, etc. So it means that if you are right now, you might not be right tomorrow.

kiminuo avatar Feb 16 '24 22:02 kiminuo

Could you describe them?

I assumed we use Environment.Exit but the consequences are anyway even if we use something else.

  • Immediate Termination: cleanup code (like finally blocks or destructors) may not necessarily run, depending on where and how Environment.Exit is called. This can lead to resources not being released properly or other cleanup tasks not being completed.

    • Backup wallet https://github.com/molnard/WalletWasabi/blob/master/WalletWasabi/Wallets/WalletManager.cs#L383
    • Saving Prison to file https://github.com/molnard/WalletWasabi/blob/master/WalletWasabi/WabiSabi/Client/Banning/CoinPrison.cs#L156
    • File.WriteAllText
    • All BackgroudService and PeriodicRunner.
  • Skipping Finalization: Some finalizers (destructors) may not run if "crash" is used. The runtime tries to run finalizers if Environment.Exit is called, but if the exit is immediate and the finalizers take too long, they may not complete.

  • Thread Termination: Environment.Exit will terminate all threads in the application. If there are background operations or other threads running that are doing important work or require proper shutdown to maintain data integrity, using Environment.Exit can disrupt these processes.

    • Tx processor. The execution can stop anywhere without breaking anything. How to prove/garantee that? https://github.com/molnard/WalletWasabi/blob/master/WalletWasabi/Blockchain/TransactionProcessing/TransactionProcessor.cs#L225
  • Application State: If the application maintains state or has temporary data that needs to be saved or persisted, calling Environment.Exit without ensuring this data is saved can result in data loss.

  • External Resources and Commitments: If your application is holding external resources (like database connections, file handles, network connections, etc.), using Environment.Exit might not allow for these resources to be released or committed properly. This could lead to data corruption, lost transactions, or other issues with resources that were being managed by the application.

    • SQL, I guess that should not break. https://github.com/molnard/WalletWasabi/blob/master/WalletWasabi/Blockchain/Transactions/TransactionStore.cs#L255 -Microservices, Tor, Bitcoind
  • Use in Libraries: If you're developing a library that is used by other applications, calling Environment.Exit within a library is highly discouraged. It can unexpectedly terminate the host application, leading to a poor user experience and making it difficult for applications using your library to manage their lifecycle properly.

molnard avatar Feb 19 '24 21:02 molnard

  • Agreement archived: we need to stop/crash the program for sure. The issue here is unrecoverable. Simply just throwing an exception is not enough because it won't stop the program or visible, it is a background task.

Yes, the exception needs to lead to that action in general. I proposed "propagation of the exception", you did that stuff with the TerminateService.

  • I think the risk of corrupting data is much higher if we just kill the application. For me, this seems obvious. In this specific case I fail to see how it can go bad if we do graceful, but I see numerous issues if we just crash.

Could you describe them?

You mentioned this multiple times "things can get worse", can you provide me a real example what can break in this specific case?

So my concern are scenarios like:

  1. A WW instance runs
  2. A component breaks (e.g. we crash on a blockchain reorg) -> we gracefully shut down and we store some state that is neither "before reorg" or "after reorg"
  3. User starts another instance of WW and her balance is invalid because we won't re-process some blocks.

Something like this.

Anyway, the question is a philosofical one, not that much practical1. I'm saying "if a part of application crashes, attempt to terminate it immediately, not to make things worse" (a pessimistic approach). You are saying "if a part of application crashes, then it's still safe to terminate gracefully" (that's an optimistic approach). Note that there are still scenarios when you don't have it under your control -- when a machine battery dies.

I don't believe that there is an approach that would work always. It feels like all approaches are compomises.

edit: This is what I basically talk about https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html#guidelines-for-error-handling.

Footnotes

  1. Whatever example I come up with, you can say "it will be ok" (and be right) but then the code changes and nobody is checking how services can crash, what the relationships are, etc. So it means that if you are right now, you might not be right tomorrow.

Yes I agree, the cases have to be evaluated one by one. However, we could set up a rule of what is the default crash and what can be used only if there is a very strong reason - to help future decisions and minimize confusion.

molnard avatar Feb 19 '24 21:02 molnard

I'll try to summarize my thoughts a bit more:

I would add that my UnreachableExceptions are not really supposed to reach production. So the idea was that one throws UEs when the application hits a code path that is totally wrong and it should never happen[^1]. So if the application then crashes, it's OK because it will force people to fix it. This is true for my approach and your approach. So that's good. Once the application is sufficiently tested and no UE is thrown, then the software can be released (nothing new, we test before each release; however, any instance of crash reporter being invoked is much harder to ignore than "it feels like something wrong happened but I'm not really sure")

Anyway, then there are two approaches what to do when such a big bug happens:

  • Graceful shutdown
    • Pro: state is not lost (IMO it's questionable pro though)
    • Con: By definition "if something weird happened and storing the application state after something weird happened can result in invalid state being stored"
  • Crash immediately
    • Pro: Some state can be lost (possibly some block will need to be scan again, etc.)
      • Personally, I don't see this as a smaller issue than storing something invalid later on (my reog example above)
    • Con: If a file is being written to, then there is a risk of partial write which would corrupt the file
      • Note that this is not something new though. If a machine is out of battery, then the very same thing can happen. So the underlying problem is there. And the problem would be fixed by using an SQLite database that guarantees that a transaction either happens or not.

Now regarding https://github.com/zkSNACKs/WalletWasabi/pull/12395#issuecomment-1953174380, I'll try to address it one by one:

[^1]: Meaning, whoever implemented that certainly did not put effort to mitigate that state and it might not be mitigable at all. In short, there is a big bug that must be fixed.

  • Use in Libraries: If you're developing a library that is used by other applications, calling Environment.Exit within a library is highly discouraged. It can unexpectedly terminate the host application, leading to a poor user experience and making it difficult for applications using your library to manage their lifecycle properly.

We would not put Environment.Exit in components. The idea was to throw UnreachableException and then on Global level, we would register what happens with background services when they fail with such exception. So the classes themselves would throw UEs and anyone who uses WW as a library would observe an UE being thrown, certainly not Environment.Exit.

  • Skipping Finalization: Some finalizers (destructors) may not run if "crash" is used. The runtime tries to run finalizers if Environment.Exit is called, but if the exit is immediate and the finalizers take too long, they may not complete.

I don't think we use finalizers really. But even if we do, then you are not guaranteed that they are called

  • Application State: If the application maintains state or has temporary data that needs to be saved or persisted, calling Environment.Exit without ensuring this data is saved can result in data loss.

This is true. However, we try to establish here if risk of losing some data is worse than risk of storing invalid data. This is not a trivial question. Your approach can be better, mine can be better, or it depends on a situation. So the question should be narrowed down to "what is more likely be better", I guess.

  • External Resources and Commitments: If your application is holding external resources (like database connections, file handles, network connections, etc.), using Environment.Exit might not allow for these resources to be released or committed properly. This could lead to data corruption, lost transactions, or other issues with resources that were being managed by the application.

So once the application terminates, file handles, network connections and other resources are released.

If an instance of bitcoind is started and a crash occurs, then it might not be stopped (btw: this is what happens in integration tests all the time).

Tor is either left to be running (by design), or there is an option to stop the process once WW terminates (AFAIK).

The next step: We can pick either my approach or your approach and just implement it and we can later on re-evaluate if the decision was good or not (in my mind, that should be a minimal change to change "graceful" -> "crash" or vice versa)


We should not really use files for storage purposes. History has IMO shown good enough that it has limitations. A database is more robust and allows to do more with the data.

kiminuo avatar Feb 20 '24 11:02 kiminuo

The rule for unmanaged exceptions is to crash. Also, the reference to TerminateService looks ni good.

lontivero avatar Feb 20 '24 12:02 lontivero

This is true. However, we try to establish here if risk of losing some data is worse than risk of storing invalid data. This is not a trivial question. Your approach can be better, mine can be better, or it depends on a situation. So the question should be narrowed down to "what is more likely be better", I guess.

Do it as we do now and did in the past years. Signal for termination.

We can only crash if all the issues mentioned above like File writings are made crash-safe - not the way around. But I guess that's not worth it. I suggest starting with graceful and investigating the reorg stuff as well.

molnard avatar Feb 20 '24 16:02 molnard

Ok so well it has been a mess :D I'm pretty convinced myself this PR was not the way to go, but as I understood we chose to go with it. Maybe it will just be temporary?

Master is merged, IMO this PR is ready to go, but still a few questions @molnard :

  1. Should we implement Lucas' idea of using an event? It's basically just a style change over the function that I am using here.
  2. Should we crash in the same way we Force Kill instead of the same way we CTRL + C? This is an echo to something lucas said as well that we shouldn't crash gracefully. Not sure what the conclusion was on this.
  3. Should I implement the crash in the same services Kimi did in #12606, which makes sense even if probably not super required

turbolay avatar Mar 14 '24 18:03 turbolay

Maybe it will just be temporary?

No, it is the final solution. This argument caused a major distraction in our team (typical minefield category). I will try to avoid getting back to this for a while by making more powerful/forceful decisions for the sake of team integrity.

  • Evaluate the impact on the business.
  • Take the simplest solution that works.
  • Forget about it and focus on "productive" things.

Should we implement Lucas' idea of using an event? It's basically just a style change over the function that I am using here.

It is OK.

Should we crash in the same way we Force Kill instead of the same way we CTRL + C?

No. I want graceful shutdowns.

Should I implement the crash in the same services Kimi did in

It is fine.

molnard avatar Mar 15 '24 13:03 molnard