nimbus-eth2

Start collecting crash reports from end users (opt-in)

Open zah opened this issue 2 years ago • 6 comments

We can use a library such as Google Breakpad to start collecting memory dumps from crashes happening in the wild: https://chromium.googlesource.com/breakpad/breakpad/+/HEAD/docs/getting_started_with_breakpad.md

The feature will be disabled by default, but may be promoted in our guides and installers. The memory dumps will be sent to a server operated by the Nimbus team. Production-ready software for such a server is available from Mozilla: https://github.com/mozilla-services/socorro

Collecting memory dumps requires maintaining a comprehensive archive of build symbols for all released binaries. Again, we can try to replicate the practices of the Firefox project:

https://firefox-source-docs.mozilla.org/toolkit/crashreporter/crashreporter/Using_the_Mozilla_symbol_server.html

Special care must be taken to ensure no sensitive data is included in the dump (e.g. validator private keys). This can be done with the technique discussed in https://github.com/status-im/nimbus-eth2/issues/545.

zah avatar Feb 18 '22 17:02 zah

Eh, this goes against the grain in an ethereum client - there are too many keys and security issues with this, not to mention the privacy issues: our crash-dump server would become a high-value target as well.

arnetheduck avatar Feb 18 '22 19:02 arnetheduck

Techniques like 545 are safety nets really, and should be treated as such - it's too easy to get this wrong.

We've already disabled similar features in the status chat app, for similar reasons.

arnetheduck avatar Feb 18 '22 19:02 arnetheduck

There is a wide spectrum of possible implementations of this feature. On the most conservative end, it could be a compile-time opt-in setting that we use on our fleet or that we recommend to Nimbus and Status contributors who are running nodes in the wild. Currently, we don't have good access to crash data even though this has proved to be useful in the past for debugging various issues. Our process for collecting and working with dumps has been rather ad-hoc and inefficient.

Moving further along the spectrum, you can imagine Nimbus users using this on testnets or in other lower-risk scenarios.

zah avatar Feb 19 '22 19:02 zah

This is a question that needs a very clear answer: we do not automatically collect anything, and should not have any code near the client that does so, flags or not - nor do we want to operate any services that collect and store user data, ever - this is a risk and a liability that we do not have any interest in taking.

The crashes we have had so far have been trivial to resolve, and their number (which can be counted on one hand) in no way motivates a major and ongoing investment (technical, legal, financial) in this kind of feature.

The rarity of crashes has several causes: our use of a mostly-safe language (which avoids many memory-related crashes), our preference for explicit error handling over exceptions (crashes are more commonplace in exception-based software, where the norm is to pass and show exception call stacks to the user) and our many processes around QA etc.

In particular, automated crash and memory dumps will, no matter what we do, sooner or later end up containing user secrets: IP addresses, credentials, keys and anything else floating around in memory. Crashes happen when the application enters a region of the codebase that has undefined or unintended behavior, which voids any and all protections, because the application is no longer doing what the programmer intended.

That said, what can be done instead is to provide some help for users to be able to submit such reports by themselves - these should be fully text-based such that they can be audited for sensitive information etc and could collect information such as call stacks and last few lines of log to be reviewed by the user.
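As an illustration only, a user-reviewable text report of this kind could be assembled roughly as follows. This is a minimal sketch in Python rather than Nim, not anything that exists in Nimbus: the function names (`redact`, `build_report`), the section headers and the two redaction patterns are all hypothetical, and a pattern list this short would certainly miss real secrets, which is why the output is meant to be read by the user before anything is submitted.

```python
import re
from collections import deque

# Hypothetical redaction patterns (assumptions for this sketch only).
# A real tool would need a much broader, carefully reviewed list.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
LONG_HEX = re.compile(r"\b(?:0x)?[0-9a-fA-F]{64,}\b")  # key-sized hex strings


def redact(line: str) -> str:
    """Mask IPv4 addresses and long hex strings in a single line."""
    line = IPV4.sub("<ip>", line)
    return LONG_HEX.sub("<hex>", line)


def build_report(stack_trace: str, log_lines, tail: int = 20) -> str:
    """Build a plain-text report from a call stack and the last `tail` log lines."""
    last = deque(log_lines, maxlen=tail)  # keeps only the final `tail` entries
    parts = ["== call stack =="]
    parts += [redact(line) for line in stack_trace.splitlines()]
    parts.append(f"== last {len(last)} log lines ==")
    parts += [redact(line.rstrip("\n")) for line in last]
    return "\n".join(parts)
```

The point of keeping the result as plain text is that the user can open it, check it for anything sensitive and decide whether to attach it to an issue; nothing is sent anywhere by the code itself.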

> On the most conservative end, It could be a compile-time opt-in setting

Developers and advanced users capable of getting this far are also able to upload a crash dump or a call stack + log file - there's little difference if the application becomes even moderately helpful in this area.

arnetheduck avatar Feb 22 '22 19:02 arnetheduck

I don't think it's a good idea at all.

Running a validator is very sensitive, and this would suddenly make us a very attractive target for collecting the IPs of validators or, worse, the services running on their machines plus the CLI options, including the ports of RPC endpoints.

mratsim avatar Feb 23 '22 12:02 mratsim

Too dangerous. Even if you manage a perfect implementation, it's not the kind of software that should be phoning home.

stefantalpalaru avatar Feb 23 '22 14:02 stefantalpalaru

I'm closing this as there are currently no plans to implement user-facing crash reporting functionality. We will track the goal of capturing crash reports from our existing fleet of servers in another issue.

zah avatar Jan 02 '24 15:01 zah