sdb command failure should generally not exit sdb

When there's a bug or poorly-handled error in a specific command, it shouldn't cause sdb to exit. Instead, we should print an error message (like the stack trace that it currently does) and then return to the sdb> prompt.

Nov 15 '19 05:11 ahrens

@ahrens can you provide a concrete example? I think what you describe is generally already the goal, so I'm wondering if there's a specific bug/command to be fixed where we're not adhering to that goal?

Nov 15 '19 19:11 prakashsurya

Although, I guess the design we've taken is to only catch SDB specific exceptions, and keep those from exiting the session. We decided that if any non-SDB specific exception is raised we should cause that to crash the session; these non-SDB exceptions should be caught by the individual commands, and converted to SDB specific exceptions.

Do you think this design is wrong and/or needs improvement? e.g. since a bug in any specific command could crash the session?

IIRC, we decided on the current design, with the idea that if a command has a bug and doesn't catch an exception that it should, it's better to crash and hopefully get a bug report (so we can fix the bug), rather than "hide" the error and continue on.

Nov 15 '19 19:11 prakashsurya

For example, I should be able to write a command that uses assert. If the assertion fails, I don't think that implies any damage to the sdb REPL, so we can safely return to the command prompt. Or as another example, see https://github.com/delphix/sdb/issues/72. This could also be considered a bug in pp, but why does it have to exit sdb?

IIRC, we decided on the current design, with the idea that if a command has a bug and doesn't catch an exception that it should, it's better to crash and hopefully get a bug report (so we can fix the bug), rather than "hide" the error and continue on.

I think there's a 3rd option which is to expose the error, hopefully get a bug report, and continue on.

The bash shell does not exit when you use it to run a buggy program. That's the usability standard that we should be aiming for.

Nov 15 '19 19:11 ahrens

For example, I should be able to write a command that uses assert.

Can you comment on why you feel one should be able to use assert? I'm not sure I completely agree. For example, we could create some SDB specific functionality to do what you're suggesting, that does not exit the session.

The bash shell does not exit when you use it to run a buggy program.

I'm not sure I agree with this example, since the shell gets this functionality via fork/exec, unless you're implying we should also use fork/exec (or some other isolated environment)?

For example, If the shell executed the command in the same context as the shell process, and that command caused a segfault, the shell would be killed along with the command. The segfault in my example being akin to an exception in python.

What should we do if a command calls sys.exit? Should that exit the session? Or just start a new prompt? I don't mean to be argumentative, genuinely curious where the bounds would be for any design we attempt to adopt.

For #72 specifically, I argue that's a bug in pp, in that it should not be throwing a TypeError exception.

Nov 15 '19 20:11 prakashsurya

I'm saying that my experience with sdb today is that I hit tons of problems in commands that cause sdb to exit. I'm open to ideas on how to address that. If that means "don't use assert, use sdbassert(), that's fine with me"

Nov 15 '19 21:11 ahrens

So here is my opinion on the issue.

Ideally it would be great if all commands had no issues and would take care of all their possible edge case. That is something that we should strive for and ask in code reviews when introducing commands. Looking into the long-term though if SDB commands/modules are to be decoupled from the SDB repo, it can be harder to have control over the quality of the commands imported by SDB in the runtime.

Catching all exceptions from the REPL may be a bit ugly as a programming practice in general. That said in this case, not catching these errors leads to the worst outcome in terms of user-experience that you can get from SDB - getting dropped out of SDB with a non-friendly error message. Furthermore, given that there is no state maintained in the REPL between commands, attempting to recover after a command error is not that bad and most probably won't lead to any problems in future commands issued within the same session.

Just for the above reasons I think that we should probably handle all exceptions and maintain the session. Now how and where we do that is a different questions.

Where we do that?...The first obvious answer is at the REPL code - it is decoupled from everything and can be a self-contained path that will literally catch anything. Another place could be anywhere that commands are issued (e.g. wherever we call the call() method from the execute_pipeline() function) this way we can be specific and handle all exceptions that come from actual commands but not the SDB infrastructure - this is a middle-ground but IMHO not a great solution because there can be multiple places where logic from commands sneaks in the main SDB infrastructure.

How do we do that?...It depends what our goal is - we probably want the user to submit an issue on Github so we print the stack trace, together with the command executed and the target, and print the link that gets them to the New Issue page on Github so they can copy paste their output and submit it if they want too (all of that while still maintaining their session)... I'd like that but it may be too verbose or intrusive for some users, so maybe we can introduce a -v flag as an argument when you first start SDB to optionally do that. There are multiple ways that we can think of this.

In any case, I believe we should not exit SDB just because a command happened to be buggy. Maybe we are ok with this because we are the main users and we also know that "we can always fix the problem later" but for a complete beginner crashing and exiting is discouraging.

Nov 16 '19 00:11 sdimitro

Another place could be anywhere that commands are issued (e.g. wherever we call the call() method from the execute_pipeline() function) this way we can be specific and handle all exceptions that come from actual commands but not the SDB infrastructure - this is a middle-ground but IMHO not a great solution because there can be multiple places where logic from commands sneaks in the main SDB infrastructure.

I think this would be ideal. The primary problem is broken commands (including ones from 3rd parties, as you pointed out). We want to be able to point the finger to the broken command, so that when a user runs a complicated pipeline, they know who to blame. The buggy code could even be in a command that doesn't appear explicitly on the CLI, but is invoked implicitly by pp, or walk. We would want walk to be able to say, we ran walker X and it threw exception Y, which indicates a bug in the X code. Here's the information you need to file a bug: <stack trace, perhaps link to github, etc>.

That said, I agree that would be more work. I'd be OK with catching in the REPL code as well, and they'll have to look at the stack to figure out if the bug is in core sdb, an sdb-provided command, or a 3rd party command. In practice this will probably be good enough.

Ideally we would wrap anywhere that calls out to an arbitrary subclass:

call()
pretty_print()
walk()
Locators (@InputHandler-annotated methods)

I took a look and these are mostly covered in a few places but there are definitely some odds and ends:

2 places in the main sdb class that call() (execute_pipeline, execute_pipeline_term)
Locator.caller: yield from method(i), which invokes the locator; and Walk(...).call([i])
PrettyPrint.call: pretty_print([i])
Spa.pretty_print invokes Vdev.pretty_print
Vdev.pretty_print invokes Metaslab.pretty_print

Nov 16 '19 03:11 ahrens

Would we want the "catch all exceptions" behavior to apply when the commands are used programmatically? e.g. from the unit tests? or perhaps from 3rd party python scripts (if one were to ever exist)?

I haven't looked too closely at the code w.r.t. this issue, but at first thought, I feel like we'd only want this behavior when the user calls the sdb command, and not when they import sdb and use the commands directly via python.

Nov 19 '19 18:11 prakashsurya

Another argument in favor of this behavior (not exiting sdb on command failure) is when there are no bugs in sdb or the commands, but the command code is mismatched with the target. This could happen due to misconfiguration (e.g. running new sdb on an older crash dump), or because the command code hasn't been updated to reflect changes in the kernel (e.g. data structure changed, like #66). In both cases, "something isn't right", but it isn't an issue of being more careful when writing sdb or commands.

FWIW, mdb handled this pretty well. You'd generally get an error message like, could not find member avl_numnodes of type zfs_btree_t* and then the dcmd would exit. (You didn't get the "debug with self?" prompt.)

Nov 20 '19 16:11 ahrens

Filed https://github.com/delphix/sdb/issues/87 and https://github.com/delphix/sdb/issues/88

Nov 20 '19 17:11 sdimitro

Yea, for #66 we should probably be catching the AttributeError exception in some common place, so that we don't have to have the handling for that scattered across all commands. I also think it'd be better if drgn was to throw a drgn-specific exception when that occurs; then we could tailor the handling of the exception to be specific to that issue (e.g. not emit a stack trace, etc.).

Nov 20 '19 17:11 prakashsurya

Yea, for #66 we should probably be catching the AttributeError exception in some common place, so that we don't have to have the handling for that scattered across all commands. I also think it'd be better if drgn was to throw a drgn-specific exception when that occurs; then we could tailor the handling of the exception to be specific to that issue (e.g. not emit a stack trace, etc.).

I agree, we can try and do this ourselves but drgn probably needs to solve the same problem so it could be nice doing it once in drgn and have sdb consume it. I'll add this on the agenda for the upcoming meeting - we can also ping Omar on slack if we want feedback sooner

Nov 20 '19 17:11 sdimitro