Zed process failure during Export shows no error and leaves partial results

Open philrz opened this issue 4 years ago • 1 comments

Repro is with Brim commit 9f06e82 which uses Zed commit 77c760f.

Before the attached video begins, I've already imported the Zeek TSV Zed-sample-data. On this system if I click Export to turn the contents of the Pool into a single ZNG file on my desktop, the operation typically completes in about 8 seconds. However, I've found that if I kill the Zed process mid-export, the "Exporting..." spinner just keeps going indefinitely (during testing I've left it for over an hour).

https://user-images.githubusercontent.com/5934157/130137786-642a8f23-5fa7-4623-b5f8-554d0d1cb80f.mp4

Obviously, having the backend spontaneously die is not something we expect to happen often, though it has been known to happen. For instance, I happened to find this issue because I was trying to repro a separate issue where the Zed backend spontaneously died and we don't yet understand why.

It should also be noted that if I attempt other operations in the app at this point that depend on the backend, I do get the expected "service could not be reached" messages, so it's not like the user is kept completely in the dark. I expect it would only be frustrating if they sat there waiting a long time for either the spinner to exit or an error message to appear, and neither ever came.

I don't know if there are other spots in the app where this kind of problem could arise, but perhaps we could benefit from some kind of standard heartbeat-enhanced communication channel for talking to the backend so we could provide graceful failure messages after reasonable timeouts?
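To make the heartbeat idea a bit more concrete, here is a minimal sketch of what such a watchdog might look like. Everything in it is hypothetical: `HeartbeatWatchdog`, `guard`, the timeout values, and the error text are illustrative names and numbers, not anything that exists in Brim or Zed today. It assumes the app has some way to call `beat()` whenever the backend answers a ping or delivers data.

```typescript
// Hypothetical sketch only; none of these names come from Brim or Zed.
type Clock = () => number;

class HeartbeatWatchdog {
  private readonly timeoutMs: number;
  private readonly now: Clock;
  private lastBeat: number;

  // The clock is injectable so the logic can be exercised without real waiting.
  constructor(timeoutMs: number, now: Clock = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastBeat = now();
  }

  // Call whenever the backend shows a sign of life (ping reply, data chunk).
  beat(): void {
    this.lastBeat = this.now();
  }

  // True once more than timeoutMs has passed since the last sign of life.
  isExpired(): boolean {
    return this.now() - this.lastBeat > this.timeoutMs;
  }
}

// Wraps a long-running operation and rejects it if the watchdog expires
// before the operation settles on its own.
function guard<T>(
  op: Promise<T>,
  dog: HeartbeatWatchdog,
  pollMs: number = 1000,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setInterval(() => {
      if (dog.isExpired()) {
        clearInterval(timer);
        reject(new Error("backend heartbeat timed out"));
      }
    }, pollMs);
    op.then(
      (value) => {
        clearInterval(timer);
        resolve(value);
      },
      (err) => {
        clearInterval(timer);
        reject(err);
      },
    );
  });
}
```

With something along these lines, an export could be wrapped in `guard(...)` so that a dead backend turns into a rejected promise after a bounded wait, which the UI could then report instead of spinning indefinitely.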

Aug 19 '21 20:08 philrz

I revisited this problem to see if it's changed with recent updates. It does indeed seem to have lowered in severity, though there's still room for improvement.

On Windows, where the problem was first identified, killing the Zed process no longer causes the "Exporting..." spinner to continue indefinitely. As shown in the video below taken with GA Brim tagged v0.31.0, after about 25 seconds, the "Exporting..." spinner does quietly disappear. At this point if the user attempts another action in the app that hits the network, they do see an error message indicating the problem communicating with the backend. This is an improvement, since seeing that failure would likely prepare them to expect a corrupt export file.

https://user-images.githubusercontent.com/5934157/186000894-afee159b-b7c3-41c0-a78b-a095b1b57aa8.mp4

I also confirmed the same is true with current Brim, via a test with Zui Insiders 0.30.1-50.

https://user-images.githubusercontent.com/5934157/186001288-84ccaf65-c403-46df-9e50-7383bd1b6719.mp4

On macOS, the symptom is a little different. In this case the "Exporting..." spinner seems to quietly go away immediately when the Zed process is killed. The partial results file is left behind. Like we saw with Windows, interacting further with the app does produce a "The service could not be reached" error.

https://user-images.githubusercontent.com/5934157/186001665-9c6e9185-e73e-4d46-917b-425efc74cbf5.mp4

On Linux, the symptom was much the same as just shown in this macOS video.

In conclusion, it does seem like we could still benefit from some explicit error messaging so the user knows as soon as the "Exporting..." spinner goes away that Zed didn't finish the operation cleanly and hence they should not trust the partial results that were left behind.
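As a rough illustration of the explicit messaging suggested above, the sketch below turns the outcome of an export into a user-facing message that calls out the partial file. It is not Brim's actual export code: `ExportResult`, `runExport`, `describeExport`, and the message wording are all made up for illustration, and it assumes the underlying export call rejects (or can be made to reject) when Zed dies mid-operation.

```typescript
// Hypothetical sketch only; these types and functions are not part of Brim.
type ExportResult =
  | { kind: "complete"; path: string }
  | { kind: "failed"; path: string; cause: string };

// Runs an export and converts a mid-stream failure into an explicit result
// instead of letting the spinner disappear silently.
// Assumption: doExport rejects when the backend goes away.
async function runExport(
  path: string,
  doExport: () => Promise<void>,
): Promise<ExportResult> {
  try {
    await doExport();
    return { kind: "complete", path };
  } catch (err) {
    const cause = err instanceof Error ? err.message : String(err);
    return { kind: "failed", path, cause };
  }
}

// Builds the message shown when the "Exporting..." spinner goes away.
function describeExport(result: ExportResult): string {
  if (result.kind === "complete") {
    return `Export completed: ${result.path}`;
  }
  return (
    `Export failed (${result.cause}). ` +
    `The file at ${result.path} is incomplete and should not be trusted.`
  );
}
```

Whether the partial file should also be deleted automatically is a separate design question; this sketch only makes sure the user is told about it.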

Aug 22 '22 19:08 philrz