zed icon indicating copy to clipboard operation
zed copied to clipboard

Abrupt power loss causes 0-length lake.zng file, preventing "zed serve" startup

Open philrz opened this issue 3 years ago • 0 comments

Repro is with Zed commit 1c906e4. With the steps described here, the issue reproduces about 50% of the time.

In the video below, we see a zed serve started from an empty lake directory, which creates a 53 byte lake.zng file. I then abruptly kill the power to my Linux VM. When the system boots back up, the lake.zng file is found to now be at 0 bytes. At that point zed serve refuses to start again on that lake directory.

https://user-images.githubusercontent.com/5934157/189450975-3a2cc0d3-e9f6-4a2c-b3f4-d5c58372d788.mp4

I respect that it may be infeasible for us to protect ourselves from every kind of filesystem disruption that might come up in scenarios such as power loss. But I know we also pride ourselves on leaving the lake in a consistent state, so maybe there's something we could improve here.

That's the simplest possible repro recipe, but here's the wider context in which I unintentionally ran into this problem. I often do Zui app testing in a Linux VM in VirtualBox like is shown here. I've noticed a problem on my Mac laptop where if I yank any physical USB devices while such a VM is running, VirtualBox abruptly crashes and leaves the VM in an "Aborted" state. When this happened recently and I booted the VM back up again, it was in this state with the 0-length lake.zng file. In that case because Zui was not able to start the Zed service, it came up in this state:

image

Unfortunately, even dropping to the shell and looking in Zui's logs/ directory did not show the same message I saw when I ran zed serve at the shell as shown in the video. The main.log looks like:

[2022-09-09 15:03:19.786] [info] Booting main with: {
  lakePort: 9867,
  lakeRoot: '/home/phil/.config/Zui/lake',
  lakeLogs: '/home/phil/.config/Zui/logs',
  lake: true,
  devtools: false,
  appState: '/home/phil/.config/Zui/appState.json',
  releaseNotes: true,
  autoUpdater: true,
  singleInstance: true
}
[2022-09-09 15:03:19.833] [info] Migrations pending: 0
[2022-09-09 15:03:19.871] [debug] Initializing: auto-update.js
[2022-09-09 15:03:19.872] [debug] Initializing: custom-protocol.js
[2022-09-09 15:03:19.883] [debug] Initializing: log-filters.js
[2022-09-09 15:03:19.884] [debug] Initializing: secure-web-contents.js
[2022-09-09 15:03:19.890] [debug] Initializing: shortcuts.js
[2022-09-09 15:03:19.893] [debug] Initializing: user-tasks.js
[2022-09-09 15:03:19.894] [debug] Initializing: window-events.js
[2022-09-09 15:03:19.894] [info] Initializers loaded
[2022-09-09 15:03:19.896] [debug] IPC Listening: windows.autosave
[2022-09-09 15:03:19.896] [debug] IPC Listening: closeWindow
[2022-09-09 15:03:19.897] [debug] IPC Listening: featureFlags
[2022-09-09 15:03:19.897] [debug] IPC Listening: getAppMeta
[2022-09-09 15:03:19.898] [debug] IPC Listening: getGlobalState
[2022-09-09 15:03:19.898] [debug] IPC Listening: getWindowState
[2022-09-09 15:03:19.898] [debug] IPC Listening: dispatchGlobalFromWindow
[2022-09-09 15:03:19.898] [debug] IPC Listening: dispatchFromMain
[2022-09-09 15:03:19.900] [debug] IPC Listening: importQueries
[2022-09-09 15:03:19.900] [debug] IPC Listening: mainArgs
[2022-09-09 15:03:19.901] [debug] IPC Listening: moveToCurrentDisplay
[2022-09-09 15:03:19.901] [debug] IPC Listening: openAboutWindow
[2022-09-09 15:03:19.901] [debug] IPC Listening: detailWindow.open
[2022-09-09 15:03:19.902] [debug] IPC Listening: detailWindow.setup
[2022-09-09 15:03:19.902] [debug] IPC Listening: openDirectory
[2022-09-09 15:03:19.902] [debug] IPC Listening: openSearchWindow
[2022-09-09 15:03:19.903] [debug] IPC Listening: resetState
[2022-09-09 15:03:19.903] [debug] IPC Listening: runCommand
[2022-09-09 15:03:19.904] [debug] IPC Listening: setSecretOp
[2022-09-09 15:03:19.904] [debug] IPC Listening: getSecretOp
[2022-09-09 15:03:19.904] [debug] IPC Listening: deleteSecretOp
[2022-09-09 15:03:19.904] [debug] IPC Listening: showPreferencesOp
[2022-09-09 15:03:19.905] [debug] IPC Listening: showReleaseNotes
[2022-09-09 15:03:19.905] [debug] IPC Listening: showSaveDialogOp
[2022-09-09 15:03:19.906] [debug] IPC Listening: updateSearchAppMenu
[2022-09-09 15:03:21.201] [debug] IPC Handling: getWindowState
[2022-09-09 15:03:21.202] [debug] IPC Handling: getGlobalState
[2022-09-09 15:03:21.224] [debug] IPC Handling: dispatchGlobalFromWindow
[2022-09-09 15:03:21.237] [debug] IPC Handling: dispatchGlobalFromWindow
[2022-09-09 15:03:21.238] [debug] IPC Handling: featureFlags
[2022-09-09 15:03:21.256] [debug] IPC Handling: mainArgs
[2022-09-09 15:03:21.261] [debug] IPC Handling: getAppMeta
[2022-09-09 15:03:21.265] [debug] IPC Handling: dispatchGlobalFromWindow
[2022-09-09 15:03:21.288] [debug] IPC Handling: updateSearchAppMenu
[2022-09-09 15:03:21.461] [debug] IPC Handling: windows.autosave
[2022-09-09 15:03:21.462] [debug] Session Autosaved
[2022-09-09 15:03:21.933] [debug] IPC Handling: getWindowState
[2022-09-09 15:03:21.933] [debug] IPC Handling: getGlobalState
[2022-09-09 15:03:21.952] [debug] IPC Handling: dispatchGlobalFromWindow
[2022-09-09 15:03:21.960] [debug] IPC Handling: dispatchGlobalFromWindow
[2022-09-09 15:03:21.961] [debug] IPC Handling: featureFlags
[2022-09-09 15:03:21.963] [debug] IPC Handling: mainArgs
[2022-09-09 15:03:21.964] [debug] IPC Handling: getAppMeta
[2022-09-09 15:03:21.973] [debug] IPC Handling: dispatchGlobalFromWindow
[2022-09-09 15:03:21.993] [debug] IPC Handling: windows.autosave
[2022-09-09 15:03:21.993] [debug] Session Autosaved
[2022-09-09 15:03:22.270] [debug] IPC Handling: windows.autosave
[2022-09-09 15:03:22.496] [debug] Session Autosaved
[2022-09-09 15:03:22.987] [debug] IPC Handling: windows.autosave
[2022-09-09 15:03:22.987] [debug] Session Autosaved

And the zlake.log looked like:

{"level":"info","ts":1662760999.9224832,"msg":"Open files limit raised","limit":1048576}
{"level":"info","ts":1662760999.9225676,"msg":"Listening to brim process pipe","fd":"brimfd"}

So there's room for improvement here with how Zui handles this kind of catastrophic zed serve failure, since when bugs like this show up in the field, the logs are practically our last line of defense. I'm now reminded that this is a problem we've bumped into before as described at https://github.com/brimdata/brim/issues/2447#issuecomment-1189520229.

philrz avatar Sep 09 '22 22:09 philrz