Better "zed serve" process management

Open philrz opened this issue 6 years ago • 2 comments

A few issues regarding zed serve process management:

  1. Brim ignores unexpected zed serve shutdowns. It would be nice for zed serve to have some kind of automatic restart with growing retry intervals (similar to the behavior of Kubernetes pods). If Brim is unable to restart zed serve, the user should be alerted with some kind of diagnostic information to pass along to support (maybe a link to generate a GitHub issue).

  2. If Brim experiences an ungraceful shutdown (e.g. kill -9 brim), zed serve does not also receive the signal and shut down. If we take care of the above problem, this becomes an issue because on subsequent restarts of Brim, zed serve would be unable to bind to :9867, since the port would still be in use by the orphaned zed serve process.

  3. We also need to consider the case of a host having another service listening on :9867. Perhaps this is another version of zed serve that a user is purposefully running. The application should notify the user of this case and confirm whether they want to continue running against an unmanaged zed serve process. If it is an unknown service, the application should prompt the user to terminate the service bound to this port.
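The "growing retry intervals" idea in item 1 could be sketched roughly as follows. This is a minimal illustration, not how Brim actually manages the process: `backoff_delays`, `restart_with_backoff`, and the `start` callback are hypothetical names, and the base, factor, cap, and attempt count are arbitrary assumptions.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield retry delays that grow geometrically until they reach `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def restart_with_backoff(start: Callable[[], bool], max_attempts: int = 5,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `start` until it reports success or attempts run out.

    `start` is a hypothetical callback that launches the server process and
    returns True once it is up. A False return value from this function is
    the point where the app would alert the user with diagnostics.
    """
    delays = backoff_delays()
    for attempt in range(max_attempts):
        if start():
            return True
        # Wait before the next attempt, but not after the final failure.
        if attempt < max_attempts - 1:
            sleep(next(delays))
    return False
```

Passing `sleep` as a parameter keeps the retry policy testable without real waiting.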

philrz avatar Apr 01 '20 18:04 philrz

Here are some specific examples of what could happen to a user in each of the above scenarios.

  1. If I start Brim GA tagged v0.12.0 and then kill -9 the zqd process, then try to import a pcap into the app, I get the obscure/generic "Failed to fetch" error.

image

  2. Let's say I've been running Brim GA tagged v0.10.0 and then Brim "experiences an ungraceful shutdown", simulated in this case by doing a kill -9 on the Brim process. Now let's say the user stops using the app for some time, unaware the orphaned zqd process is left behind. Perhaps they install the more recent GA version tagged v0.12.0 before they next sit down to use the app. When they try to import data into the app, now they may experience problems due to incompatibilities between the API the current zqd expects and the old one that's actually running, such as:

image

  3. Let's say I happen to have a non-zqd process already listening on 9867, such as with this Python:
$ python3
Python 3.7.6 (default, Feb 27 2020, 06:03:07) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> serversocket.bind((socket.gethostname(),9867))
>>> serversocket.listen(1)

Now when I start my Brim app, I'm greeted with a blank white screen that hangs indefinitely:

image
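One way the app could avoid this hang is to probe the port before launching its own server. The sketch below only detects that something accepts TCP connections on the port; telling a zqd instance apart from an unknown service would need an additional protocol-level check that is not shown here. `port_in_use` is a hypothetical helper, and the default host is an assumption (the listener in the example above binds to `socket.gethostname()`, which may resolve to a different address than 127.0.0.1).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1",
                timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0
```

A launcher could call `port_in_use(9867)` first and show the user a prompt instead of starting a server that cannot bind.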

philrz avatar Jun 29 '20 21:06 philrz

Here's a sketch of a heartbeating mechanism that would be useful:

  • add a zqd cli flag that takes a random string & a time, -killcord 2s,eaf3232...
  • when present, zqd accepts POST requests to /killcord whose body should contain the same random string as given on the cli.
  • if zqd doesn't receive a post to that route every N seconds (as defined by the cli option), it logs a fatal message and stops.
  • if zqd receives a post with a different random string, it logs a fatal message & stops.
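The server-side rules in the list above can be sketched as a small state tracker. This is an illustration in Python only (zqd itself is not written in Python); the `Killcord` class, `parse_killcord_flag`, and the reason strings are hypothetical, and the parser assumes the flag value looks like `<seconds>s,<ident>`.

```python
import time
from typing import Callable, Optional, Tuple


def parse_killcord_flag(value: str) -> Tuple[float, str]:
    """Split a '<seconds>s,<ident>' flag value into (interval, ident)."""
    interval_text, ident = value.split(",", 1)
    if not interval_text.endswith("s"):
        raise ValueError("interval must end in 's'")
    return float(interval_text[:-1]), ident


class Killcord:
    """Tracks heartbeats; the hosting server stops once check() is True."""

    def __init__(self, interval: float, ident: str,
                 clock: Callable[[], float] = time.monotonic):
        self.interval = interval
        self.ident = ident
        self._clock = clock
        self._last_beat = clock()
        self.fatal_reason: Optional[str] = None

    def post(self, body: str) -> bool:
        """Handle the body of a POST to /killcord; True means accepted."""
        if self.fatal_reason is not None:
            return False
        if body != self.ident:
            # A different ident means another client owns this port now.
            self.fatal_reason = "killcord ident mismatch"
            return False
        self._last_beat = self._clock()
        return True

    def check(self) -> bool:
        """Return True if the server should log a fatal message and stop."""
        overdue = self._clock() - self._last_beat > self.interval
        if self.fatal_reason is None and overdue:
            self.fatal_reason = "killcord heartbeat timed out"
        return self.fatal_reason is not None
```

The server would call `check()` periodically (for example from a timer) and exit when it returns True. Injecting `clock` makes the timeout logic testable without sleeping.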

Brim will then launch zqd with the new option and, at intervals shorter than N seconds, send a killcord POST with the current ident. If the killcord POST fails, Brim will generate a new random ident and launch a new zqd instance.

This covers these cases:

  • Brim is killed, but its zqd process is not reaped: the zqd process will stop itself since it won't receive a killcord post.
  • Brim is restarted, but for some reason a pre-existing zqd instance is still up: the zqd process will stop itself; it will get a killcord post with the wrong ident.
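The Brim side of the scheme could look roughly like the step function below, shown in Python purely for illustration. `heartbeat_once`, `send`, and `launch` are hypothetical: `send` stands in for the killcord POST and `launch` for starting a new zqd with the given ident. The ident length is an arbitrary assumption.

```python
import secrets
from typing import Callable


def heartbeat_once(send: Callable[[str], bool],
                   launch: Callable[[str], None],
                   ident: str) -> str:
    """Send one heartbeat and return the ident to use for the next one.

    If the heartbeat fails (False return or a connection error), generate
    a fresh random ident and launch a new server instance with it.
    """
    try:
        ok = send(ident)
    except OSError:
        ok = False
    if ok:
        return ident
    new_ident = secrets.token_hex(16)
    launch(new_ident)
    return new_ident
```

The caller would invoke this on a timer whose period is shorter than the interval given to the server, carrying the returned ident forward each time.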

alfred-landrum avatar Jul 06 '20 16:07 alfred-landrum