Better "zed serve" process management
A few issues regarding zed serve process management:
- Brim ignores unexpected `zed serve` shutdowns. It would be nice for `zed serve` to have some kind of automatic restart with growing retry intervals (similar to the behavior of Kubernetes pods). If Brim is unable to restart `zed serve`, the client should be alerted with some kind of diagnostic information to provide support (maybe a link to generate a GitHub issue).
- If Brim experiences an ungraceful shutdown (e.g. `kill -9 brim`), `zed serve` does not also receive the signal and shut down. If we take care of the above problem, this becomes an issue because on subsequent restarts of Brim, `zed serve` would be unable to bind to `:9867` because the port would be in use by the orphaned `zed serve` process.
- We also need to consider the case of a host having another service listening on `:9867`. Perhaps this is another version of `zed serve` that a user is purposefully running; the application should notify the user of this case and confirm if they want to continue running against an unmanaged `zed serve` process. If it is an unknown service, the application should prompt the user to terminate the service bound to this port.
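As a rough illustration of the "growing retry intervals" idea in the first item, here is a minimal Python sketch. The names `backoff_delays` and `restart_with_backoff`, the `start` callable (assumed to launch `zed serve` and return `True` on success), and the default base, factor, and cap values are all hypothetical and not taken from Brim.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield delays in seconds: base, base*factor, ... never exceeding cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def restart_with_backoff(start, max_attempts=5, sleep=time.sleep):
    """Call start() until it succeeds or max_attempts is reached.

    start is a hypothetical callable that launches `zed serve` and
    returns True on success. Returns (succeeded, attempts, delays_slept).
    """
    slept = []
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        if start():
            return True, attempt, slept
        if attempt < max_attempts:
            # Wait longer after each failure before trying again.
            delay = next(delays)
            sleep(delay)
            slept.append(delay)
    # Caller would surface diagnostics to the user at this point.
    return False, max_attempts, slept
```

When `restart_with_backoff` returns `False`, the caller is where the diagnostic alert described above would be raised.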
Here are some specific examples of what could happen to a user in each of the above scenarios.
- If I start Brim GA tagged `v0.12.0` and then `kill -9` the `zqd` process, then try to import a pcap into the app, I get the obscure/generic "Failed to fetch" error.

- Let's say I've been running Brim GA tagged `v0.10.0` and then Brim "experiences an ungraceful shutdown", simulated in this case by doing a `kill -9` on the Brim process. Now let's say the user stops using the app for some time, unaware the orphaned `zqd` process is left behind. Perhaps they install the more recent GA version tagged `v0.12.0` before they next sit down to use the app. When they try to import data into the app, now they may experience problems due to incompatibilities between the API the current `zqd` expects and the old one that's actually running, such as:

- Let's say I happen to have a non-`zqd` process already listening on `9867`, such as with this Python:
```
$ python3
Python 3.7.6 (default, Feb 27 2020, 06:03:07)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> serversocket.bind((socket.gethostname(),9867))
>>> serversocket.listen(1)
```
Now when I start my Brim app, I'm greeted with a blank white screen that hangs indefinitely:
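One way the app could notice this situation before launching its own backend is to probe the port first. The sketch below is only an illustration: `port_in_use` is a hypothetical helper, and probing `127.0.0.1` is an assumption (a listener bound to a different interface, as in the session above, would need that host passed in instead).

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0
```

If `port_in_use(9867)` is true at startup, the app could prompt the user instead of showing a blank screen. Telling an older `zqd` apart from an unrelated service would need a further check that is not sketched here.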

Here's a sketch of a heartbeating mechanism that would be useful:
- add a zqd cli flag that takes a random string & a time, `-killcord 2s,eaf3232...`
- when present, zqd accepts POST requests to `/killcord` whose body should contain the same random string as given on the cli.
- if zqd doesn't receive a post to that route every N seconds (as defined by the cli option), it logs a fatal message and stops.
- if zqd receives a post with a different random string, it logs a fatal message & stops.

Brim will then launch zqd with the new option and, at intervals shorter than N seconds, send a killcord post with the current ident. If the killcord post fails, Brim will re-gen a new random ident & launch a new zqd instance.
This covers these cases:
- Brim is killed, but its zqd process is not reaped: the zqd process will stop itself since it won't receive a killcord post.
- Brim is restarted, but for some reason a pre-existing zqd instance is still up: the zqd process will stop itself because it will get a killcord post with the wrong ident.
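To make the two stop conditions concrete, here is a minimal Python sketch of the server-side bookkeeping only. The `Killcord` class, its method names, and the reason strings are hypothetical; the real zqd would wire equivalent logic into its HTTP handler and a timer, and would log a fatal message and exit wherever this sketch returns a reason.

```python
import time


class Killcord:
    """Tracks killcord posts for one zqd process (illustrative only)."""

    def __init__(self, ident, interval, clock=time.monotonic):
        self.ident = ident          # random string given on the cli
        self.interval = interval    # N seconds, also from the cli
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()

    def post(self, body):
        """Handle a POST to /killcord; return a fatal reason or None."""
        if body != self.ident:
            # A different Brim instance is talking to this zqd.
            return "killcord ident mismatch"
        self.last_seen = self.clock()
        return None

    def check(self):
        """Called periodically; return a fatal reason or None."""
        if self.clock() - self.last_seen > self.interval:
            # No valid post within N seconds: Brim is presumed gone.
            return "killcord timeout"
        return None
```

The first case above corresponds to `check()` returning a reason, and the second to `post()` returning one.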