fluvio icon indicating copy to clipboard operation
fluvio copied to clipboard

Make local cluster start idempotent

Open tjtelan opened this issue 4 years ago • 8 comments

If we try to start a cluster twice, we don't always get useful error messages.

$ fluvio cluster start --local
Performing pre-flight checks
✅ ok: Supported helm version is installed
✅ ok: Supported kubernetes version is installed
✅ ok: Kubernetes config is loadable
✅ ok: Fluvio system charts are installed
Successfully installed Fluvio!

$ fluvio cluster start --local
Performing pre-flight checks
✅ ok: Supported helm version is installed
✅ ok: Supported kubernetes version is installed
✅ ok: Kubernetes config is loadable
✅ ok: Fluvio system charts are installed
Error:
   0: Fluvio cluster error
   1: Fluvio cluster error
   2: Failed to install Fluvio locally
   3: Kubernetes client error
   4: client error: 409 Conflict

This error manifests during development often. We might forget that we have a local cluster started. Or we used to have a cluster started locally, but we've killed the process.

In my opinion, there are 2 different levels to address this.

  1. Behave similar to fluvio cluster start (Kubernetes). If a cluster appears to be running, we just say so. Otherwise, we suggest running fluvio cluster delete --local to clean up metadata.

ex. Successful kubernetes startup idempotency

$ fluvio cluster start
✅ ok: Kubernetes config is loadable
✅ ok: Supported helm version is installed
✅ ok: Load balancer is up
✅ ok: Fluvio system charts are installed
💙 note: Fluvio is already running
Successfully installed Fluvio!
  1. We attempt to clean up however we can, and try to start a local cluster.

In my mind, this is effectively running fluvio cluster delete --local && fluvio cluster start --local on behalf of the user. We can inform them while we are cleaning up.

We could have option 1 be the default behavior, and provide option 2 behind a --force flag. The goal in mind is a scriptable path to starting a local cluster w/o needing to handle errors.

tjtelan avatar Apr 28 '21 01:04 tjtelan

One possible solution to this is to make the SC create a pid lockfile when it starts up, something like /usr/local/var/fluvio/sc.pid, which contains the process id of the currently-running SC in it. Then before starting up, the cluster installer can check for a lockfile and for a process with the same pid. If the lockfile or the process do not exist, then we can preemptively clean out any old metadata from the store so that we do not encounter the 409 conflict.

nicholastmosher avatar Apr 28 '21 13:04 nicholastmosher

This needs to work across different OS and cluster types (local,k8). We certainly need fluvio cluster status API/Command similar to: https://minikube.sigs.k8s.io/docs/commands/status/.

Key points

  • User may have multiple clusters which may or may not be running locally
  • They need to switch between different clusters.

Note that this is admin management of clusters different from users's perspective of clusters which they are just consumer/producer.

sehz avatar Apr 28 '21 16:04 sehz

Stale issue message

github-actions[bot] avatar Aug 30 '21 11:08 github-actions[bot]

Stale issue message

github-actions[bot] avatar Nov 21 '21 11:11 github-actions[bot]

Stale issue message

github-actions[bot] avatar Jan 23 '22 11:01 github-actions[bot]

@sehz Since #2044 added a check for a running local cluster, is a status API still necessary to resolve this? Or is it sufficient to simply make the nonrecoverable AlreadyInstalled and ExistingLocalCluster errors (in check/mod.rs) cause the CLI to exit with a 0 exit code (rather than an error code) and a message stating that the cluster is already running?

BatmanAoD avatar Aug 28 '22 19:08 BatmanAoD

Yes, it's OK to return a non-recoverable error.

sehz avatar Aug 28 '22 20:08 sehz

@sehz Sorry, I think I wasn't clear; those are errors that already exist in the code, and are considered "nonrecoverable" (they're part of the UnrecoverableCheckStatus enum). My question is whether changing the behavior when those are encountered, so that the process exits with a "success" status (0), would be sufficient to satisfy option 1 for this issue: "If a cluster appears to be running, we just say so."

BatmanAoD avatar Aug 28 '22 21:08 BatmanAoD