scylla-operator Scylla container doesn't go into Error state when any of Scylla DB setup steps fails.

Describe the bug In another bug [1], the "scylla_io_setup" step failed but the "scylla" container stayed in the "Running" state. The "scylla-manager-agent" container in the same pod then failed to start working because it could not reach the Scylla API. This failure mode is not obvious: the "scylla" container should visibly go into the Error state when Scylla DB setup fails for any reason.

[1] https://github.com/scylladb/scylla-operator/issues/454

Example of a failure in "scylla" container:


running: (['/opt/scylladb/scripts/scylla_dev_mode_setup', '--developer-mode', '0'],)
running: (['/opt/scylladb/scripts/scylla_cpuset_setup', '--cpuset', '0-7'],)
running: (['/opt/scylladb/scripts/scylla_io_setup'],)
Problem when parsing disks from OS:
found more than one disk mounted at root 
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in <module>
    if idata.is_recommended_instance():
  File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
    diskSize = self.firstNvmeSize
  File "/opt/scylladb/scripts/scylla_util.py", line 290, in firstNvmeSize
    ephemeral_disks = self.getEphemeralOsDisks()
  File "/opt/scylladb/scripts/scylla_util.py", line 177, in getEphemeralOsDisks
    return self.os_disks[self.EPHEMERAL]
  File "/opt/scylladb/scripts/scylla_util.py", line 169, in os_disks
    nvmes_present = self._non_root_nvmes()
  File "/opt/scylladb/scripts/scylla_util.py", line 155, in _non_root_nvmes
    raise Exception("found more than one disk mounted at root ".format(root_dev_candidates))
Exception: found more than one disk mounted at root 
failed!
Traceback (most recent call last):
  File "/docker-entrypoint.py", line 27, in <module>
    setup.io()
  File "/scyllasetup.py", line 67, in io
    self._run(['/opt/scylladb/scripts/scylla_io_setup'])
  File "/scyllasetup.py", line 37, in _run
    subprocess.check_call(*args, **kwargs)
  File "/opt/scylladb/python3/lib64/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/scylladb/scripts/scylla_io_setup']' returned non-zero exit status 1.

To Reproduce Steps to reproduce the behavior:

Install scylla-operator
Install scylla-4.3.0
See that "scylla-manager-agent" fails in the first "scylla" pod, while the "scylla" container does not fail despite the real failures shown in its logs.

Expected behavior The "scylla" container goes into the "Error" state, and the "scylla-manager-agent" container does not try to run while the "scylla" container is not working.
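One way to check the actual container states against this expectation is to inspect the pod object returned by kubectl get pod with JSON output. The sketch below is only an illustration: the helper name container_states and the sample pod data are hypothetical, and only the status.containerStatuses layout comes from the Kubernetes pod status schema.

```python
def container_states(pod):
    """Map container name -> (state kind, exit code or None) for a pod object."""
    states = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        # "state" holds one of the keys: running, waiting, terminated
        for kind, detail in status.get("state", {}).items():
            states[status["name"]] = (kind, detail.get("exitCode"))
    return states


# Hypothetical sample mirroring the situation in this report: "scylla" keeps
# running although its setup failed, while the agent container has terminated.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "scylla", "state": {"running": {"startedAt": "2021-03-10T10:00:00Z"}}},
            {"name": "scylla-manager-agent", "state": {"terminated": {"exitCode": 1, "reason": "Error"}}},
        ]
    }
}
```

With the expected behavior in place, the "scylla" entry would report a terminated state with a non-zero exit code instead of "running".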

Environment:

Platform: GKE
Kubernetes version: 1.17.15-gke.800
Scylla version: 4.3.0
Scylla-operator version: e.g.: v1.1.0-rc.2-1-g9b93ca0

Mar 10 '21 10:03 vponomaryov

We just use the container image that Scylla produces. It likely needs an adjustment in supervisord so that it does not restart the process and leaves that to the container runtime and the restart policy. It might be best to file an isolated case there: https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/Dockerfile#L17

Until we decide to build a real container without supervisord and use sidecars, I don't see much we can do on the operator side.

Mar 15 '21 14:03 tnozicka

@tnozicka According to the entrypoint Python file [1] used in the "scylla" image you referred to, the "supervisord" start step has not been reached yet when the failure appears; see [2].

I guess the problem is in the "except Exception" block [3], where any Python exception is simply suppressed and the script always exits with code "0".

[1] https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/Dockerfile#L50
[2] https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/docker-entrypoint.py#L27
[3] https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/docker-entrypoint.py#L33
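To illustrate the point about the exit code, here is a minimal sketch of an entrypoint-style helper that reports a setup failure through a non-zero return value instead of swallowing the exception. It is not the actual docker-entrypoint.py code; the function name run_setup_steps and the step list are hypothetical.

```python
import subprocess
import sys


def run_setup_steps(steps):
    """Run each setup command in order and return a process exit code."""
    try:
        for cmd in steps:
            # check_call raises CalledProcessError on a non-zero exit status
            subprocess.check_call(cmd)
    except Exception as exc:
        # Log the failure and propagate it as a non-zero code, so the
        # container runtime can see that the container has failed.
        print(f"setup failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

An entrypoint built this way would end with sys.exit(run_setup_steps(steps)), so a failing step such as "scylla_io_setup" would terminate the container with exit status 1 rather than 0.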

Mar 16 '21 11:03 vponomaryov

Good catch, I wasn't looking that deep, my point was that it's out of our control. Wanna file a bug for github.com/scylladb/scylla with your findings?

Mar 16 '21 13:03 tnozicka

@tnozicka https://github.com/scylladb/scylla/issues/8290

Mar 16 '21 15:03 vponomaryov

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 30d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out

/lifecycle stale

Jun 26 '24 10:06 scylla-operator-bot[bot]

tracked in https://github.com/scylladb/scylladb/issues/8290

Jun 26 '24 15:06 tnozicka
