scylla-operator
Scylla container doesn't go into Error state when any of the Scylla DB setup steps fails.
Describe the bug In another bug [1], a situation was observed where the "scylla_io_setup" step failed but the "scylla" container stayed in the "Running" state. Then the "scylla-manager-agent" container in the same pod failed to start working because it could not reach the Scylla API. This failure mode is not obvious, so the "scylla" container should explicitly go into the Error state when the Scylla DB setup fails for any reason.
[1] https://github.com/scylladb/scylla-operator/issues/454
Example of a failure in "scylla" container:
running: (['/opt/scylladb/scripts/scylla_dev_mode_setup', '--developer-mode', '0'],)
running: (['/opt/scylladb/scripts/scylla_cpuset_setup', '--cpuset', '0-7'],)
running: (['/opt/scylladb/scripts/scylla_io_setup'],)
Problem when parsing disks from OS:
found more than one disk mounted at root
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla_io_setup", line 239, in <module>
if idata.is_recommended_instance():
File "/opt/scylladb/scripts/scylla_util.py", line 311, in is_recommended_instance
diskSize = self.firstNvmeSize
File "/opt/scylladb/scripts/scylla_util.py", line 290, in firstNvmeSize
ephemeral_disks = self.getEphemeralOsDisks()
File "/opt/scylladb/scripts/scylla_util.py", line 177, in getEphemeralOsDisks
return self.os_disks[self.EPHEMERAL]
File "/opt/scylladb/scripts/scylla_util.py", line 169, in os_disks
nvmes_present = self._non_root_nvmes()
File "/opt/scylladb/scripts/scylla_util.py", line 155, in _non_root_nvmes
raise Exception("found more than one disk mounted at root ".format(root_dev_candidates))
Exception: found more than one disk mounted at root
failed!
Traceback (most recent call last):
File "/docker-entrypoint.py", line 27, in <module>
setup.io()
File "/scyllasetup.py", line 67, in io
self._run(['/opt/scylladb/scripts/scylla_io_setup'])
File "/scyllasetup.py", line 37, in _run
subprocess.check_call(*args, **kwargs)
File "/opt/scylladb/python3/lib64/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/scylladb/scripts/scylla_io_setup']' returned non-zero exit status 1.
To Reproduce Steps to reproduce the behavior:
- Install scylla-operator
- Install scylla-4.3.0
- See that "scylla-manager-agent" fails in the first "scylla" pod, while the "scylla" container doesn't fail even though its logs show real failures.
Expected behavior The "scylla" container is in the "Error" state, and the "scylla-manager-agent" container doesn't try to run while the "scylla" container isn't working.
Environment:
- Platform: GKE
- Kubernetes version: 1.17.15-gke.800
- Scylla version: 4.3.0
- Scylla-operator version: e.g.: v1.1.0-rc.2-1-g9b93ca0
We just use the container image that Scylla produces. It likely needs an adjustment in supervisord so that it does not restart the process and leaves that to the container runtime and the restart policy. It might be best to file an isolated issue there: https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/Dockerfile#L17
Until we decide to build a real container without supervisord and use sidecars, I don't see much we can do on the operator side.
@tnozicka According to the entrypoint Python file [1] used in the "scylla" image you referred to, the "supervisord" start step has not been reached yet when the failure appears; see [2].
I guess the problem is in the "except Exception" block [3], where any Python exception is just suppressed and the script always exits with code 0.
[1] https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/Dockerfile#L50
[2] https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/docker-entrypoint.py#L27
[3] https://github.com/scylladb/scylla/blob/2c8dcbe5c5a85bbf9cb591520e959a222a292eab/dist/docker/redhat/docker-entrypoint.py#L33
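To make the suspected failure mode concrete, here is a minimal sketch of the difference between swallowing a setup exception and turning it into a non-zero exit code. It is not the actual docker-entrypoint.py; the function names and `FAILING_STEP` are hypothetical stand-ins, and the only thing taken from the logs above is that setup steps are run through `subprocess.check_call`.

```python
import logging
import subprocess
import sys


def run_steps_swallowing(steps):
    """Pattern suspected above: the failure is logged, but the caller still sees success."""
    try:
        for cmd in steps:
            subprocess.check_call(cmd)  # raises CalledProcessError on non-zero exit
    except Exception:
        logging.exception("setup failed")
    return 0  # reached even after a failure, so the process would exit with 0


def run_steps_propagating(steps):
    """Possible alternative: a failed step becomes a non-zero exit code."""
    try:
        for cmd in steps:
            subprocess.check_call(cmd)
    except Exception:
        logging.exception("setup failed")
        return 1
    return 0


# Hypothetical failing step that stands in for scylla_io_setup.
FAILING_STEP = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    # Exiting non-zero lets the container runtime report the failure
    # instead of leaving the container in "Running".
    sys.exit(run_steps_propagating([FAILING_STEP]))
```

With the second variant the entrypoint process would terminate with a failure status, and the restart policy of the pod would decide what happens next.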
Good catch, I wasn't looking that deep, my point was that it's out of our control. Wanna file a bug for github.com/scylladb/scylla with your findings?
@tnozicka https://github.com/scylladb/scylla/issues/8290
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 30d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out
/lifecycle stale
tracked in https://github.com/scylladb/scylladb/issues/8290