
Can't start Scylla with the default Helm chart because the default volume size is too small

Open gecube opened this issue 3 years ago • 13 comments

Hello!

I followed the instructions at https://operator.docs.scylladb.com/stable/helm.html but could not get a running Scylla cluster. It looks like the default PV size is 10Gi:

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  annotations:
    meta.helm.sh/release-name: scylla-scylla
    meta.helm.sh/release-namespace: scylla
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: scylla
    helm.toolkit.fluxcd.io/namespace: flux-system
  name: scylla-scylla
  namespace: scylla
spec:
  agentRepository: scylladb/scylla-manager-agent
  agentVersion: 2.5.2
  datacenter:
    name: us-east-1
    racks:
    - agentResources:
        requests:
          cpu: 50m
          memory: 10M
      members: 3
      name: us-east-1a
      resources:
        limits:
          cpu: 1
          memory: 4Gi
        requests:
          cpu: 1
          memory: 4Gi
      scyllaAgentConfig: scylla-agent-config
      scyllaConfig: scylla-config
      storage:
        capacity: 10Gi
  repository: scylladb/scylla
  version: 4.5.1

With this capacity, the pod fails with the following error:

I1230 14:38:26.581342       1 operator/sidecar.go:158] sidecar version "v1.6.0-7-gac9d88f"
I1230 14:38:26.581437       1 flag/flags.go:59] FLAG: --burst="5"
I1230 14:38:26.581445       1 flag/flags.go:59] FLAG: --cpu-count="1"
I1230 14:38:26.581448       1 flag/flags.go:59] FLAG: --help="false"
I1230 14:38:26.581452       1 flag/flags.go:59] FLAG: --kubeconfig=""
I1230 14:38:26.581456       1 flag/flags.go:59] FLAG: --loglevel="2"
I1230 14:38:26.581461       1 flag/flags.go:59] FLAG: --namespace="scylla"
I1230 14:38:26.581464       1 flag/flags.go:59] FLAG: --qps="2"
I1230 14:38:26.581469       1 flag/flags.go:59] FLAG: --secret-name="scylla-scylla-auth-token"
I1230 14:38:26.581472       1 flag/flags.go:59] FLAG: --service-name="scylla-scylla-us-east-1-us-east-1a-0"
I1230 14:38:26.581475       1 flag/flags.go:59] FLAG: --v="2"
I1230 14:38:26.581847       1 operator/sidecar.go:218] "Waiting for single service informer caches to sync"
I1230 14:38:26.682470       1 operator/sidecar.go:235] "Waiting for Service" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
I1230 14:38:26.686835       1 operator/sidecar.go:269] "Waiting for Pod To have scylla ContainerID set" Pod="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:38:26.691850       1 cache/reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: unknown (get pods)
E1230 14:38:28.203022       1 cache/reflector.go:138] k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.Pod: unknown (get pods)
I1230 14:38:28.203142       1 operator/sidecar.go:323] "Waiting for NodeConfig's data ConfigMap " Selector="scylla-operator.scylladb.com/config-map-type=NodeConfigData,scylla-operator.scylladb.com/owner-uid=5488e48b-c678-4766-ad3b-37e2126c22a2"
I1230 14:38:28.208418       1 operator/sidecar.go:385] "Starting scylla"
I1230 14:38:28.208433       1 config/config.go:64] Setting up scylla.yaml
I1230 14:38:28.208578       1 config/config.go:96] "no scylla.yaml config map available"
I1230 14:38:28.211683       1 config/config.go:68] Setting up cassandra-rackdc.properties
I1230 14:38:28.211727       1 config/config.go:157] "unable to read properties" file="/mnt/scylla-config/cassandra-rackdc.properties"
I1230 14:38:28.211845       1 config/config.go:72] Setting up entrypoint script
I1230 14:38:28.227197       1 config/config.go:253] "Scylla version detected" version={version:{Major:4 Minor:5 Patch:1 Pre:[] Build:[]} unknown:false}
I1230 14:38:28.227270       1 config/config.go:282] "Scylla entrypoint" Command="/docker-entrypoint.py --developer-mode=0 --overprovisioned=1 --smp=1 --prometheus-address=0.0.0.0 --listen-address=0.0.0.0 --broadcast-address=10.245.89.175 --broadcast-rpc-address=10.245.89.175 --seeds=10.245.89.175"
I1230 14:38:28.227340       1 cache/shared_informer.go:240] Waiting for caches to sync for Prober
I1230 14:38:28.227358       1 cache/shared_informer.go:247] Caches are synced for Prober 
I1230 14:38:28.227367       1 operator/sidecar.go:414] "Starting Prober server"
I1230 14:38:28.227599       1 sidecar/controller.go:170] "Starting controller" Controller="SidecarController"
I1230 14:38:28.227611       1 cache/shared_informer.go:240] Waiting for caches to sync for SidecarController
I1230 14:38:28.227619       1 cache/shared_informer.go:247] Caches are synced for SidecarController 
running: (['/opt/scylladb/scripts/scylla_dev_mode_setup', '--developer-mode', '0'],)
running: (['/opt/scylladb/scripts/scylla_io_setup'],)
ERROR:root:Filesystem at /var/lib/scylla/data has only 9910345728 bytes available; that is less than the recommended 10 GB. Please free up space and run scylla_io_setup again.

failed!
Traceback (most recent call last):
  File "/docker-entrypoint.py", line 27, in <module>
    setup.io()
  File "/scyllasetup.py", line 67, in io
    self._run(['/opt/scylladb/scripts/scylla_io_setup'])
  File "/scyllasetup.py", line 37, in _run
    subprocess.check_call(*args, **kwargs)
  File "/opt/scylladb/python3/lib64/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/scylladb/scripts/scylla_io_setup']' returned non-zero exit status 1.
E1230 14:38:31.835289       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:38:41.835672       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:38:51.836392       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:01.835292       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:11.835123       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:21.835211       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:31.835776       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:41.834980       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:39:51.835903       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:01.835599       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:11.834676       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:21.835544       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:31.834940       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:41.835909       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:40:51.835099       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:01.835945       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:11.835740       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:21.836038       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:31.835636       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"
E1230 14:41:41.834754       1 sidecar/probes.go:172] "healthz probe: can't connect to JMX" err="dial tcp 10.244.2.190:10001: connect: connection refused" Service="scylla/scylla-scylla-us-east-1-us-east-1a-0"

I think we need more reasonable defaults: the default capacity should be raised to at least 15Gi. See https://github.com/scylladb/scylla-operator/blob/6e9424fa2c4206c1e3e6fd74b9398e5a36d91f26/helm/scylla/values.yaml#L58
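To make the numbers concrete, here is a rough sketch of the arithmetic behind the failure. It uses only the requested capacity from the manifest and the byte count from the scylla_io_setup error. Two things are assumptions: that the provisioned device is exactly 10 GiB, and that a 15Gi volume would lose the same proportion of space (real filesystem overhead depends on the storage class and filesystem). The error text says "10 GB" without specifying decimal or binary units, so both readings are checked.

```python
# Figures taken from the ScyllaCluster manifest and the scylla_io_setup error.
GIB = 1024 ** 3

requested_bytes = 10 * GIB        # storage.capacity: 10Gi (assumed exact device size)
available_bytes = 9_910_345_728   # "has only 9910345728 bytes available"

# The message says "10 GB"; which unit the script means is an assumption,
# so both interpretations are evaluated.
threshold_decimal = 10 * 10 ** 9
threshold_binary = 10 * GIB

# Space that is not usable on the freshly provisioned volume.
overhead_bytes = requested_bytes - available_bytes
overhead_fraction = overhead_bytes / requested_bytes

print(f"overhead: {overhead_bytes} bytes ({overhead_fraction:.1%})")
print("10Gi passes decimal threshold:", available_bytes >= threshold_decimal)
print("10Gi passes binary threshold:", available_bytes >= threshold_binary)

# Hypothetical estimate: a 15Gi volume with the same proportional overhead.
estimated_15gi = 15 * GIB * available_bytes // requested_bytes
print("estimated available on 15Gi:", estimated_15gi)
print("15Gi passes binary threshold:", estimated_15gi >= threshold_binary)
```

Under these assumptions, about 7.7% of the 10Gi volume is unavailable, which is enough to fall below the 10 GB check under either reading, while a 15Gi volume would clear it with room to spare.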

gecube avatar Dec 30 '21 14:12 gecube

Yeah, I guess there is some filesystem overhead, and we should raise the default.

tnozicka avatar Dec 30 '21 16:12 tnozicka

The issue is still very much live

Anik-saha avatar Aug 14 '22 20:08 Anik-saha

The issue is still very much live

violinorg avatar Dec 26 '22 20:12 violinorg

The issue is still very much live

The patch has not been merged yet, but you can still give feedback: does it solve the issue for you?

mykaul avatar Dec 27 '22 07:12 mykaul

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

/remove-lifecycle stale

gecube avatar Jun 24 '24 06:06 gecube