
Errors like `alternator: get node info: no host config available` and `CQL: no host config available` when running `sctool status` after an update

Open gdubicki opened this issue 7 months ago • 12 comments

What happened?

After updating Scylla from 5.2.9 to 5.4.7, Scylla Operator from 1.9.x to 1.12.2 (the latest version that supports both Scylla 5.2.x and 5.4.x), and Scylla Manager from 3.1.x to 3.2.8, sctool status no longer shows all the node info and returns errors:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla/scylla
Datacenter: XXX
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST     | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.241.130 | -      | -    | -      | -      | -     | 8a24c600-5525-490e-a3cd-314f6062d6a1 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (6ms) | 10.7.241.174 | -      | -    | -      | -      | -     | f14fcd59-8d90-4d8e-af22-ace87ceced22 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.241.175 | -      | -    | -      | -      | -     | 050dcc67-7bb8-4d5d-89b1-5dbe0bcbb8b2 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (5ms) | 10.7.243.109 | -      | -    | -      | -      | -     | 4a3ff045-bba2-4537-a4d7-a213d25ae713 |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.248.124 | -      | -    | -      | -      | -     | 028023f5-9d4e-404c-8537-467ac3d4538c |
| UN | ERROR (0ms) | ERROR (0ms) | UP (1ms) | 10.7.249.238 | -      | -    | -      | -      | -     | b8f68c62-c462-4a30-a505-5ece9ae1ab0b |
| UN | ERROR (0ms) | ERROR (0ms) | UP (0ms) | 10.7.252.229 | -      | -    | -      | -      | -     | 1ff1b8df-7a90-4321-a309-7cd69e20bd70 |
+----+-------------+-------------+----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
- 10.7.241.174 alternator: get node info: no host config available
- 10.7.241.174 CQL: no host config available
- 10.7.241.175 alternator: get node info: no host config available
- 10.7.241.175 CQL: no host config available
- 10.7.243.109 alternator: get node info: no host config available
- 10.7.243.109 CQL: no host config available
- 10.7.248.124 alternator: get node info: no host config available
- 10.7.248.124 CQL: no host config available
- 10.7.249.238 alternator: get node info: no host config available
- 10.7.249.238 CQL: no host config available
- 10.7.252.229 alternator: get node info: no host config available
- 10.7.252.229 CQL: no host config available
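
For anyone triaging similar output, here is a minimal sketch (not part of sctool) that groups the lines under `Errors:` by message, so it is easy to see whether every node reports the same failure. The function name `summarize_errors` is hypothetical, and the parsing assumes the `- <address> <message>` layout shown in the output above.

```python
from collections import defaultdict


def summarize_errors(status_output: str) -> dict[str, list[str]]:
    """Group the bullet lines under 'Errors:' by message, listing affected hosts.

    Assumes each error line looks like '- <address> <message>', as in the
    sctool status output pasted in this issue.
    """
    by_message = defaultdict(list)
    in_errors = False
    for raw in status_output.splitlines():
        line = raw.strip()
        if line == "Errors:":
            in_errors = True
            continue
        if not in_errors:
            continue
        if not line.startswith("- "):
            # A blank or non-bullet line ends the error list.
            in_errors = False
            continue
        host, _, message = line[2:].partition(" ")
        by_message[message].append(host)
    return dict(by_message)


# Sample input: a shortened copy of the Errors section from this issue.
sample = """\
Errors:
- 10.7.241.130 alternator: get node info: no host config available
- 10.7.241.130 CQL: no host config available
- 10.7.241.174 alternator: get node info: no host config available
- 10.7.241.174 CQL: no host config available
"""

for message, hosts in summarize_errors(sample).items():
    print(f"{len(hosts)} host(s): {message}")
```

In the output above, all seven nodes report the same two messages, which looks more like a cluster-wide configuration problem than a fault on individual nodes.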

Note that our scylla.yaml didn't have any config for TLS up to that point.

We worked around this problem by setting the following:

client_encryption_options:
  optional: true

However, we still have a problem with Scylla Manager's own cluster:

$ kubectl exec -it deployments/scylla-manager -n scylla-manager -- sctool status --cluster scylla-manager/scylla-manager
Datacenter: manager-dc
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
|    | Alternator  | CQL         | REST      | Address      | Uptime | CPUs | Memory | Scylla | Agent | Host ID                              |
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
| UN | ERROR (0ms) | ERROR (0ms) | UP (92ms) | 10.7.255.190 | -      | -    | -      | -      | -     | 8ec8a729-8225-4278-a9da-ad0f23f47e01 |
+----+-------------+-------------+-----------+--------------+--------+------+--------+--------+-------+--------------------------------------+
Errors:
- 10.7.255.190 alternator: get node info: no host config available
- 10.7.255.190 CQL: no host config available

...and it seems to only have a generated ConfigMap named scylladb-managed-config:

apiVersion: v1
data:
  scylladb-managed-config.yaml: |
    cluster_name: "scylla"
    rpc_address: "0.0.0.0"
    endpoint_snitch: "GossipingPropertyFileSnitch"
    internode_compression: "all"
    native_transport_port_ssl: 9142
    native_shard_aware_transport_port_ssl: 19142
    client_encryption_options:
      enabled: true
      optional: false
      certificate: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/serving-certs/tls.crt"
      keyfile: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/serving-certs/tls.key"
      require_client_auth: true
      truststore: "/var/run/secrets/scylla-operator.scylladb.com/scylladb/client-ca/tls.crt"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: scylla
    meta.helm.sh/release-namespace: scylla
    scylla-operator.scylladb.com/managed-hash: <redacted>
  creationTimestamp: "<redacted>"
  labels:
    app.kubernetes.io/managed-by: Helm
    scylla/cluster: scylla
  name: scylla-managed-config
  namespace: scylla
  ownerReferences:
  - apiVersion: scylla.scylladb.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: ScyllaCluster
    name: scylla
    uid: <redacted>
  resourceVersion: "<redacted>"
  uid: <redacted>

...and I can't find anything about modifying it in https://operator.docs.scylladb.com/stable/helm.html...

Since then we have updated Scylla to 5.4.9, Operator to 1.13.0, and Manager to 3.3.0 but it did not help.

What did you expect to happen?

After an update, sctool status should work without errors for both the main cluster and Scylla Manager's own cluster.

I shouldn't have to reconfigure TLS as the defaults shown in https://github.com/scylladb/scylladb/blob/scylla-5.4.7/conf/scylla.yaml#L472-L474 say that it should be disabled.

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up the pre-update versions mentioned above (Scylla 5.2.9, Scylla Operator 1.9.x, Scylla Manager 3.1.x)
  2. Use this scylla.yaml, as we had before:

     read_request_timeout_in_ms: 5000
     write_request_timeout_in_ms: 2000
     cas_contention_timeout_in_ms: 1000

     consistent_cluster_management: true

  3. Update to the newer versions mentioned above
  4. Check sctool status

Scylla Operator version

1.13.0

Kubernetes platform name and version

$ kubectl version
Client Version: v1.29.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5-gke.1192000

Please attach the must-gather archive.

scylla-operator-must-gather-77t6kvnghzss.zip

Anything else we need to know?

I additionally anonymized the must-gather archive by hand; see https://github.com/scylladb/scylla-operator/issues/2015.

This problem was originally reported in https://github.com/scylladb/scylla-manager/issues/3889, but that issue was about a (probably?) different problem, so I was advised to create a new one.

gdubicki, Jul 12 '24 15:07