[Bug]: Unstable Jaeger deployment with Cassandra; Cassandra StatefulSet keeps failing
What happened?
Cassandra stateful set is not stable and keeps crashing.
Steps to reproduce
- Install the OTel SDK on an app.
- Install the latest Jaeger Helm chart (1.0.0).
Expected behavior
Jaeger available with all pods running stable.
Relevant log output
INFO [main] 2024-02-28 09:58:00,582 QueryProcessor.java:163 - Preloaded 0 prepared statements
INFO [main] 2024-02-28 09:58:00,582 StorageService.java:657 - Cassandra version: 3.11.6
INFO [main] 2024-02-28 09:58:00,582 StorageService.java:658 - Thrift API version: 20.1.0
INFO [main] 2024-02-28 09:58:00,582 StorageService.java:659 - CQL supported versions: 3.4.4 (default: 3.4.4)
INFO [main] 2024-02-28 09:58:00,582 StorageService.java:661 - Native protocol supported versions: 3/v3, 4/v4, 5/v5-beta (default: 4/v4)
INFO [main] 2024-02-28 09:58:00,599 IndexSummaryManager.java:87 - Initializing index summary manager with a memory pool size of 99 MB and a resize interval of 60 minutes
INFO [main] 2024-02-28 09:58:00,604 MessagingService.java:750 - Starting Messaging Service on /10.50.26.33:7000 (eth0)
INFO [main] 2024-02-28 09:58:00,619 OutboundTcpConnection.java:108 - OutboundTcpConnection using coalescing strategy DISABLED
INFO [HANDSHAKE-jaeger-solutions-cassandra-0.jaeger-solutions-cassandra.jaeger-solutions.svc.cluster.local/10.50.30.49] 2024-02-28 09:58:00,628 OutboundTcpConnection.java:561 - Handshaking version with jaeger-solutions-cassandra-0.jaeger-solutions-cassandra.jaeger-solutions.svc.cluster.local/10.50.30.49
INFO [ScheduledTasks:1] 2024-02-28 09:58:03,885 TokenMetadata.java:517 - Updating topology for all endpoints that have changed
Exception (java.lang.UnsupportedOperationException) encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:613)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:844)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:703)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757)
ERROR [main] 2024-02-28 09:58:06,635 CassandraDaemon.java:774 - Exception encountered during startup
java.lang.UnsupportedOperationException: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap while cassandra.consistent.rangemovement is true
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:613) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:844) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:703) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
INFO [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 HintsService.java:209 - Paused hints dispatch
WARN [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 Gossiper.java:1655 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO [StorageServiceShutdownHook] 2024-02-28 09:58:06,637 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO [ACCEPT-/10.50.26.33] 2024-02-28 09:58:06,638 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO [StorageServiceShutdownHook] 2024-02-28 09:58:06,759 HintsService.java:209 - Paused hints dispatch
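The exception above means another Cassandra node was seen bootstrapping, leaving, or moving at the moment this replica tried to join; with consistent range movement enabled (the default), only one node may join the ring at a time. A first diagnostic step is to check the ring from a healthy replica, e.g. `kubectl -n jaeger-solutions exec jaeger-solutions-cassandra-0 -- nodetool status`, and look for any node whose state is not `UN` (Up/Normal). A minimal sketch of that filtering, over a hypothetical captured output (addresses borrowed from the logs above; states and host IDs invented for illustration):

```shell
# Hypothetical `nodetool status` capture; only the two-letter state column
# matters here: UN = Up/Normal, UJ = Up/Joining, DN = Down/Normal.
cat > /tmp/nodetool-status.txt <<'EOF'
--  Address       Load     Tokens  Owns  Host ID                               Rack
UN  10.50.30.49   1.2 GiB  256     ?     11111111-1111-1111-1111-111111111111  rack1
UJ  10.50.26.33   100 KiB  256     ?     22222222-2222-2222-2222-222222222222  rack1
DN  10.50.13.161  1.1 GiB  256     ?     33333333-3333-3333-3333-333333333333  rack1
EOF
# Print every node that is not fully joined; any hit here explains the
# "cannot bootstrap while cassandra.consistent.rangemovement is true" error.
awk '$1 ~ /^[UD][A-Z]$/ && $1 != "UN" {print $1, $2}' /tmp/nodetool-status.txt
```

With the joining or down node identified, the usual options are to wait for it to finish joining, or to bring it back (or replace it) before restarting the crashlooping pod.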
Additional context
Running Jaeger on a dedicated namespace on EKS.
Jaeger backend version
1.53.0
SDK
OpenTelemetry SDK.
Pipeline
No response
Storage backend
Cassandra
Operating system
Linux
Deployment model
Kubernetes
Deployment configs
provisionDataStore:
  cassandra: true
  elasticsearch: false
  kafka: false
agent:
  enabled: false
query:
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - jaeger-ui-solutions.internal.lightrun.com
  config: |-
    {
      "dependencies": {
        "dagMaxNumServices": 200,
        "menuEnabled": true
      },
      "archiveEnabled": true,
      "tracking": {
        "gaID": "UA-000000-2",
        "trackErrors": true
      }
    }
cassandra:
  resources:
    requests:
      memory: 10Gi
      cpu: 6
    limits:
      memory: 16Gi
      cpu: 10
collector:
  service:
    otlp:
      grpc:
        name: otlp-grpc
        port: 4317
      http:
        name: otlp-http
        port: 4318
Try the latest version, 1.0.2.
I upgraded to 1.0.2 and used a node selector to schedule onto more stable nodes (not spot instances). It works now; I'll see whether it stays stable and update.
Close the issue if it's sorted.
I still can't seem to make Jaeger stable; I got these errors:
ERROR [main] 2024-04-11 08:29:47,486 CassandraDaemon.java:774 - Exception encountered during startup
java.lang.RuntimeException: A node required to move the data consistently is down (/10.50.13.161). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
    at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithStrictSourcesFor(RangeStreamer.java:294) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:177) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1530) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1024) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:718) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:652) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:397) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:630) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:757) [apache-cassandra-3.11.6.jar:3.11.6]
INFO [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 HintsService.java:209 - Paused hints dispatch
WARN [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 Gossiper.java:1655 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO [StorageServiceShutdownHook] 2024-04-11 08:29:47,488 MessagingService.java:985 - Waiting for messaging service to quiesce
INFO [ACCEPT-/10.50.10.10] 2024-04-11 08:29:47,489 MessagingService.java:1346 - MessagingService has terminated the accept() thread
Looks similar. I ran into this with the 3.0.10 chart; one of the pods keeps crashing:
jaeger-cassandra-0 1/1 Running 0 13d 10.0.3.24 c21 <none> <none>
jaeger-cassandra-1 0/1 CrashLoopBackOff 6 (2m7s ago) 12m 10.0.10.216 c34 <none> <none>
jaeger-cassandra-2 1/1 Running 0 46d 10.0.0.47 p11 <none> <none>
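For a replica stuck like jaeger-cassandra-1 above, and given that Jaeger trace data is usually expendable, one common recovery is to wipe the stuck replica's volume so it re-bootstraps as a brand-new node. This is a sketch under assumptions: the namespace is hypothetical, and the PVC name follows the StatefulSet's usual `data-<pod-name>` volumeClaimTemplates convention; verify both with `kubectl get pvc` before deleting anything.

```shell
# ASSUMED names: namespace "tracing" and PVC "data-jaeger-cassandra-1" are
# illustrative; check `kubectl get pvc -n <namespace>` for the real ones.
# 1. Mark the PVC for deletion (it is only removed once no pod uses it)...
kubectl -n tracing delete pvc data-jaeger-cassandra-1 --wait=false
# 2. ...then delete the pod; the StatefulSet recreates pod and volume,
#    and the empty node bootstraps cleanly into the ring.
kubectl -n tracing delete pod jaeger-cassandra-1
```

This loses only the data on that one replica; with a replication factor above 1 the other nodes stream it back during bootstrap.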
Same issue here. K8s cluster on our own machines: 3 nodes, 3 Cassandra pods deployed via Jaeger's chart. The nodes were recently upgraded and rebooted. Now the jaeger-cassandra-2 pod is in a crash loop, complaining about the lack of a Cassandra pod on the third node, which I assume was rebooted last.
Cassandra logs mention: (...) restart the node with -Dcassandra.consistent.rangemovement=false
I have yet to figure out where exactly to add such a flag. In Jaeger's chart, adding this:
storage:
  cassandra:
    cmdlineParams:
      cassandra.consistent.rangemovement: false
didn't seem to do anything.
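`cmdlineParams` may simply not be a key the bundled Cassandra subchart consumes, which would explain the lack of effect. A route that works with the stock Apache Cassandra image is the `JVM_EXTRA_OPTS` environment variable: the image's `cassandra-env.sh` appends its value to the server's JVM arguments, so setting it on the Cassandra container passes the flag through. A minimal sketch (the chart values key for injecting env vars into the Cassandra pods, e.g. something like `cassandra.extraEnvVars`, varies by chart version and is an assumption; check the subchart's `values.yaml`):

```shell
# Setting this env var on the Cassandra container is enough: the stock
# cassandra-env.sh appends $JVM_EXTRA_OPTS to the JVM argument list.
export JVM_EXTRA_OPTS="-Dcassandra.consistent.rangemovement=false"
echo "extra JVM args: ${JVM_EXTRA_OPTS}"
```

Note that the flag disables the safety check preventing concurrent range movements, so treat it as a last resort after confirming the surviving replicas are otherwise healthy, and remove it again once the node has joined.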