Possible connection leak in Kafka scaler
Report
There appears to be a connection leak in the Kafka scaler.
Scenario: KEDA had been working fine with a Kafka scaler. The MSK cluster ran out of disk space, which made it inaccessible to KEDA. We resolved the disk space issue by scaling up the broker storage. After this, KEDA appeared to return to normal; however, one of the MSK brokers was reporting an unusually high number of connections and requests from the KEDA metrics apiserver (thousands of requests and connections in a 5m period). Upon restart of the apiserver pod, the connection and request count goes back to normal (<30 connections/requests to the brokers in a 5m period).
We have two different KEDA deployments pointing to this MSK cluster (watching different consumer groups). Both behaved the same way. This is a problem we've been trying to root-cause for two weeks now, and only today were we able to track down the source of the connections to KEDA. I have a CloudWatch graph of the past two weeks where we tried rebooting brokers, etc., but nothing seemed to eliminate the TCP connections until today, when we rebooted the two KEDA metrics apiserver pods.
I can try to obtain the logs from these pods if that would be useful, but they may be extremely verbose and large (we still have trace logging enabled, as we only recently started using KEDA and were having issues properly configuring scaled objects for Kafka). Please let me know what would be useful for troubleshooting this further if this has not been enough data.
Expected Behavior
Interruptions in Kafka server availability should not inadvertently degrade the cluster's performance once service is restored.
Actual Behavior
Once the cluster was restored, the increase in connections from the scaler put the MSK cluster in an extremely poor state. These specific brokers (instance types) are not designed to handle the thousands of connections the apiserver was throwing at them.
Steps to Reproduce the Problem
- See description
Logs from KEDA operator
See description
KEDA Version
2.7.1
Kubernetes Version
1.21
Platform
Amazon Web Services
Scaler Details
Apache Kafka
Anything else?
No response
Thanks for reporting this!
To clarify, you can see those connections only in the KEDA Metrics Server, not in the Operator, right? We need to track down whether this is bad behavior in KEDA or whether the problem is in the Sarama client itself 🤔
Could you please post your ScaledObject definition here? Do you use fallback? It might help track the issue down. Also, roughly how many ScaledObjects do you have? Thanks!
To clarify, you can see those connections only in the KEDA Metrics Server, not in the Operator, right? We need to track down whether this is bad behavior in KEDA or whether the problem is in the Sarama client itself 🤔
Correct. We only saw the connections coming from the apiserver, not the operator.
Could you please post your ScaledObject definition here? Do you use fallback? It might help track the issue down. Also, roughly how many ScaledObjects do you have? Thanks!
We only have a single ScaledObject. I'm not familiar with 'fallback'.
```yaml
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 450
            type: Percent
            value: 20
          - periodSeconds: 450
            type: Pods
            value: 1
          stabilizationWindowSeconds: 300
        scaleUp:
          policies:
          - periodSeconds: 450
            type: Percent
            value: 100
          - periodSeconds: 450
            type: Pods
            value: 2
          stabilizationWindowSeconds: 300
  maxReplicaCount: 40
  minReplicaCount: 1
  scaleTargetRef:
    kind: StatefulSet
    name: logstash
  triggers:
  - authenticationRef:
      name: kafka-tls
    metadata:
      allowIdleConsumers: "false"
      bootstrapServers: <list of kafka brokers; ie b-1.aws:9094,b-2.aws:9094,b-3.aws:9094>
      consumerGroup: logstash-group
      lagThreshold: "180"
      offsetResetPolicy: latest
      topic: filebeat-logging
    type: kafka
```
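For reference on the fallback question above: fallback is an optional section of the ScaledObject spec that, after a scaler has failed to get metrics a number of times in a row (for example while the brokers are unreachable), makes KEDA report a fixed replica count to the HPA instead. A minimal sketch of how it could be added to the spec above; the numbers are purely illustrative, not a recommendation:

```yaml
spec:
  fallback:
    failureThreshold: 3   # consecutive failed metric fetches before falling back
    replicas: 6           # replica count reported to the HPA while the scaler is failing
  # ... rest of the spec as above ...
```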
1 SO and thousands of connections? 🤔 That's not good. It would be great if we could somehow isolate the problem and have a reproducer that doesn't depend on your setup, e.g. with Strimzi or something like that.
1 SO and thousands of connections? 🤔 That's not good. It would be great if we could somehow isolate the problem and have a reproducer that doesn't depend on your setup, e.g. with Strimzi or something like that.
Technically, for the Kafka cluster there were two SOs pointing at it: two different k8s clusters, each running the SO above, with only the topic and consumer group being different.
The dip is when we rebooted each of the apiserver pods on the two different clusters. The value/scale is the number of connections per minute being made to each broker.
Yeah. And until there were problems with the broker, you didn't see any rise in the number of connections, right?
Correct. Until we ran out of storage space on the brokers, the connection count from the apiserver was minimal.
Are you able to somehow reproduce the problem? It would be awesome if you could. There's one part of the code that could be a potential culprit, though I'd need a way to reproduce the problem to confirm that I'm right and that the proposed fix works as expected.
Sorry, no. We do not have a test environment where I can reproduce this issue. This happened on a semi-important system and we can't have another outage like this.
Just for reference, adding the link to the potential problem: https://github.com/kedacore/keda/blob/af80d84b4e95536bde4d3b4ee51785214d01dac9/pkg/scaling/cache/scalers_cache.go#L199-L209
```go
sb := c.Scalers[id]
ns, err := sb.Factory()
if err != nil {
    return nil, err
}
c.Scalers[id] = ScalerBuilder{
    Scaler:  ns,
    Factory: sb.Factory,
}
sb.Scaler.Close(ctx)
return ns, nil
```
The proposal is to close the old scaler via defer right after it is fetched from the cache, so it is closed on every return path; as written, the early return when Factory() fails skips Close, leaving the old client's broker connections open:

```go
sb := c.Scalers[id]
defer sb.Scaler.Close(ctx)
ns, err := sb.Factory()
if err != nil {
    return nil, err
}
c.Scalers[id] = ScalerBuilder{
    Scaler:  ns,
    Factory: sb.Factory,
}
return ns, nil
```
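To make the suspected mechanism concrete, here is a minimal, self-contained sketch (not KEDA code; the broker address and refresh loop are hypothetical) of how repeatedly recreating a Sarama client without closing the previous one on every path can keep old broker connections open:

```go
package main

import (
    "log"
    "time"

    "github.com/Shopify/sarama"
)

// refresh mimics a scaler refresh: it builds a new Kafka client to replace
// the old one. Closing the old client in a defer guarantees its broker
// connections are released on every return path, including the error path.
func refresh(old sarama.Client, brokers []string) (sarama.Client, error) {
    if old != nil {
        defer old.Close()
    }

    cfg := sarama.NewConfig()
    newClient, err := sarama.NewClient(brokers, cfg)
    if err != nil {
        // Without the defer above, returning here would keep the old
        // client's TCP connections to the brokers open.
        return nil, err
    }
    return newClient, nil
}

func main() {
    // Hypothetical broker list; point this at reachable brokers to run it.
    brokers := []string{"b-1.example:9094"}

    var client sarama.Client
    for i := 0; i < 5; i++ {
        c, err := refresh(client, brokers)
        if err != nil {
            log.Printf("refresh failed: %v", err)
            client = nil // the old client was already closed by refresh
            continue
        }
        client = c
        time.Sleep(2 * time.Second)
    }
    if client != nil {
        _ = client.Close()
    }
}
```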
I'm trying to replicate this issue but I cannot. Could you give me any tips for reproducing it? I've tried dropping the cluster connection, but I can't see any connection leak. I'm deploying an MSK cluster and I'll try to replicate the disk issues, but maybe you can give me some tips.
I have tried the same, locking up the disks by filling them (with KEDA connected the whole time), and I haven't been able to reproduce your issue. Once I resolved the disk problems by increasing the disk size, KEDA started again without any huge increase in connections. I'd need more info about how to reproduce it to continue.
I have tried with a Kafka cluster deployed using Strimzi.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.