
Possible connection leak in Kafka scaler

Open ryan-dyer-sp opened this issue 2 years ago • 12 comments

Report

There appears to be a connection leak in the Kafka scaler.

Scenario: We had KEDA working fine with a Kafka scaler. The MSK cluster ran out of disk space, which made it inaccessible to KEDA. We resolved the disk space issue by scaling the broker storage. After this, KEDA appeared to return to normal; however, one of the MSK brokers was reporting an unusually high number of connections and requests from the KEDA metrics apiserver (thousands of requests and connections in a 5-minute period). Upon restart of the apiserver pod, the connection and request counts go back to normal (<30 connections/requests to the brokers in a 5-minute period).

We have two different KEDA deployments pointing to this MSK cluster (watching different consumer groups). Both behaved the same way. This is a problem we've been trying to root-cause for two weeks now, and only today were we able to track the source of the connections down to KEDA. I have a CloudWatch graph of the past two weeks where we tried rebooting brokers, etc., but nothing eliminated the TCP connections until today, when we rebooted the two KEDA metrics apiserver pods.

I can try to obtain the logs from these pods if that would be useful, but they may be extremely verbose and large (we still have trace logging enabled, as we only recently started using KEDA and were having issues properly configuring scaled objects for Kafka). Please let me know what would be useful for troubleshooting this further if this has not been enough data.

Expected Behavior

Interruptions in Kafka server availability should not inadvertently degrade the cluster's performance once service is restored.

Actual Behavior

Once the cluster was restored, the increase in connections from the scaler put the MSK cluster in an extremely poor state. These specific brokers (instance types) are not designed to handle the thousands of connections the apiserver was throwing at them.

Steps to Reproduce the Problem

  1. See description

Logs from KEDA operator

See description

KEDA Version

2.7.1

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

Apache Kafka

Anything else?

No response

ryan-dyer-sp avatar Jun 09 '22 16:06 ryan-dyer-sp

Thanks for reporting this!

To clarify, you can see those connections only in the KEDA Metrics Server, not in the Operator, right? We need to track down whether this is bad behavior in KEDA or whether the problem is in the Sarama client itself 🤔

Could you please post your ScaledObject definition here? Do you use fallback? It might help track the issue down. Also, what is the approximate number of SOs that you have? Thanks!

zroubalik avatar Jun 14 '22 09:06 zroubalik

To clarify, you can see those connections only in the KEDA Metrics Server, not in the Operator, right? We need to track down whether this is bad behavior in KEDA or whether the problem is in the Sarama client itself 🤔

Correct. We only saw the connections coming from the apiserver, not the operator.

Could you please post your ScaledObject definition here? Do you use fallback? It might help track the issue down. Also, what is the approximate number of SOs that you have? Thanks!

We only have a single ScaledObject. I'm not familiar with 'fallback'.

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 450
            type: Percent
            value: 20
          - periodSeconds: 450
            type: Pods
            value: 1
          stabilizationWindowSeconds: 300
        scaleUp:
          policies:
          - periodSeconds: 450
            type: Percent
            value: 100
          - periodSeconds: 450
            type: Pods
            value: 2
          stabilizationWindowSeconds: 300
  maxReplicaCount: 40
  minReplicaCount: 1
  scaleTargetRef:
    kind: StatefulSet
    name: logstash
  triggers:
  - authenticationRef:
      name: kafka-tls
    metadata:
      allowIdleConsumers: "false"
      bootstrapServers: <list of kafka brokers; ie b-1.aws:9094,b-2.aws:9094,b-3.aws:9094>
      consumerGroup: logstash-group
      lagThreshold: "180"
      offsetResetPolicy: latest
      topic: filebeat-logging
    type: kafka
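
For reference, the 'fallback' asked about above is an optional block in the ScaledObject spec, at the same level as triggers; it tells KEDA how many replicas to fall back to after a number of consecutive scaler failures. A minimal sketch of what it would look like if added to the spec above (the threshold and replica values are placeholders, not recommendations):

fallback:
  failureThreshold: 3
  replicas: 6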

ryan-dyer-sp avatar Jun 14 '22 15:06 ryan-dyer-sp

1 SO and thousands of connections? 🤔 That's not good. It would be great if we could somehow isolate the problem and have a reproducer that doesn't depend on your setup (using Strimzi or something like that).

zroubalik avatar Jun 15 '22 11:06 zroubalik

1 SO and thousands of connections? 🤔 That's not good. It would be great if we could somehow isolate the problem and have a reproducer that doesn't depend on your setup (using Strimzi or something like that).

Technically, there were two SOs pointing at the Kafka cluster: two different k8s clusters, each running the SO above, with only the topic and consumer group differing.

ryan-dyer-sp avatar Jun 15 '22 12:06 ryan-dyer-sp

[image: CloudWatch connection-count graph] The dip is when we rebooted each of the API server pods on the two different clusters. The value/scale is the number of connections per minute being made to each broker.

ryan-dyer-sp avatar Jun 15 '22 12:06 ryan-dyer-sp

Yeah. And as long as there are no problems with the broker, you don't see any rise in the number of connections, right?

zroubalik avatar Jun 15 '22 13:06 zroubalik

Correct. Until we ran out of storage space on the brokers, the connection count from the apiserver was minimal. [image: CloudWatch graph]

ryan-dyer-sp avatar Jun 15 '22 13:06 ryan-dyer-sp

Are you able to somehow reproduce the problem? It would be awesome if you could. There's one part of the code that is a potential culprit, though I'd need a way to reproduce the problem to confirm that I'm right and that the proposed fix works as expected.

zroubalik avatar Jun 15 '22 19:06 zroubalik

Sorry, no. We do not have a test environment where I can reproduce this issue. This happened on a semi-important system, and we can't have another outage like this.

ryan-dyer-sp avatar Jun 15 '22 19:06 ryan-dyer-sp

Just for reference, adding the link to the potential problem: https://github.com/kedacore/keda/blob/af80d84b4e95536bde4d3b4ee51785214d01dac9/pkg/scaling/cache/scalers_cache.go#L199-L209

    sb := c.Scalers[id]
    ns, err := sb.Factory()
    if err != nil {
        return nil, err
    }

    c.Scalers[id] = ScalerBuilder{
        Scaler:  ns,
        Factory: sb.Factory,
    }
    sb.Scaler.Close(ctx)

    return ns, nil

->

    sb := c.Scalers[id]
    // Close the old scaler once the refresh attempt finishes,
    // even if the factory call fails.
    defer sb.Scaler.Close(ctx)

    ns, err := sb.Factory()
    if err != nil {
        return nil, err
    }

    c.Scalers[id] = ScalerBuilder{
        Scaler:  ns,
        Factory: sb.Factory,
    }

    return ns, nil
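
To make the intent of that change concrete, here is a small self-contained sketch of the deferred-close refresh pattern. The types below (Scaler, ScalerBuilder, ScalersCache, fakeScaler) are simplified stand-ins for illustration, not the actual KEDA implementation:

    package main

    import (
        "context"
        "fmt"
    )

    // Scaler is a simplified stand-in for KEDA's scaler interface:
    // something that may hold open connections and must be closed.
    type Scaler interface {
        Close(ctx context.Context) error
    }

    // ScalerBuilder pairs a live scaler with the factory that can rebuild it.
    type ScalerBuilder struct {
        Scaler  Scaler
        Factory func() (Scaler, error)
    }

    // ScalersCache is a minimal cache keyed by scaler index.
    type ScalersCache struct {
        Scalers map[int]ScalerBuilder
    }

    // refreshScaler rebuilds the scaler stored under id. Closing the previous
    // scaler is deferred, so its connections are released whether or not the
    // factory call succeeds.
    func (c *ScalersCache) refreshScaler(ctx context.Context, id int) (Scaler, error) {
        sb, ok := c.Scalers[id]
        if !ok {
            return nil, fmt.Errorf("scaler with id %d not found", id)
        }
        defer sb.Scaler.Close(ctx)

        ns, err := sb.Factory()
        if err != nil {
            return nil, err
        }

        c.Scalers[id] = ScalerBuilder{
            Scaler:  ns,
            Factory: sb.Factory,
        }
        return ns, nil
    }

    // fakeScaler lets the sketch run without a real Kafka connection.
    type fakeScaler struct{ name string }

    func (f *fakeScaler) Close(ctx context.Context) error {
        fmt.Println("closed", f.name)
        return nil
    }

    func main() {
        n := 0
        factory := func() (Scaler, error) {
            n++
            return &fakeScaler{name: fmt.Sprintf("scaler-%d", n)}, nil
        }

        first, _ := factory()
        cache := &ScalersCache{Scalers: map[int]ScalerBuilder{
            0: {Scaler: first, Factory: factory},
        }}

        // The refresh closes scaler-1 and replaces it with scaler-2.
        if _, err := cache.refreshScaler(context.Background(), 0); err != nil {
            fmt.Println("refresh failed:", err)
        }
    }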

zroubalik avatar Jul 14 '22 13:07 zroubalik

I'm trying to replicate this issue but I cannot. Could you give me any tips on reproducing it? I've tried dropping the cluster connection, but I can't see any connection leak. I'm deploying an MSK cluster and will try to replicate the disk issues, but maybe you have some pointers.

JorTurFer avatar Jul 28 '22 18:07 JorTurFer

I have tried the same, locking up the disks by filling them (with KEDA connected the whole time), and I haven't been able to reproduce your issue. Once I resolved the disk problem by increasing the disk size, KEDA started working again without any huge increase in connections. I'd need more info about how to reproduce it in order to continue. I tried with a Kafka cluster deployed using Strimzi.

JorTurFer avatar Jul 28 '22 23:07 JorTurFer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 27 '22 03:09 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Oct 04 '22 04:10 stale[bot]