
Kubeless kafka-trigger-controller performance

Open dap332 opened this issue 6 years ago • 13 comments

What happened: The kafka-trigger-controller seems to fire Kubeless functions at no more than 5 qps. I have made some simple functions that do not process data or even write to the console; they simply return event.data. Still, the function call rate never exceeds 5 qps. I am using Prometheus and Grafana for monitoring.

What you expected to happen: The invocation rate should not be capped at around 5 qps.

How to reproduce it (as minimally and precisely as possible): Use the out-of-the-box Kafka installation as shown in the Kubeless docs. Create a simple function that just logs or returns the data passed to it. I have not been able to see it exceed 5 qps.

Here is an example below:

module.exports = {
        // Minimal handler: return the event payload without any processing.
        consume: (event, context) => {
                // console.log(event.data);
                return event.data;
        }
}
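
For reference, this is roughly how such a setup can be reproduced with the Kubeless CLI (the file name, function name, topic name and the nodejs8 runtime below are illustrative, not taken from the original report):

# Deploy the handler above (saved as handler.js)
kubeless function deploy consume-test --runtime nodejs8 \
  --from-file handler.js --handler handler.consume

# Create a topic and a Kafka trigger bound to the function
kubeless topic create test-topic
kubeless trigger kafka create consume-trigger \
  --function-selector created-by=kubeless,function=consume-test \
  --trigger-topic test-topic

# Publish test messages and watch the invocation rate in Prometheus/Grafana
kubeless topic publish --topic test-topic --data "hello"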

Anything else we need to know?:

Environment:

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Kubeless version:
Kubeless version: v0.6.0

dap332 avatar Jun 20 '18 11:06 dap332

I observe identical behavior: 1668 function invocations in 333 seconds, which is almost exactly 5 qps. I have 3 different functions, invoked at different rates, and the combined rate is throttled at 5 qps. I am using the kafka-trigger-controller logs and the Kafka console consumer to monitor.

Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-18T23:58:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Kubeless version: v1.0.0-alpha.7

gheist avatar Aug 18 '18 04:08 gheist

There could be two possible causes for this issue:

  • Something wrong in the Kafka configuration.
  • Lack of parallelization when sending requests to the final function from the Kafka consumer (code here: https://github.com/kubeless/kafka-trigger/blob/master/pkg/event-consumers/kafka/kafka-consumer.go#L91).

I would say it's probably the second option. I have created a branch to check that. Could you test it? To do so, edit your current deployment (kubectl edit deployment -n kubeless kafka-trigger-controller) and change the image to bitnami/kafka-trigger-controller:build-3d1b2c3a-7ca8-4fd7-a6f6-816e54737852 (this is the CI job that generated that image: https://circleci.com/gh/kubeless/kafka-trigger/119).
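
For illustration only, here is a minimal sketch of what parallelizing that dispatch loop could look like; the messages channel, the stop channel and the sendMessage helper are stand-ins for the real sarama-cluster consumer and HTTP dispatch in kafka-consumer.go, not the contents of the branch:

package consumer

import (
	"log"

	"github.com/Shopify/sarama"
)

// sendMessage stands in for the HTTP POST the trigger controller makes to the
// function service; the real helper in kafka-consumer.go has a different signature.
func sendMessage(msg *sarama.ConsumerMessage) error {
	// ... HTTP POST of msg.Value to the function's service would go here ...
	_ = msg
	return nil
}

// consumeLoop handles each message in its own goroutine instead of blocking the
// loop on the HTTP call, so delivering one message does not throttle the next.
func consumeLoop(messages <-chan *sarama.ConsumerMessage, stop <-chan struct{}) {
	for {
		select {
		case msg, ok := <-messages:
			if !ok {
				return
			}
			go func(m *sarama.ConsumerMessage) {
				if err := sendMessage(m); err != nil {
					log.Printf("failed to deliver message at offset %d: %v", m.Offset, err)
				}
			}(msg)
		case <-stop:
			return
		}
	}
}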

andresmgot avatar Aug 20 '18 08:08 andresmgot

Hi,

So I did test out the new branch that you linked. At first it did not change anything (still 5 qps), but once I scaled the kafka-trigger-controller, throughput scaled at 5 qps per kafka-trigger-controller pod: 5 qps by default, 10 qps with 2 kafka-trigger-controller pods, and 15 qps with 3. I have not tested this with the regular non-parallelized build.

dap332 avatar Sep 04 '18 13:09 dap332

That was not what I was expecting; with the modified version, a single trigger-controller should be able to send more queries per second. I will try to reproduce the environment and investigate what the bottleneck might be.

andresmgot avatar Sep 05 '18 08:09 andresmgot

I have been checking this issue. I have modified the image to mark the offset of a message just after it is received. That means the consumer is ready to receive the next message without waiting for the current one to be delivered to the function. The new image ID is: bitnami/kafka-trigger-controller:build-1631f2e8-7486-4c96-a849-21c75dd145a6
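
Conceptually (a sketch, not the actual diff), the change marks the offset as soon as the message arrives and then delivers it in the background; this assumes a sarama-cluster consumer as used in kafka-consumer.go, and sendMessage is again a hypothetical stand-in for the HTTP dispatch:

package consumer

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

// sendMessage is a hypothetical stand-in for the HTTP dispatch to the function.
func sendMessage(msg *sarama.ConsumerMessage) error { return nil }

// runConsumer marks the offset as soon as a message arrives and then delivers
// it in a goroutine, so the loop never waits on the function call.
func runConsumer(c *cluster.Consumer) {
	for msg := range c.Messages() {
		c.MarkOffset(msg, "") // commit progress immediately
		go func(m *sarama.ConsumerMessage) {
			if err := sendMessage(m); err != nil {
				log.Printf("delivery failed for offset %d: %v", m.Offset, err)
			}
		}(msg)
	}
}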

Apart from that, if you want to be able to handle more qps you need to:

  • Increase the number of partitions in a topic. If you use the command kubeless topic create to create a topic, it sets the number of partitions to 1. If you want to receive messages in parallel you need a bigger number. This can be done by executing this command directly in the Kafka container: /opt/bitnami/kafka/bin/kafka-topics.sh --zookeeper zookeeper.kubeless:2181 --replication-factor 1 --partitions 5 --create --topic <your_topic_id>
  • Increase the number of consumers. As you mention, you can increase the number of consumers by scaling the kafka-trigger-controller deployment (see the example below).

This post can help clarify the relationship between topics, messages, partitions, and consumers: https://stackoverflow.com/questions/25602359/kafka-how-to-consume-one-topic-parallel
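
For example, to add consumers and to check how many partitions a topic actually has (the deployment name and namespace match the standard Kubeless install; the topic ID is a placeholder):

# Scale the trigger controller: each replica joins the same consumer group
kubectl scale deployment kafka-trigger-controller -n kubeless --replicas=3

# Inspect the partition count of a topic (run inside the Kafka pod)
/opt/bitnami/kafka/bin/kafka-topics.sh --zookeeper zookeeper.kubeless:2181 \
  --describe --topic <your_topic_id>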

andresmgot avatar Sep 06 '18 14:09 andresmgot

So I tested out this new image ID, and I also increased the partitions to 5.

What I found is that I still get 5 qps with 1 pod of kafka-trigger-controller, but scaling it up no longer increases qps like before. The logs from the newer pods deployed by scaling show that they are not receiving or pushing messages. I am not sure, but this may be because the original kafka-trigger-controller pod was manually set to a partition count of 5, while the newly spawned ones use the default value of 1.

I hope this helps.

dap332 avatar Sep 17 '18 14:09 dap332

@dap332 I haven't changed how the consumers are initialized, so the partitions should be the same. The only thing I changed is that messages are now handled in goroutines that run in parallel. Note that partitions are set when the topic is created; can you verify that the topic you are using has its partitions set to 5?

Also, we have released a version with the patch: can you try comparing versions v1.0.0-alpha.9 and v1.0.0-beta.0? Can you check whether, when upgrading the version and using the same topic, the kafka-trigger-controller behaves differently?
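
One way to switch the image in place for the comparison (deployment and container names as in the standard Kubeless manifests; adjust if yours differ):

# Point the existing deployment at the release to be tested
kubectl set image deployment/kafka-trigger-controller -n kubeless \
  kafka-trigger-controller=bitnami/kafka-trigger-controller:v1.0.0-beta.0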

Thanks in advance for helping debug this issue!

andresmgot avatar Sep 18 '18 10:09 andresmgot

bitnami/kafka-trigger-controller:v1.0.0-alpha.8, partitions = 5

[screenshot]

  1. Pod scaling: kafka-trigger-controller: 1, predict (function): 3, Kafka: 1, Zookeeper: 1. Observed QPS: 5

  2. Pod scaling: kafka-trigger-controller: 3, predict (function): 3, Kafka: 1, Zookeeper: 1. Observed QPS: 5

Kafka-trigger-controller logs from an inactive controller:

$ kubectl -n kubeless logs -f --tail=25 kafka-trigger-controller-6959dcd4cd-w5jtq 
time="2018-09-20T15:22:36Z" level=info msg="Running Kafka controller version: "
time="2018-09-20T15:22:36Z" level=info msg="Starting Kafka Trigger controller" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="Processing update to function object predict Namespace: sentiment" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="We got a Kafka trigger  predict-trigger that function predict need to be associated so create Kafka consumer" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="Creating Kafka consumer for the function predict associated with for trigger predict-trigger"
time="2018-09-20T15:22:36Z" level=info msg="Created Kafka consumer for the function predict associated with for trigger predict-trigger"
time="2018-09-20T15:22:36Z" level=info msg="Successfully created Kafka consumer for Function: predict" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="Successfully processed update to function object predict Namespace: sentiment" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="Started Kakfa consumer Broker: kafka.kubeless:9092, Topic: test-topic, Function: predict, consumerID: sentiment_predict-trigger_predict_test-topic"
time="2018-09-20T15:22:36Z" level=info msg="Kafka Trigger controller caches are synced and ready" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="Processing update to Kafka Trigger: sentiment/predict-trigger" controller=kafka-trigger-controller
time="2018-09-20T15:22:36Z" level=info msg="Consumer for function predict associated with trigger predict-trigger already exists, so just returning"
time="2018-09-20T15:22:36Z" level=info msg="Processed change to Kafka trigger: predict-trigger Namespace: sentiment" controller=kafka-trigger-controller
time="2018-09-20T15:22:37Z" level=info msg="Rebalanced: &{Type:rebalance start Claimed:map[] Released:map[] Current:map[]}\n"
time="2018-09-20T15:22:37Z" level=info msg="Rebalanced: &{Type:rebalance OK Claimed:map[test-topic:[3 4]] Released:map[] Current:map[test-topic:[3 4]]}\n"
time="2018-09-20T15:22:44Z" level=info msg="Rebalanced: &{Type:rebalance start Claimed:map[] Released:map[] Current:map[test-topic:[3 4]]}\n"
time="2018-09-20T15:22:44Z" level=info msg="Rebalanced: &{Type:rebalance OK Claimed:map[test-topic:[]] Released:map[test-topic:[]] Current:map[test-topic:[3 4]]}\n"

dap332 avatar Sep 20 '18 17:09 dap332

I tested it with the alpha version and I still see it maxing out at 5 qps with partitions set to 5. What is even more interesting is that when scaling the kafka-trigger-controller to 3, only one pod actively processes messages while the others rebalance upon deployment and then wait.

I have no name!@kafka-0:/$ /opt/bitnami/kafka/bin/kafka-topics.sh --zookeeper zookeeper.kubeless:2181 --describe --topic test-topic
Topic:test-topic	PartitionCount:5	ReplicationFactor:1	Configs:
	Topic: test-topic	Partition: 0	Leader: 1001	Replicas: 1001	Isr: 1001
	Topic: test-topic	Partition: 1	Leader: 1001	Replicas: 1001	Isr: 1001
	Topic: test-topic	Partition: 2	Leader: 1001	Replicas: 1001	Isr: 1001
	Topic: test-topic	Partition: 3	Leader: 1001	Replicas: 1001	Isr: 1001
	Topic: test-topic	Partition: 4	Leader: 1001	Replicas: 1001	Isr: 1001

Will test with the beta version soon

dap332 avatar Sep 20 '18 17:09 dap332

bitnami/kafka-trigger-controller:v1.0.0-beta.0, partitions = 5

[screenshot]

  1. Pod scaling: kafka-trigger-controller: 1, predict (function): 3, Kafka: 1, Zookeeper: 1. Observed QPS: 5

  2. Pod scaling: kafka-trigger-controller: 3, predict (function): 3, Kafka: 1, Zookeeper: 1. Observed QPS: 5

Similar logs from an inactive kafka-trigger-controller. When scaled up, only one pod is active; the rest seem to sit idle.

dap332 avatar Sep 21 '18 15:09 dap332

Fwiw, the kafka-trigger-controller seems to look up the function port through the K8s API for each processed message. As a workaround, I tried hard-coding the port instead of doing the lookup, and performance improved substantially: throughput now seems latency-bound in my tests.
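
A minimal sketch of the idea behind that workaround, caching the resolved port per function instead of querying the Kubernetes API for every message; getFunctionPort is a hypothetical stand-in for the controller's actual lookup:

package controller

import "sync"

// getFunctionPort stands in for the controller's per-message Kubernetes API
// lookup of the function service port (hypothetical name and signature).
func getFunctionPort(ns, fn string) (string, error) { return "8080", nil }

// portCache resolves the function port once and reuses it, instead of hitting
// the Kubernetes API on every processed message.
type portCache struct {
	mu    sync.Mutex
	ports map[string]string // "namespace/function" -> port
}

func newPortCache() *portCache {
	return &portCache{ports: make(map[string]string)}
}

func (c *portCache) get(ns, fn string) (string, error) {
	key := ns + "/" + fn
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.ports[key]; ok {
		return p, nil
	}
	p, err := getFunctionPort(ns, fn)
	if err != nil {
		return "", err
	}
	c.ports[key] = p
	return p, nil
}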

wdamron avatar Oct 25 '18 23:10 wdamron

I'm observing similar behavior with the NATS trigger, and it has the same non-parallel critical section inside the trigger controller. Really, each processed message should kick off a goroutine and immediately ACK back to the stream, since there's no failure handling right now anyway.
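
For the NATS case, a minimal sketch of that dispatch pattern using the core nats.go client; the subject name is illustrative, and the actual NATS trigger uses NATS Streaming, whose acknowledgement API differs from what is shown here:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Handle every message in its own goroutine so the subscription callback
	// returns immediately and the next message is not blocked behind it.
	if _, err := nc.Subscribe("my-subject", func(m *nats.Msg) {
		go func(data []byte) {
			// ... deliver to the function here; failures are only logged,
			// mirroring the "no failure handling" point above ...
			log.Printf("delivering %d bytes", len(data))
		}(m.Data)
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscriber running
}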

lharding avatar Feb 14 '19 22:02 lharding

Any progress on this?

yvz5 avatar Aug 05 '19 11:08 yvz5