openwhisk-package-kafka

requeue messages to kafka if subject is throttled

Open rabbah opened this issue 7 years ago • 10 comments

The provider today will retry POST requests when the subject is throttled. This is limited, however (6 retries, on the order of a few minutes), after which the message is dropped if it was not posted successfully.

The retry period is too short. Moreover, it is generally better to requeue the messages than to drop them entirely. There is already no ordering guarantee, so requeuing is no worse in that regard, and it would also naturally extend the retries over longer periods.
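
For reference, this is roughly the shape of the bounded retry described above, as a sketch rather than the provider's actual code; the delay values and the fire_trigger() helper are assumptions:

```python
# Sketch only (not the provider's implementation): a bounded retry loop
# after which the message is dropped. fire_trigger() is a hypothetical
# helper that returns True when the trigger POST succeeds.
import time

MAX_RETRIES = 6          # per the description above
BASE_DELAY_SECONDS = 1   # assumed starting point for exponential backoff

def post_with_retries(message, fire_trigger):
    delay = BASE_DELAY_SECONDS
    for _ in range(MAX_RETRIES):
        if fire_trigger(message):
            return True
        time.sleep(delay)  # 1+2+4+8+16+32 = 63s total with these values
        delay *= 2
    return False           # message is dropped after the last attempt
```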

rabbah avatar Dec 15 '17 18:12 rabbah

@rabbah Could you elaborate on what you mean by requeue the message? It sounds like you are suggesting that we produce the message back to kafka, but this could have all kinds of unintended consequences - keeping in mind the kafka instance being consumed is owned/operated by the trigger owner, not by the feed provider.

If instead you mean to requeue firing the trigger until a later time, I'm not sure how this is better than simply extending the retry period. In this sense, I assume you mean that the provider should attempt to consume the next message in the queue and try to fire a trigger for that message instead, eventually coming back to the original "requeued" message at a later time. However, if the trigger URL is not responding correctly, I don't think that moving on to the next message in the topic will have any better results, as the problem won't be the content of the trigger payload. Even if the problem were the payload content, coming back to fire that message again still wouldn't work as the content would not have changed.

Of course, there is always the possibility that I completely misunderstood your suggestion, and you intend something different from the above two options :smile:

jberstler avatar Dec 19 '17 14:12 jberstler

The current implementation doesn't persist a message that's undergoing a retry. If we extend this to, say, 24 hours, the message has to be persisted; otherwise the provider risks losing it.

Since Kafka already addresses persistence, one can imagine enqueuing onto a dedicated topic that is drained first, before pulling from the normal topic.
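
For illustration only, a hedged sketch of that dedicated-topic idea; the topic names, broker address, and the existence of a provider-writable topic are all assumptions:

```python
# Hypothetical sketch of draining a dedicated retry topic before the
# normal topic. 'trigger-retries', 'user-topic', and the broker address
# are made-up names; no such provider-owned topic exists today.
from kafka import KafkaConsumer

retry_consumer = KafkaConsumer('trigger-retries',
                               bootstrap_servers='localhost:9092',
                               enable_auto_commit=False)
main_consumer = KafkaConsumer('user-topic',
                              bootstrap_servers='localhost:9092',
                              enable_auto_commit=False)

def next_batch():
    # Give requeued messages priority: only read from the normal topic
    # when the retry topic has nothing pending.
    batch = retry_consumer.poll(timeout_ms=500)
    if batch:
        return retry_consumer, batch
    return main_consumer, main_consumer.poll(timeout_ms=1000)
```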

rabbah avatar Dec 19 '17 14:12 rabbah

I'd prefer to invest engineering resources in proper queuing in the next phase of the message/event pipeline: the trigger fire should post the message to the OpenWhisk Kafka bus, and the controller should be enhanced to pick up events from this "triggers" topic. The controller would then pick up work at its own pace, and the events/messages would be persisted in the OpenWhisk Kafka instance.

Today there is no Kafka instance for the triggers provider; all instances are user-provided, and we don't have the ability to write to their instances, nor should we. Customers/users will not accept the cost and overhead of an extra topic just for queuing messages they already produce, simply because OpenWhisk has problems handling them in a timely manner.

csantanapr avatar Dec 19 '17 15:12 csantanapr

Won't we end up re-engineering this provider? In other words, we should funnel all events through the Kafka provider and have a single place for dealing with backoff/retry policies, etc. Whether we then call this Kafka provider an intrinsic part of the core or a separate microservice is something we can debate separately.

rabbah avatar Dec 19 '17 15:12 rabbah

I still don't get what dedicated "topic" you're speaking about. This provider doesn't have a Kafka instance. Are you referring to a single dedicated "triggers" topic on OpenWhisk's main Kafka instance, rather than a customer-provided Kafka instance?

csantanapr avatar Dec 19 '17 15:12 csantanapr

Wait a second: can't the provider already use the incoming topic as persistence? Meaning, if the retry doesn't succeed in a timely manner, we can choose not to drop the messages and simply not commit the offset; the messages will sit there in the topic, and the provider can retry consuming them later, say after 5 minutes. This assumes the Kafka instance is configured with a message time-to-live long enough to survive a few rounds of retries.
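
For illustration, a minimal sketch of this "don't commit the offset" pattern with the kafka-python client; the topic name, broker address, and fire_trigger() helper are assumptions, not the provider's actual code:

```python
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'user-topic',                         # the user's existing topic
    bootstrap_servers='user-kafka:9092',  # placeholder broker address
    enable_auto_commit=False,             # we decide when the offset advances
    auto_offset_reset='earliest',
)

def fire_trigger(records):
    """Hypothetical: POST the batch to the trigger URL; return False on 429."""
    ...

while True:
    batch = consumer.poll(timeout_ms=1000)  # {TopicPartition: [ConsumerRecord]}
    if not batch:
        continue
    records = [r for recs in batch.values() for r in recs]
    if fire_trigger(records):
        consumer.commit()                   # success: move past this batch
    else:
        # Throttled: don't commit. Rewind each partition to the first
        # unprocessed record so the same batch comes back on the next poll,
        # then back off. The messages survive only as long as the topic's
        # retention allows.
        for tp, recs in batch.items():
            consumer.seek(tp, recs[0].offset)
        time.sleep(300)                     # e.g. retry after 5 minutes
```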

csantanapr avatar Dec 19 '17 15:12 csantanapr

What @csantanapr said is correct. There wouldn't really be a need to requeue records; that data would be persisted long enough to handle any throttling. Furthermore, a consumer is reading from exactly one topic and firing exactly one trigger, so if that trigger encounters 429 responses and we requeue, the next trigger fire for the next batch of messages will likely also return a 429. We would not see any difference between requeuing and doing a no-op (i.e. not committing offsets) and polling again.

abaruni avatar Jan 31 '18 03:01 abaruni

I think we should adjust not only the number of retries we're attempting, but possibly our strategy as well. So, some questions:

How many retries?

How long, in total, should we keep retrying for? i.e. is there a reasonable number that might overcome the limits in place, in most cases?

Is the current backoff strategy appropriate? We use a simple exponential backoff. Would increasing the number of retries help? Should we move from a simple backoff to some other interval, e.g. wait 15 sec, then 30, 60, 120, etc.?

Should we add jitter to randomize the backoff?
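
Purely for illustration, a sketch of one possible schedule along these lines; none of the numbers below are decided, they are assumptions for the example:

```python
# Capped exponential backoff with optional full jitter, roughly matching
# the 15/30/60/120... intervals mentioned above. Values are assumptions.
import random

def backoff_delays(base=15, factor=2, cap=300, max_retries=10, jitter=True):
    """Yield wait times in seconds, doubling up to `cap`, each optionally
    randomized between 0 and the computed delay (full jitter)."""
    delay = base
    for _ in range(max_retries):
        yield random.uniform(0, delay) if jitter else delay
        delay = min(delay * factor, cap)

# Worst-case retry window without jitter: 15+30+60+120+240 + 5*300 = 1965 s
print(sum(backoff_delays(jitter=False)))   # ~33 minutes across 10 retries
```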

abaruni avatar Jan 31 '18 03:01 abaruni

@csantanapr @rabbah @jberstler any thoughts?

abaruni avatar Jan 31 '18 03:01 abaruni

Like @csantanapr, I find it a little hard to believe we should expect customers to set up a separate topic for us, as well as allow us write access to it, in order to handle the case where OpenWhisk cannot (or will not) handle the rate at which they are producing messages. This is not to mention asking them to configure their triggers to make use of this new topic, which adds even more configuration burden and potential confusion when creating the trigger.

Like @abaruni, I think the matter could be much more simply handled by allowing a longer retry period - possibly configurable by the trigger creator. Because the message contents are kept in memory, this retry period could, at least in theory, extend even beyond the customer's own Kafka retention policy - so long as the provider instance didn't lose the contents of its memory (say, through a reboot). I'm not sure I see a benefit to adjusting the exponential nature of the retry behavior. Once you start waiting > 60 seconds between retries, I don't think the exact value of how long you wait is really all that significant.

I also think it is reasonable that, if the customer wants to ensure no messages are lost due to throttling, they increase their cluster's message retention period in addition to configuring the new retry timeout parameter on their trigger.

jberstler avatar Feb 02 '18 14:02 jberstler