azure-event-hubs-go

Handling network partitions

Open ragoragino opened this issue 6 years ago • 3 comments

I wanted to ask how one should handle network partitions when using EPH (Event Processor Host). When running an EventHub peer through EPH, the Start method calls the scheduler's Run method, which runs forever and retries any lost connections.

However, in a production environment, if our services happen to become misconfigured (and lose network connectivity), we would love to be able to let them fail as quickly as possible. Otherwise, we run the risk of having a misconfigured environment, with events accumulating in the EventHub, no reliable receivers, and no way of knowing that there is a problem in the cluster.

In other publisher-subscriber systems (e.g. NATS), we have worked with callback handlers that fire on disconnect events and the like. Do you think it would be reasonable to implement something like this in EPH, or do you think EPH should handle such circumstances differently?

Thanks.

ragoragino avatar Jul 03 '19 08:07 ragoragino

Awesome question!

Let me preface this by saying that I don't particularly like EPH, though it does help folks who are looking for a relatively simple solution to balancing work across consumers dynamically.

Callback handlers are an option. Perhaps they would be optionally passed into the EPH constructor. Is that how you'd really want to interact with EPH?

Though, do you really want to use EPH at this point? Perhaps, it would be easier to subscribe to your partitions directly and have a subscription not try to recover from an error.

In the latter scenario, I'm thinking of having a stateful service in K8s with as many instances as Event Hub partitions. Each instance would store its partition's checkpoint entry in a persistent volume and use its instance number within the stateful service as the partition id to listen on. That way, you always have 1 instance handling 1 partition. When an instance fails, the listener should just die. You see the logs and the orchestrator cleans up and restarts.
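To make the instance-number-as-partition-id idea concrete, here is a minimal sketch. It assumes the pod hostname follows the StatefulSet pattern `<name>-<ordinal>` and that partition ids are the strings "0" through N-1. The function name `partitionIDFromHostname` and the partition count of 4 are made up for illustration and are not part of azure-event-hubs-go.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// partitionIDFromHostname derives an Event Hub partition id from a
// StatefulSet pod hostname such as "consumer-3" (ordinal 3 -> partition "3").
// It returns an error when the ordinal is missing or outside [0, partitionCount).
func partitionIDFromHostname(hostname string, partitionCount int) (string, error) {
	idx := strings.LastIndex(hostname, "-")
	if idx < 0 || idx == len(hostname)-1 {
		return "", fmt.Errorf("hostname %q has no ordinal suffix", hostname)
	}
	ordinal, err := strconv.Atoi(hostname[idx+1:])
	if err != nil {
		return "", fmt.Errorf("hostname %q: %w", hostname, err)
	}
	if ordinal < 0 || ordinal >= partitionCount {
		return "", fmt.Errorf("ordinal %d outside partition range [0,%d)", ordinal, partitionCount)
	}
	return strconv.Itoa(ordinal), nil
}

func main() {
	hostname, err := os.Hostname()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	const partitionCount = 4 // assumed value; match your Event Hub
	id, err := partitionIDFromHostname(hostname, partitionCount)
	if err != nil {
		// Die instead of retrying so the orchestrator restarts the pod.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("listening on partition", id)
}
```

The instance exits with a non-zero status on any mismatch rather than trying to recover, which is the "listener should just die" behavior described above.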

Thoughts?

devigned avatar Jul 08 '19 20:07 devigned

Thanks for the answer!

I consider the ability to dynamically load-balance events based on the current CPU/memory usage quite useful. Even though that is still possible with your example, it is achieved with a level of indirection through a StatefulService (which, with a 1:1 relationship between partition and instance, could also increase our CPU/memory requests if we wanted to use them). So what EPH does, in my opinion, is hide this layer of indirection. And why don't you consider EPH useful?

And what did you mean by "when an instance fails"? We were more concerned about the case where we somehow stop receiving messages from EventHub due to a network partition, not the case where the instance itself fails. In the example you provided, if I understand it correctly, an instance cut off from EventHub by a network partition still wouldn't be able to detect any misconfiguration (and events would keep accumulating in that particular EventHub partition).

In my opinion, in an ideal scenario, EPH would ping EventHub in a specified interval, and after a pre-specified number of failures, a provided callback (that could be given in a constructor) would be called. That could be implemented not only for EventHub connection itself but maybe also for other network dependencies, like leases and checkpoints.
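A rough sketch of what I mean, using only the standard library. Nothing here is an existing EPH API: `PingFunc`, `Monitor`, and `simulateOutage` are hypothetical names, and what the ping actually does (an Event Hub call, a lease read, a checkpoint write) is left to the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// PingFunc probes one dependency (Event Hub, lease store, checkpoint store).
type PingFunc func(ctx context.Context) error

// Monitor calls ping once per interval and invokes onFailure after
// maxFailures consecutive errors, then returns true. A successful ping
// resets the failure count. It returns false if ctx is cancelled first.
func Monitor(ctx context.Context, ping PingFunc, interval time.Duration,
	maxFailures int, onFailure func(lastErr error)) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	failures := 0
	for {
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
			if err := ping(ctx); err != nil {
				failures++
				if failures >= maxFailures {
					onFailure(err)
					return true
				}
				continue
			}
			failures = 0
		}
	}
}

// simulateOutage runs Monitor against a dependency that is always
// unreachable and reports whether the callback fired and how many pings ran.
func simulateOutage(maxFailures int) (fired bool, pings int) {
	ping := func(ctx context.Context) error {
		pings++
		return errors.New("connection refused")
	}
	fired = Monitor(context.Background(), ping, 5*time.Millisecond, maxFailures,
		func(err error) { fmt.Println("giving up:", err) })
	return fired, pings
}

func main() {
	fired, pings := simulateOutage(3)
	fmt.Println("callback fired:", fired, "after pings:", pings)
}
```

In a real service the callback would probably log and exit, so the orchestrator sees the failure instead of the process retrying silently.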

ragoragino avatar Jul 09 '19 14:07 ragoragino

I do consider the feature set of EPH to be useful, but I would like to implement it differently than it is currently implemented. The default implementation uses Azure Storage to durably persist checkpoints for a partition consumer. I would rather build a ring-like implementation where the consumers form their own quorum and coordinate lease acquisition and checkpoint persistence among themselves.

"when an instance fails" means to me that the partition consumer has run into an error it cannot handle or recover from (a network partition, a period of time receiving a 500 from the service, etc.). I think this is equivalent to "somehow stop receiving messages".

"In my opinion, in an ideal scenario, EPH would ping EventHub in a specified interval, and after a pre-specified number of failures, a provided callback (that could be given in a constructor) would be called. That could be implemented not only for EventHub connection itself but maybe also for other network dependencies, like leases and checkpoints."

👍

Would it also be helpful to produce OpenCensus / OpenTracing metrics, so that the number of messages being consumed is emitted as a metric to an observation / alerting service such as Azure Application Insights or https://grafana.com? At this point, we are only outputting trace spans. I was debating the idea of adding metrics.
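As a stand-in for what such a metric might look like, here is a small sketch that counts consumed events per partition. It uses Go's standard `expvar` package rather than OpenCensus, purely to stay dependency-free. The metric name `eventhub_events_consumed` and the helpers `recordEvent` and `consumedCount` are hypothetical and are not part of the library.

```go
package main

import (
	"expvar"
	"fmt"
)

// consumed counts handled events per partition id. expvar also exposes it
// as JSON at /debug/vars when the default HTTP mux is being served.
var consumed = expvar.NewMap("eventhub_events_consumed")

// recordEvent would be called from the event handler after each event.
func recordEvent(partitionID string) {
	consumed.Add(partitionID, 1)
}

// consumedCount reads back the counter for one partition (0 if never seen).
func consumedCount(partitionID string) int64 {
	v, ok := consumed.Get(partitionID).(*expvar.Int)
	if !ok {
		return 0
	}
	return v.Value()
}

func main() {
	for i := 0; i < 3; i++ {
		recordEvent("0")
	}
	recordEvent("1")
	fmt.Println(consumed.String())
}
```

An alert on a counter like this staying flat for too long would surface the "events accumulating with no receivers" case even when the process itself stays up.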

devigned avatar Jul 11 '19 16:07 devigned