Change partition_by to return a random partition instead of a nil value for a parition_by function
This fixes an issue we were facing where the partition_key being nil meant that since
in_partition?(%Subscriber{partition_key: nil}, _partition_key), do: false
Then any events with a nil partition_key would exhaust the available subscribers - draining them from other queues - and especially in our case where 99% of events were going to the nil queue - meaning this queue always get priority vs other queues in terms of consuming events. This happened due to in EventStore.Subscriptions.SubscriptionFsm.next_available_subscriber/2 this never resolving into an existing queue:
partition_subscriber =
Enum.find(subscribers, fn {_pid, subscriber} ->
Subscriber.in_partition?(subscriber, partition_key)
end)
which meant that events for the nil queue would always grab a new Subscriber instead of grouping up with an existing one already processing a nil queue event. This exhausted the subscribers and prevented other queues with less events from being processed.
This fix prevents the issue by replacing nil partition_keys with a random partition based on the max_size that will allow the event queues to be processed with equal priority when partition_by is nil or returns nil.
This nil behavior was kept for when partition_by is nil since it optimizes the parallelism of the "default" behavior when no partition_by is provided.
Sorry, need to fix the tests
I updated the changes a bit and fixed the tests
Hello @cblage, thanks for putting the work in here.
I see the problem, where we need to be able to identify that we'd like the default partition strategy to be used, and then a custom strategy that may return nil and thus make very uneven distributions.
I'm worried that returning a random partition might be surprising behaviour in the case of a nil.
Is there a reason you wouldn't handle this in your application code? You would opt into the fallback behaviour that makes the most sense for you.
I agree that this is a bit of a foot-gun here though. Perhaps we should raise an error if partition_by/1 returns nil? This would give the developer the chance to catch this issue early and address it before going to production.
WDYT?
Hi @drteeth!
Throwing an error when partition_by returns nil might be a better approach. We dealt with the problem on the application side, but it was difficult to debug why certain events were not getting processed in time - so we decided to contribute with this patch to prevent others from having the same problem.