eventstore icon indicating copy to clipboard operation
eventstore copied to clipboard

Change partition_by to return a random partition instead of a nil value for a parition_by function

Open cblage opened this issue 10 months ago • 4 comments

This fixes an issue we were facing where the partition_key being nil meant that since

in_partition?(%Subscriber{partition_key: nil}, _partition_key), do: false

Then any events with a nil partition_key would exhaust the available subscribers - draining them from other queues - and especially in our case where 99% of events were going to the nil queue - meaning this queue always get priority vs other queues in terms of consuming events. This happened due to in EventStore.Subscriptions.SubscriptionFsm.next_available_subscriber/2 this never resolving into an existing queue:

 partition_subscriber =
      Enum.find(subscribers, fn {_pid, subscriber} ->
        Subscriber.in_partition?(subscriber, partition_key)
      end)

which meant that events for the nil queue would always grab a new Subscriber instead of grouping up with an existing one already processing a nil queue event. This exhausted the subscribers and prevented other queues with less events from being processed.

This fix prevents the issue by replacing nil partition_keys with a random partition based on the max_size that will allow the event queues to be processed with equal priority when partition_by is nil or returns nil.

This nil behavior was kept for when partition_by is nil since it optimizes the parallelism of the "default" behavior when no partition_by is provided.

cblage avatar Jan 31 '25 21:01 cblage

Sorry, need to fix the tests

cblage avatar Feb 01 '25 00:02 cblage

I updated the changes a bit and fixed the tests

cblage avatar Feb 03 '25 19:02 cblage

Hello @cblage, thanks for putting the work in here.

I see the problem, where we need to be able to identify that we'd like the default partition strategy to be used, and then a custom strategy that may return nil and thus make very uneven distributions.

I'm worried that returning a random partition might be surprising behaviour in the case of a nil.

Is there a reason you wouldn't handle this in your application code? You would opt into the fallback behaviour that makes the most sense for you.

I agree that this is a bit of a foot-gun here though. Perhaps we should raise an error if partition_by/1 returns nil? This would give the developer the chance to catch this issue early and address it before going to production.

WDYT?

drteeth avatar Feb 11 '25 18:02 drteeth

Hi @drteeth!

Throwing an error when partition_by returns nil might be a better approach. We dealt with the problem on the application side, but it was difficult to debug why certain events were not getting processed in time - so we decided to contribute with this patch to prevent others from having the same problem.

cblage avatar Feb 11 '25 18:02 cblage