V4 Job Completion Duration Performance Hit
Hi,
just wanted to share some data on the impact we've measured from the new notification system on job completion duration:
We're observing roughly a 4x increase at p90 (from 50ms to 200ms), with occasional spikes to 500ms.
This is with a throughput of about 1 job inserted per second and 1 job executed per second.
Big fan of your work :+1: - this isn't really a dealbreaker, but I'm curious to hear about your internal discussions around the performance tradeoffs of this new notification system. Was the previous NOTIFY implementation so penalizing at a certain scale that you felt you had to make this change?
Thanks
Hi, thanks for reporting this! Can you provide some additional info to help us track down the source of this regression?
- Which version of River were you on previously?
- How many nodes (Clients) are you using to work jobs on this queue? How many clients are inserting jobs?
- Can you confirm what duration is being measured here / how it is being measured? Is this measuring the time difference between scheduled_at and completed_at for a job that is inserted as available to be worked immediately?
- What settings are you putting on your River client configs? Any customization of poll intervals or anything like that?
There are a few reasons behind the change. One of them is that, yes, there is a lot of overhead for a high-throughput system in doing a NOTIFY on every insert. Another is that there were some mistakes / shortcomings in the original implementation that prevented a fully functional setup with multiple clients running on totally isolated Postgres schemas: notification channels are a global namespace across all schemas & search paths, but our notifications weren't properly namespaced. We wanted to fix this as well.
I have a hunch that the specific number of clients and insert rate in this example is basically at odds with the chosen default debounce timer in the app level notifier. Hopefully it’s something we can trivially improve by tuning a couple of internal settings or worst case by exposing a new setting.
Hi,
- We were on v0.0.20 prior to this.
- We have a single consumer that consumes from ~5 queues, and the majority of our jobs trigger a new job insertion within the same transaction.
- Another client is producer-only, but produces at a much lower rate.
- The duration is computed using River's event.JobStats.CompleteDuration.Seconds() (a simplified sketch of how we record this follows, after the configs below).
- The consumer/producer uses the following config:
riverConfig := &river.Config{
    JobTimeout:           -1,
    RescueStuckJobsAfter: 5 * time.Minute,
    Queues:               queueNameToQueueConfig,
    Workers:              workers,
    ErrorHandler:         s,
}
- The producer-only client uses a blank config:
riverConfig := &river.Config{}
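For context, the measurement is wired up roughly like this (simplified sketch; riverClient is our already-started client and recordHistogram stands in for our metrics wrapper):

// Simplified sketch of how we record the stat. recordHistogram is a
// placeholder for our metrics wrapper, not a River API.
func observeCompleteDurations(riverClient *river.Client[pgx.Tx]) func() {
    subscribeChan, cancel := riverClient.Subscribe(river.EventKindJobCompleted)
    go func() {
        for event := range subscribeChan {
            // Record the per-job CompleteDuration from the event's JobStats.
            recordHistogram("river_complete_duration_seconds",
                event.JobStats.CompleteDuration.Seconds())
        }
    }()
    return cancel
}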
@malonaz ah, I may have misinterpreted this report initially. This CompleteDuration stat should have nothing to do with the changes to the LISTEN/NOTIFY mechanism. This stat is measuring the time between when the worker's Work() function finishes executing, and when River finishes marking the job as completed in the database.
Instead of being the result of #301 and the v4 migration, I suspect you are seeing this stat increase as a result of the changes in #258 to introduce a batching async job completer. This change has the effect of tremendously increasing total throughput with the tradeoff of some additional latency, because completed jobs are batched up in memory in the client to be marked completed in groups. You can see the core of how it works here https://github.com/riverqueue/river/blob/30a97ff9f18817e531bd5bd5fc305d00987dd5d0/internal/jobcompleter/job_completer.go#L160-L266
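As a rough mental model only (the linked code is the source of truth and handles a lot more), the idea is to accumulate finished jobs in memory and finalize them in a single statement; the query below is illustrative rather than the exact one River runs:

// Heavily simplified sketch of the batching idea, not the linked implementation.
type batchCompleter struct {
    mu      sync.Mutex
    pending []int64 // IDs of jobs whose Work() has already returned
}

func (c *batchCompleter) add(jobID int64) {
    c.mu.Lock()
    c.pending = append(c.pending, jobID)
    c.mu.Unlock()
}

func (c *batchCompleter) flush(ctx context.Context, db *pgxpool.Pool) error {
    c.mu.Lock()
    batch := c.pending
    c.pending = nil
    c.mu.Unlock()
    if len(batch) == 0 {
        return nil
    }
    // One round trip for the whole batch instead of one UPDATE per job; this is
    // where the throughput win comes from, at the cost of completed jobs sitting
    // in memory for a short while before being finalized.
    _, err := db.Exec(ctx,
        `UPDATE river_job SET state = 'completed', finalized_at = now() WHERE id = ANY($1)`,
        batch)
    return err
}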
As of now, there are no user-facing knobs for tuning this behavior. We even still have the old async completer in code, as well as an inline one (the latter has much worse throughput). There is no way to activate these as a user though, and no way to customize the thresholds for batching.
I wonder if @brandur has thoughts on whether this much of an increase is expected, and whether/how we might want to allow users to customize the behavior here?
@malonaz Could you add some more detail on the throughput rate for these queues? Additionally, I want to ask about this bit:
we have a single consumer that consumes from ~5 queues, and the majority of our jobs trigger a new job insertion within the same transaction.
The phrase "the same transaction" struck me once I realized you were measuring using the CompleteDuration stat. Jobs are not worked within a transaction and their completion is not tied to any transaction that occurs within the job itself. However, River does offer a transactional job completion feature that allows you to make other database state changes alongside the job being marked as completed, ensuring that all those changes happen atomically. Are you making use of that feature here? If so, the increase in this measurement won't actually be impacting you at all, at least not for the jobs where you are completing them within Work().
I wonder if @brandur has thoughts on whether this much of an increase is expected, and whether/how we might want to allow users to customize the behavior here?
Yeah, this is a little nuanced, but although the batch completer's ticker fires every 50 ms, it'll only complete a batch every 50 ms if a minimum batch size has accumulated during that time. Otherwise it waits until a tick that's a multiple of 5 to complete a batch. So you can expect up to a 5 * 50 = 250 ms delay.
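In simplified form, the flush decision looks something like this (the 50 ms tick and the multiple-of-5 behavior match what's described above; the minimum batch size here is a placeholder, not the real default):

// Simplified illustration of the flush heuristic, not the actual implementation.
func runCompleterLoop(incoming <-chan int64, flush func(batch []int64)) {
    const (
        tickInterval  = 50 * time.Millisecond
        minBatchSize  = 100 // placeholder threshold
        ticksPerForce = 5   // force a flush at least every 5th tick => up to ~250 ms delay
    )

    var pending []int64
    ticker := time.NewTicker(tickInterval)
    defer ticker.Stop()

    tick := 0
    for {
        select {
        case id := <-incoming:
            pending = append(pending, id)
        case <-ticker.C:
            tick++
            if len(pending) == 0 {
                continue
            }
            if len(pending) >= minBatchSize || tick%ticksPerForce == 0 {
                flush(pending)
                pending = nil
            }
        }
    }
}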
Let me caveat that though to say that the delay doesn't actually matter — the job still finished in the same amount of time, and it won't be reworked (unless there's a hard crash). It's just that setting it to fully completed in the database might take a little bit longer.
I'd hold off on further customization unless we find a good reason for such. The defaults should be pretty good for everyone.
In terms of measuring statistics, it might make more sense to observe QueueWaitDuration + RunDuration as a more meaningful number for how long jobs are taking to run (as opposed to CompleteDuration).
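With the event subscription you're already using, that would look something like this (sketch; recordHistogram is again a placeholder metrics helper):

// Sketch: time from a job becoming available to its Work() finishing, using
// the same JobStats payload you're already reading.
for event := range subscribeChan {
    serviceTime := event.JobStats.QueueWaitDuration + event.JobStats.RunDuration
    recordHistogram("river_job_service_seconds", serviceTime.Seconds())
}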
@malonaz haven't heard anything back on whether or not you still believe there's an issue here, though I think it's mostly a matter of how the durations were being measured. Please let us know if you think otherwise! ❤️