Context deadline not respected when ring buffer is full
Hello,
We've noticed some scenarios where requests take longer to complete than the context deadline allows, for example setting a 10ms timeout but receiving responses in 1-2s. I think this happens because calls that add items to the ring buffer block when there is no free slot. The response time is essentially unbounded in this scenario, which causes problems for clients that really need to fail fast when they aren't getting responses in an adequate amount of time.
I'm wondering if you've seen this come up in practice and have put any thought into how to fix it. I don't see an easy way to fix this, given that the request goroutine waits on the condition variable when the slot is full. What do you think?
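To illustrate what I mean, here's a toy stand-in for the ring buffer (my own simplified sketch, not the actual rueidis code): once the goroutine is parked in `cond.Wait()`, the context deadline has no way to wake it up.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// boundedQueue is a toy stand-in for the ring buffer: Put blocks on a
// sync.Cond while the queue is full, and nothing about that wait can
// observe ctx.Done().
type boundedQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
	cap   int
}

func newBoundedQueue(capacity int) *boundedQueue {
	q := &boundedQueue{cap: capacity}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Put accepts a ctx but never consults it: once the queue is full, the caller
// is parked in cond.Wait() until a consumer frees a slot, however long that takes.
func (q *boundedQueue) Put(ctx context.Context, item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == q.cap {
		q.cond.Wait() // a context deadline cannot interrupt this wait
	}
	q.items = append(q.items, item)
}

func main() {
	q := newBoundedQueue(1)
	q.Put(context.Background(), "a") // fills the only slot

	// The consumer (think: the goroutine reading Redis responses) is stalled.
	go func() {
		time.Sleep(2 * time.Second)
		q.mu.Lock()
		q.items = q.items[1:]
		q.mu.Unlock()
		q.cond.Signal()
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	start := time.Now()
	q.Put(ctx, "b") // returns after ~2s, not after the 10ms deadline
	fmt.Println("Put returned after", time.Since(start))
}
```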
Hi @davestibrany-dd,
Yes, you are right. I think the scenario is rare in practice, but it can happen if Redis malfunctions and the client continues to send requests without receiving responses. Then, after 1 second, the connection will be killed by the backgroundPing goroutine since there is no PONG response.
As you pointed out, the request goroutine will wait on the condition variable if the slot is full. So I think a fix would be replacing the condition variable with an implementation that can respect the context. We probably need to reimplement the condition variable with channels.
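Roughly the direction I have in mind (just a sketch building on the toy queue above, not a concrete design): if a free slot is signaled on a channel, the waiter can select on it together with ctx.Done().

```go
// Sketch of a context-aware Put, assuming slot availability were signaled on a
// channel instead of a sync.Cond. Illustrative only; the real ring buffer has
// per-slot state and writer/reader coordination that this omits.
type chanQueue struct {
	slots chan string
}

func newChanQueue(capacity int) *chanQueue {
	return &chanQueue{slots: make(chan string, capacity)}
}

// Put gives up as soon as ctx expires instead of blocking until a slot frees up.
func (q *chanQueue) Put(ctx context.Context, item string) error {
	select {
	case q.slots <- item:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```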
Context should be respected.
But as @rueian said, the scenario is rare. Would monitoring the ring buffer be needed too?
Hi @proost, thanks for bringing back the proposal. I still have no idea what metrics we should track; however, we already have a wrCounter that is pretty close to what you described in the proposal.
https://github.com/redis/rueidis/blob/968b315f62e3d5fa20a5be94247c0e374091d373/pipe.go#L86
Starting to monitor wrCounter looks good to me, because wrCounter tells how many commands are waiting (either entering the ring or waiting for responses).
But respecting the context comes first, I think.
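Something along these lines is what I imagine, though it's purely a sketch: wrCounter is not exported today, and the PendingCommands accessor below is hypothetical.

```go
// Purely hypothetical sketch of exporting a "commands waiting" gauge.
// wrCounter is internal to pipe, and client.PendingCommands() does not exist
// today; this only illustrates the kind of metric being discussed.
pending := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Name: "rueidis_pending_commands",
	Help: "Commands entering the ring or waiting for responses.",
}, func() float64 {
	return float64(client.PendingCommands()) // hypothetical accessor over wrCounter
})
prometheus.MustRegister(pending)
```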
Hey all, I'm skeptical that this is a rare scenario. We do see the context not being respected in some of our production caches. We added a workaround that races the Rueidis call against the context deadline and have noticed a reduction in occurrences. We still see it happen, but when it does it's likely due to CPU pressure at this point. Isn't it the case that any time the ring buffer is full, requests can block on the condition variable indefinitely? I would think this could commonly happen under various conditions, not just Redis malfunctions.
I'm considering dropping the workaround we added (it's not efficient) and instead bumping the ring buffer size from 1024 to something like 4096. Is this something you'd recommend as a short-term solution? I think the only downside is the memory increase, but it's difficult to verify whether this actually fixes the problem until we're in production.
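For reference, if I understand the option correctly, the ring size comes from RingScaleEachConn (2^RingScaleEachConn entries per connection), so going from 1024 to 4096 would look something like this:

```go
// Assumption: RingScaleEachConn is the log2 of the per-connection ring size,
// so 12 gives 2^12 = 4096 slots instead of the default 2^10 = 1024.
client, err := rueidis.NewClient(rueidis.ClientOption{
	InitAddress:       []string{"127.0.0.1:6379"},
	RingScaleEachConn: 12,
})
if err != nil {
	panic(err)
}
defer client.Close()
```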
Also curious if you have a timeline for when this issue could be fixed?
Hi @davestibrany-dd,
Yes, the ring buffer blocks indefinitely when it is full, which usually happens when the Redis server is malfunctioning. I am also curious what your workload looks like that it fills the ring buffers so frequently.
If it is about CPU pressure, I think the short-term solutions could be:
- Use more CPUs or have larger GOMAXPROCS.
- Larger ring buffers. As you said that will increase memory footprint.
- Disable pipelining and use only connection pools.
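For the last option, one way to route a command through the connection pool instead of the shared pipelined connection is a dedicated client, roughly like this:

```go
// Sketch: run a command on a dedicated pooled connection instead of the
// shared auto-pipelined one. The key name is made up.
err := client.Dedicated(func(c rueidis.DedicatedClient) error {
	return c.Do(ctx, c.B().Get().Key("some-key").Build()).Error()
})
```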
Reimplementing the condition variable and integrating it with the ring buffer is complex. We would need to carefully handle the cases where some ring slots are left empty. I currently have no timeline for that.
But could you share your current workaround that reduces the occurrences?
> I am also curious what your workload looks like that it fills the ring buffers so frequently
We basically do caching-as-a-service internally for our company, so we have a really wide variety of use cases. I think it's likely CPU pressure.
> could you share your current workaround that reduces the occurrences?
The workaround is to race the Rueidis call against the context deadline that the caller provided, so all of our Rueidis calls are wrapped with this doAsync function:
```go
// doAsync races fn against the caller's context so that a call which blocks
// inside rueidis (e.g. on a full ring buffer) does not hold up the caller past
// its deadline. Note the inner goroutine keeps running until fn returns.
func doAsync[T any](ctx context.Context, enabled bool, fn func() (*T, error)) (*T, error) {
	if !enabled {
		return fn()
	}
	redisDoneCh := make(chan struct{})
	var reply *T
	var err error
	go func() {
		defer close(redisDoneCh)
		reply, err = fn()
	}()
	select {
	case <-redisDoneCh:
		return reply, err
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```
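A call site then ends up looking roughly like this (the key and command are made up for illustration):

```go
// Hypothetical call site: wrapping a rueidis GET so the caller's deadline is
// honored even when the pipeline blocks.
val, err := doAsync(ctx, true, func() (*string, error) {
	s, err := client.Do(ctx, client.B().Get().Key("user:42").Build()).ToString()
	if err != nil {
		return nil, err
	}
	return &s, nil
})
```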
But it's not a great solution: it adds extra overhead in the hot path, so I need to keep it disabled for some high-throughput, resource-constrained users.
> Reimplementing the condition variable and integrating it with the ring buffer is complex
Yeah, I can totally see that it's not a small lift. One thing that would be really useful is stats on ring buffer usage. Our team would likely have capacity to contribute that over the next two months or so, which would help us better understand the impact of the issue.
@davestibrany-dd rueian and I added a new queue in a PR; you can use it once rueian releases it.
@rueian Is there something else remaining to be done here?
> @rueian Is there something else remaining to be done here?
After https://github.com/redis/rueidis/pull/896, we have a new queue implementation that respects contexts. I wish we could also make our old implementation respect context, even though it is challenging.