async-io
fix: missing waker.wake
My analysis of the problem is that using the current Reactor::ticker value in ticks is not always correct. It is my understanding that when placing a ticker value into Direction::ticks there is a guarantee that the waker.wake call has completed and the associated task has been run. The problem is that the increment of the ticker value in ReactorLock::react typically runs on a different CPU thread than Source::poll_ready (called by poll_readable, poll_writable), which saves the ticker value in ticks:
state[dir].ticks = Some((Reactor::get().ticker(), state[dir].tick));
Because the increment and the use of the ticker value happen on different threads, the required guarantee cannot be fulfilled. I'm proposing the following fix in this PR, which uses only the current state[dir].tick and not the ticker value:
state[dir].ticks = Some((state[dir].tick, 0));
fix #78
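To make the bookkeeping concrete, here is a deliberately simplified, single-threaded model of the check (the field and function names are my own and the real logic in reactor.rs is more involved; this only illustrates how a ticker value read ahead of the event delivery can suppress readiness):

```rust
// Simplified model of the per-direction tick bookkeeping (hypothetical names).
struct DirState {
    /// Bumped each time the reactor delivers an event for this direction.
    tick: usize,
    /// Remembered at poll time; roughly (expected reactor tick, tick at poll time).
    ticks: Option<(usize, usize)>,
}

/// A later poll treats the source as ready only if the current tick
/// differs from both remembered values.
fn is_ready(state: &DirState) -> bool {
    match state.ticks {
        Some((a, b)) => state.tick != a && state.tick != b,
        None => true,
    }
}

/// Current behavior: remember the global ticker value read at poll time.
fn poll_with_ticker(state: &mut DirState, global_ticker: usize) {
    state.ticks = Some((global_ticker, state.tick));
}

/// Proposed fix: remember only the source's own tick.
fn poll_with_tick_only(state: &mut DirState) {
    state.ticks = Some((state.tick, 0));
}

fn main() {
    // Suppose the reactor thread has already advanced the ticker to 6 when
    // poll_ready reads it, and then delivers the event for tick 6.
    let mut racy = DirState { tick: 5, ticks: None };
    poll_with_ticker(&mut racy, 6);
    racy.tick = 6; // event delivered
    assert!(!is_ready(&racy)); // event mistaken for an already-seen tick: missed wake

    let mut fixed = DirState { tick: 5, ticks: None };
    poll_with_tick_only(&mut fixed);
    fixed.tick = 6; // event delivered
    assert!(is_ready(&fixed)); // readiness is observed
    println!("ok");
}
```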
@taiki-e any chance you can take a look at this?
@dignifiedquire, would it be better to discuss this on issue #78, as there may be better solutions and/or other related issues?
Sorry, I don't have time to investigate this problem at this time.
r? @smol-rs/admins
This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which bothers me because the outer code should call it again if the expected events didn't occur. What also irritates me a bit is that in the case where the tick "expired" (an event was delivered for a newer, non-anticipated tick) we don't call the waker (even though it looks like we might then call the waker either not at all or too late), yet we do call it when we replace it, which seems a bit odd... but calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.
I might be mistaken, but it seems you're saying this change will cause additional missed wakes. But in the particular case I've seen, it fixes a missed wake. There certainly may be better ways to fix this, but it is my belief this decreases the likelihood of missed wakes.
Theoretically, the problem shouldn't happen, because Reactor::get().ticker() uses SeqCst, which enforces a global ordering across threads for that value, and because self.state is locked (it's a Mutex) at that time, which means [dir].tick can't race, either. A call to the waker only schedules the task, which means there is no guarantee when the associated task itself gets polled, or in which order; only that the Poll::Pending will eventually bubble up and cause the futures executor/queue to get processed. The problem with this "fix" is that it basically removes an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(()))), and probably hides the original source of the error, which I would suspect to be a missing waker call somewhere else, or a duplicate readiness check (which would usually cause a missing or delayed wake call, because it consumes a ready event, and might cause a deadlock). This "fix" replaces the tick optimization with much more polling, afaik, because every second poll of the containing Future would cause a ready event when ticks is set (that is, after a poll_ready call) and tick was updated (the source received an event). I suspect this might cause additional spurious wakeups (because it would also count events delivered by the previous reactor tick).
Also, even if this PR fixes the problem, or at least its symptoms (with the downside of probably reduced performance), it basically obscures the code, because ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead be replaced by ticks: Option<usize>, for symmetry, at the two locations in that file where this pattern is used.
Don't let this discourage you, I'm mostly just trying to understand where this problem stems from, and how to keep the code legible.
In my initial comment on this PR I say:
It is my understanding that when placing a ticker value into Direction::ticks there is a guarantee that the waker.wake call has completed and the associated task has been run.
Is this a valid statement or am I mistaken?
when placing a ticker value into Direction::ticks there is a guarantee that the waker.wake call has completed
afaik correct (but I'm not really familiar with this code base, and have only looked at reactor.rs)
and the associated task has been run.
This doesn't hold (⇒ afaik is generally not true), because in many cases calling the waker only leads to "marking" of a task as to-be-scheduled in the associated reactor, which often also involves a queue, and thus gives no guarantee when (or if at all!) the future-of-interest will be polled again (because of "any-future-of" constructs and such). That is, even if the task gets rescheduled, the future-of-interest might get discarded if an alternative future is polled earlier in that task-execution. (This is sort-of theoretical in this case, because that shouldn't cause dead-locks or delays). And we usually don't know when the (now assumed to be scheduled) task gets polled again.
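A minimal sketch of that point, using the stable std::task::Wake trait (the MarkOnly type is hypothetical, not anything from async-io): calling wake() typically just marks the task as schedulable, e.g. by pushing it onto a run queue; nothing is polled at the call site.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

// Illustrative stand-in for an executor's task handle: wake() only sets a
// flag, standing in for "push this task onto the run queue".
struct MarkOnly {
    scheduled: AtomicBool,
}

impl Wake for MarkOnly {
    fn wake(self: Arc<Self>) {
        self.scheduled.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let task = Arc::new(MarkOnly { scheduled: AtomicBool::new(false) });
    let waker = Waker::from(task.clone());

    waker.wake_by_ref();

    // The task is merely flagged as runnable; when (or whether) the
    // future of interest is polled again is entirely up to the executor.
    assert!(task.scheduled.load(Ordering::SeqCst));
    println!("woken, but nothing was polled");
}
```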
I agree with you: the code has been queued, but there is no guarantee it ran. So the act of sending the wake event only says that at some point in the future the other thread will run, but on the async-io side there is, as you say:
"an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok())),"
That "optimization" is the cause of the bug, as now a Poll::Ready(Ok(())) will not be returned when it should have been.
I believe you probably are correct when you say:
"the ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead also be replaced by ticks: Option<usize>, for symmetry: at the two locations in that file where this pattern is used."
I will remove the second usize if desired.
I will remove the second usize if desired.
Please do that.
@zseri, OK, I'll remove the second usize. Is it your opinion that I have identified a bug?
could you push (or force-push, if you like) it to this branch (as the original fix is embedded in the associated issue, no info would be lost)? That would make it easier to observe the interaction with the remaining parts of reactor.rs. It would also be nice if we could then construct a test case that attempts to trigger it, even if it isn't deterministic.
I'll try to do that tomorrow. Can you point me to a test that might be appropriate to use as a model for the new test?
hm tbh I don't really know how to write a test for that, because it involves many complex parts:
afaik we can't use a mock network stream or such (which usually wouldn't contain much async stuff at all), because we need to reach the interesting bits of async-io, i.e. the reactor; it might be possible to simulate it using a pipe or so, though;
the original scenario in which this was triggered seems complex too, and I'm not sure how a "good" MWE for that could be constructed even when allowing all kinds of hard-coded network stuff;
getting this to hit the issue most of the time, instead of just sometimes, is probably even harder (especially because it involves potential (?) race conditions, or at least appears to).
(other than that, the currently existing integration tests in tests/ are probably a good starting point)
TL;DR: idk, if you manage to get lucky and write something which manages to hit this sometimes, then add a test which does that, otherwise... "just" don't. but if you're not able to write a test, at least try to document the reasoning behind this in the commit message, so that in case this introduces a perf regression, it's easier to tackle that while knowing why this optimization was thrown out, and what might be necessary to get fixed elsewhere to be able to reintroduce it without also reintroducing this issue.
Thanks for the info. It was fairly difficult to narrow down the problem, but one good thing was that as I added more debug logging the problem didn't go away!
As we both suspect, creating a test is likely going to be non-trivial and take significant effort on my part. It would be nice if someone with commit rights could provide initial feedback on whether this could be merged. Who might be interested in doing that quick initial review?
@yoshuawuyts @zeenix I'm not understanding the impact of this well enough, and wasn't able to fully understand the initial scenario in which the issue was noticed. This PR doesn't look like it should break anything, just impact performance negatively, but I'm not completely sure... my most important concerns are listed in https://github.com/smol-rs/async-io/pull/79#issuecomment-1160157232. Can someone else review this?
Is there still an interest in fixing this? You could probably resolve this by moving ticker into the Mutex and repopulating an atomic variable used by the Sources when the ReactorLock is dropped.
Either way, I'd like a test case of some kind before we move too far in this direction. Maybe it would be advantageous to incorporate loom.
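A rough sketch of that suggestion, under the assumption that it would look something like the following (the types here are simplified stand-ins, not the real async-io Reactor): the authoritative ticker lives behind the Mutex, and a read-only mirror is refreshed only when the lock guard is dropped, so Sources never observe a value from a reactor pass that is still in progress.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

// Hypothetical simplified reactor: the ticker itself is the Mutex-guarded
// state, mirrored into an atomic for lock-free reads by Sources.
struct Reactor {
    state: Mutex<usize>,        // authoritative ticker, behind the lock
    ticker_mirror: AtomicUsize, // snapshot read by Sources
}

struct ReactorLock<'a> {
    guard: MutexGuard<'a, usize>,
    mirror: &'a AtomicUsize,
}

impl Reactor {
    fn new() -> Self {
        Reactor { state: Mutex::new(0), ticker_mirror: AtomicUsize::new(0) }
    }

    fn lock(&self) -> ReactorLock<'_> {
        ReactorLock { guard: self.state.lock().unwrap(), mirror: &self.ticker_mirror }
    }

    /// What a Source would call instead of reading the raw atomic counter.
    fn ticker(&self) -> usize {
        self.ticker_mirror.load(Ordering::SeqCst)
    }
}

impl ReactorLock<'_> {
    fn react(&mut self) {
        *self.guard += 1; // increment while holding the lock
    }
}

impl Drop for ReactorLock<'_> {
    fn drop(&mut self) {
        // Publish the new ticker only once the reactor pass has finished.
        self.mirror.store(*self.guard, Ordering::SeqCst);
    }
}

fn main() {
    let reactor = Reactor::new();
    {
        let mut lock = reactor.lock();
        lock.react();
        // Mid-pass, Sources still see the previous value.
        assert_eq!(reactor.ticker(), 0);
    }
    // After the lock is dropped, the mirror is up to date.
    assert_eq!(reactor.ticker(), 1);
    println!("ok");
}
```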
Regarding https://github.com/smol-rs/async-io/pull/79#issuecomment-1160157232, maybe we can move the ticker into the Mutex as mentioned. idk how exactly that would impact the rest of the codebase, tho. But keeping an atomic value and a value in a Mutex synchronized is a maintenance hazard. If it's possible to move the ticker behind a mutex, we should imo do that (probably not much worse performance-wise than using SeqCst). But I'm not really sure we understand the originating problem correctly. Can we maybe temporarily add tracing to the deps of this crate in a separate branch, or behind a feature flag, add debug messages at every place where the affected variables are used, and apply this to a minimal test case to get some state logs? This looks like some internal state transition sometimes going wrong (or maybe some racing off-by-one thingy). I don't understand the problem well enough yet, and this "fix" basically nullifies an existing optimization in the code base.
@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".
Here is a branch with a bunch of debug; it needs to be rebased on top of master.
I ran into this problem while following the hole_punching tutorial for rust-libp2p. It has been discussed in this post: relay_v2 performance anomaly on small machines.
One other thing that I found helpful while working on this was a custom logger which adds time in nanos, line numbers, and the thread id to the log entry. This did/does require an unstable build because thread_id_value uses as_u64(), but it might be worth it.
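For reference, here is a sketch of such a log line on stable Rust (the formatting and names are my own, not the logger from the debug branch): ThreadId::as_u64() is unstable, but the Debug output of ThreadId is enough to tell threads apart.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Formats one log entry: "<nanos> <thread-id> <file>:<line> <message>".
fn format_entry(nanos: u128, thread: std::thread::ThreadId, file: &str, line: u32, msg: &str) -> String {
    format!("{} {:?} {}:{} {}", nanos, thread, file, line, msg)
}

// Captures the call site via file!()/line!() and stamps the current time
// and thread before printing.
macro_rules! debug_log {
    ($($arg:tt)*) => {{
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos())
            .unwrap_or(0);
        println!("{}", format_entry(
            nanos,
            std::thread::current().id(),
            file!(),
            line!(),
            &format!($($arg)*),
        ));
    }};
}

fn main() {
    debug_log!("polling source, tick = {}", 5);
}
```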
Oops accidentally closed.