
fix: missing waker.wake

Open winksaville opened this issue 2 years ago • 24 comments

My analysis of the problem is that using the current Reactor::ticker value in ticks is not always correct. My understanding is that when placing a ticker value into Direction::ticks there is a guarantee that the waker.wake call has completed and the associated task has been run. The problem is that the increment of the ticker value in ReactorLock::react typically runs on a different CPU thread than Source::poll_ready (called by poll_readable and poll_writable), which saves the ticker value in ticks:

state[dir].ticks = Some((Reactor::get().ticker(), state[dir].tick));

Because the increment and the use of the ticker value happen on different threads, the required guarantee cannot be fulfilled. I'm proposing the following fix in this PR, which uses only the current state[dir].tick and not the ticker value:
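The race can be sketched with plain std atomics. This is a hypothetical simplification, not the real reactor code: the names Reactor, ticker, and the loop bounds are illustrative only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Sketch of the two racing accesses: one thread bumps the ticker
// (as ReactorLock::react does), while a poller snapshots it while
// holding the state lock (as Source::poll_ready does). SeqCst makes
// each access atomic, but it does NOT decide which increment a given
// poll observes.
fn main() {
    let ticker = Arc::new(AtomicUsize::new(0));
    let state = Arc::new(Mutex::new(None::<usize>));

    let reactor = {
        let ticker = ticker.clone();
        thread::spawn(move || {
            for _ in 0..1000 {
                // ReactorLock::react: the next tick begins.
                ticker.fetch_add(1, Ordering::SeqCst);
            }
        })
    };

    // Source::poll_ready on another thread: snapshot the ticker while
    // holding the state lock. Which of the 1000 increments we observe
    // is entirely timing-dependent.
    let snapshot = {
        let mut guard = state.lock().unwrap();
        let t = ticker.load(Ordering::SeqCst);
        *guard = Some(t);
        t
    };

    reactor.join().unwrap();
    assert!(snapshot <= 1000);
    assert_eq!(ticker.load(Ordering::SeqCst), 1000);
    println!("poll observed ticker = {snapshot} (any value 0..=1000)");
}
```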

state[dir].ticks = Some((state[dir].tick, 0));

fix #78
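To make the difference concrete, here is a hypothetical simplification of the readiness check. The function names are made up for illustration; see reactor.rs for the real logic.

```rust
// Before the PR: readiness was suppressed unless the source's current
// tick differed from BOTH the global ticker value and the per-direction
// tick captured at poll time.
fn ready_before(ticks: Option<(usize, usize)>, tick: usize) -> bool {
    match ticks {
        Some((a, b)) => tick != a && tick != b,
        None => false,
    }
}

// After the PR: only the per-direction tick captured at poll time is
// compared, so any event delivered since the last poll reports ready.
fn ready_after(ticks: Option<usize>, tick: usize) -> bool {
    match ticks {
        Some(t) => tick != t,
        None => false,
    }
}

fn main() {
    // Scenario from the analysis above: the reactor thread's ticker had
    // already raced ahead to 9 when the poller stored (9, 7); a later
    // event then set the source's tick to 9.
    assert!(!ready_before(Some((9, 7)), 9)); // suppressed: missed wake
    assert!(ready_after(Some(7), 9)); // reported ready
    println!("ok");
}
```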

winksaville avatar May 06 '22 17:05 winksaville

@taiki-e any chance you can take a look at this?

dignifiedquire avatar May 17 '22 15:05 dignifiedquire

@dignifiedquire, would it be better to discuss this on issue #78, as there may be better solutions and/or other related issues?

winksaville avatar May 17 '22 15:05 winksaville

Sorry, I don't have time to investigate this problem at this time.

taiki-e avatar Jun 15 '22 03:06 taiki-e

r? @smol-rs/admins

taiki-e avatar Jun 15 '22 03:06 taiki-e

This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which bothers me, because the outer code should call it again if the expected events didn't occur. What also bothers me a bit is that in the case where the "tick expired" (an event was delivered for a newer, "non-anticipated" tick), we don't call the waker (even though it looks like we might then call the waker either not at all or too late), but we do call it when we replace it, which seems a bit odd... still, calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.

fogti avatar Jun 17 '22 07:06 fogti

This appears to reduce the number of completions (e.g. Poll::Ready(Ok(()))), which bothers me, because the outer code should call it again if the expected events didn't occur. What also bothers me a bit is that in the case where the "tick expired" (an event was delivered for a newer, "non-anticipated" tick), we don't call the waker (even though it looks like we might then call the waker either not at all or too late), but we do call it when we replace it, which seems a bit odd... still, calling a waker when it doesn't need to be called is usually better than not calling it when it may be necessary.

I might be mistaken, but it seems you're saying this change will cause additional missed wakes. In the particular case I've seen, though, it fixes a missed wake. There certainly may be better ways to fix this, but I believe this change decreases the likelihood of missed wakes.

winksaville avatar Jun 19 '22 20:06 winksaville

Theoretically, the problem shouldn't happen, because Reactor::get().ticker() uses SeqCst, which enforces a global ordering across threads for that value, and because self.state is locked (it's a Mutex) at that time, which means [dir].tick can't race either. A call to the waker only schedules the task, which means there is no guarantee when the associated task itself gets polled, or in which order; only that the Poll::Pending will eventually bubble up and cause the futures executor/queue to be processed.

The problem with this "fix" is that it basically removes an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(()))), and probably hides the original source of the error, which I would suspect to be a missing waker call somewhere else, or a duplicate readiness check (which would usually cause a missing or delayed wake call, because it consumes a ready event, and might cause a deadlock).

This "fix" replaces the tick optimization with much more polling, afaik, because every second poll of the containing Future would produce a ready event when ticks is set (that is, after a poll_ready call) and tick was updated (the source received an event). I suspect this might cause additional spurious wakeups (because it would also count events delivered by the previous reactor tick).

Also, even if this PR fixes the problem, or at least its symptoms (with the downside of probably reduced performance), it basically obscures the code, because ticks: Option<(usize, usize)> no longer represents what's going on... it should instead be replaced by ticks: Option<usize>, for symmetry, at the two locations in that file where this pattern is used.

Don't let this discourage you, I'm mostly just trying to understand where this problem stems from, and how to keep the code legible.

fogti avatar Jun 20 '22 08:06 fogti

In my initial comment on this PR I say:

It is my understanding that when placing a ticker value into Direction::ticks there is a guarantee that the waker.wake call has completed and the associated task has been run.

Is this a valid statement or am I mistaken?

winksaville avatar Jun 20 '22 14:06 winksaville

when placing a ticker value into Direction::ticks there is a guarantee that the waker.wake call has completed

afaik correct (but I'm not really familiar with this code base, and have only looked at reactor.rs)

and the associated task has been run.

This doesn't hold (afaik it is generally not true), because in many cases calling the waker only marks a task as to-be-scheduled in the associated executor, which often involves a queue, and thus gives no guarantee about when (or if at all!) the future-of-interest will be polled again (because of "any-future-of" constructs and such). That is, even if the task gets rescheduled, the future-of-interest might get discarded if an alternative future is polled earlier in that task execution. (This is sort of theoretical in this case, because that shouldn't cause deadlocks or delays.) And we usually don't know when the (now assumed to be scheduled) task gets polled again.
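That point can be illustrated with a toy waker built on std::task::Wake (everything here is illustrative, not async-io code): wake() completing only means the task was marked as scheduled, not that it was polled.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::Wake;

// A toy waker: wake() merely records that the task should be
// scheduled. Nothing polls any future at that moment.
struct FlagWaker(AtomicBool);

impl Wake for FlagWaker {
    fn wake(self: Arc<Self>) {
        self.0.store(true, Ordering::SeqCst);
    }
}

fn main() {
    let flag = Arc::new(FlagWaker(AtomicBool::new(false)));
    let waker = std::task::Waker::from(flag.clone());

    waker.wake_by_ref();
    // wake() has completed, but all it did was set a flag for an
    // executor to notice later; no future has been polled yet.
    assert!(flag.0.load(Ordering::SeqCst));
    println!("wake completed; task merely marked as scheduled");
}
```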

fogti avatar Jun 21 '22 07:06 fogti

I agree with you: the code has been queued, but there is no guarantee it ran. So the act of sending the wake event only says that at some point in the future the other thread will run, but on the async-io side there is, as you say:

"an optimization which suppresses supposedly unnecessary/currently-consumed ready events (Poll::Ready(Ok(())))"

That "optimization" is the cause of the bug, as now a Poll::Ready(Ok(())) will not be returned when it should have been.

I believe you are probably correct when you say:

"the ticks: Option<(usize, usize)> doesn't really represent what's going on anymore... and should instead also be replaced by ticks: Option<usize>, for symmetry: at the two locations in that file where this pattern is used."

I will remove the second usize if desired.

winksaville avatar Jun 21 '22 12:06 winksaville

I will remove the second usize if desired.

Please do that.

fogti avatar Jun 21 '22 12:06 fogti

@zseri, Ok, I'll remove the second usize. Is it your opinion I have identified a bug?

winksaville avatar Jun 21 '22 15:06 winksaville

Could you push (or force-push, if you like) it to this branch? (As the original fix is embedded in the associated issue, no info would be lost.) That would make it easier to observe the interaction with the remaining parts of reactor.rs. It would also be nice if we could then construct a test case that attempts to trigger it, even if it isn't deterministic.

fogti avatar Jun 21 '22 18:06 fogti

I'll try to do that tomorrow. Can you point me to a test that might be appropriate to use as a model for the new test?

winksaville avatar Jun 21 '22 18:06 winksaville

Hm, tbh I don't really know how to write a test for that, because it involves many complex parts. Afaik we can't use a mock network stream or such (which usually wouldn't contain much async stuff at all), because we need to reach the interesting bits of async-io, i.e. the reactor; it might be possible to simulate it using a pipe or so, though. The original scenario in which this was triggered seems complex too, and I'm not sure how a "good" MWE for that could be constructed even when allowing all kinds of hard-coded network stuff. Getting this to hit the issue most of the time, instead of just sometimes, is probably even harder (especially because it involves potential (?) race conditions, or at least appears to). Other than that, the currently existing integration tests in tests/ are probably a good starting point.

TL;DR: idk. If you manage to get lucky and write something which hits this sometimes, then add a test which does that; otherwise... "just" don't. But if you're not able to write a test, at least try to document the reasoning behind this in the commit message, so that in case this introduces a perf regression, it's easier to tackle it while knowing why this optimization was thrown out, and what might need to be fixed elsewhere to be able to reintroduce it without also reintroducing this issue.

fogti avatar Jun 21 '22 18:06 fogti

Thanks for the info. It was fairly difficult to narrow down the problem, but one good thing was that as I added more debug logging the problem didn't go away!

winksaville avatar Jun 21 '22 19:06 winksaville

As we both suspect, creating a test is likely going to be non-trivial and take significant effort on my part. It would be nice if someone with commit rights could provide initial feedback on whether this could be merged. Who might be interested in doing that quick initial review?

winksaville avatar Jun 21 '22 20:06 winksaville

@yoshuawuyts @zeenix I'm not understanding the impact of this well enough, and wasn't able to fully understand the initial scenario in which the issue was noticed. This PR doesn't look like it should break anything, just impact performance negatively, but I'm not completely sure... my most important concerns are listed in https://github.com/smol-rs/async-io/pull/79#issuecomment-1160157232. Can someone else review this?

fogti avatar Jun 21 '22 23:06 fogti

Is there still interest in fixing this? You could probably resolve this by moving ticker into the Mutex and repopulating an atomic variable used by the Sources when the ReactorLock is dropped.

Either way, I'd like a test case of some kind before we move too far in this direction. Maybe it would be advantageous to incorporate loom.
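A rough sketch of what that suggestion could look like (hypothetical types; the real Reactor and ReactorLock have many more fields): the ticker increment happens under the state lock, and the atomic copy is republished when the lock is dropped.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

// The authoritative ticker lives inside the Mutex, so it can only
// change while the reactor lock is held; a cached atomic copy is
// refreshed for lock-free readers when the lock is dropped.
struct Reactor {
    state: Mutex<State>,
    cached_ticker: AtomicUsize,
}

struct State {
    ticker: usize,
}

struct ReactorLock<'a> {
    reactor: &'a Reactor,
    state: MutexGuard<'a, State>,
}

impl ReactorLock<'_> {
    fn react(&mut self) {
        // The increment now happens under the same lock that
        // Source::poll_ready takes, so the two can no longer race.
        self.state.ticker += 1;
    }
}

impl Drop for ReactorLock<'_> {
    fn drop(&mut self) {
        // Republish the ticker for readers that don't hold the lock.
        self.reactor
            .cached_ticker
            .store(self.state.ticker, Ordering::SeqCst);
    }
}

fn main() {
    let reactor = Reactor {
        state: Mutex::new(State { ticker: 0 }),
        cached_ticker: AtomicUsize::new(0),
    };
    {
        let mut lock = ReactorLock {
            reactor: &reactor,
            state: reactor.state.lock().unwrap(),
        };
        lock.react();
    } // dropping the lock refreshes the cached copy
    assert_eq!(reactor.cached_ticker.load(Ordering::SeqCst), 1);
    println!("cached ticker = 1");
}
```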

notgull avatar Sep 07 '22 20:09 notgull

Regarding https://github.com/smol-rs/async-io/pull/79#issuecomment-1160157232, maybe we can move the ticker into the Mutex as mentioned. Idk how exactly that would impact the rest of the codebase, tho. But keeping an atomic value and a value in a Mutex synchronized is a maintenance hazard; if it is possible to move the ticker behind a mutex, we should imo do that (probably not much worse performance-wise than using SeqCst).

But I'm not really sure we understand the originating problem correctly. Can we maybe temporarily add tracing to the deps of this crate in a separate branch, or behind a feature flag, add debug messages at every place where the affected variables are used, and apply this to a minimal test case to get some state logs? This looks like some internal state transition sometimes going wrong (or maybe some racing off-by-one thingy). I don't understand the problem well enough yet, and this "fix" basically nullifies an existing optimization in the code base.

fogti avatar Sep 07 '22 21:09 fogti

@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".

fogti avatar Sep 07 '22 21:09 fogti

@winksaville you said you added copious amount of logging to the library stack, do you still have the patches for that laying around? I would like to try debugging this myself, but don't want to start at "square zero".

Here is a branch with a bunch of debug, it needs to be rebased on top of master.

winksaville avatar Sep 07 '22 22:09 winksaville

I ran into this problem while following the hole punching tutorial for rust-libp2p. It has been discussed in the post "relay_v2 performance anomaly on small machines".

One other thing I found helpful while working on this was a custom logger which adds the time in nanoseconds, line numbers, and the thread id to each log entry. This did/does require an unstable build because thread_id_value uses as_u64(), but it might be worth it.
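On stable Rust, a similar log line can be produced without the unstable thread_id_value feature by falling back to the Debug form of ThreadId. This is only a sketch of the idea, not the logger actually used:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Build one log line with nanosecond timestamp, source location, and
// thread id. Stable Rust has no ThreadId::as_u64(), so this sketch
// uses the Debug form of ThreadId (e.g. "ThreadId(1)") instead.
fn log_line(file: &str, line: u32, msg: &str) -> String {
    let ns = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos();
    let tid = std::thread::current().id();
    format!("{ns} {file}:{line} {tid:?} {msg}")
}

fn main() {
    // file!() and line!() expand to the call site's location.
    println!("{}", log_line(file!(), line!(), "poll_ready observed tick 7"));
}
```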

winksaville avatar Sep 07 '22 23:09 winksaville

Oops accidentally closed.

winksaville avatar Sep 07 '22 23:09 winksaville