POLLRDHUP delivery fix
Hi,
I am proposing a change that should fix issue #6015.
As per my understanding POLLRDHUP triggers when there is a half-closed TCP connection (either remote shutdown(SHUT_WR) or local shutdown(SHUT_RD)), therefore I have added a new equivalent event to waiter - waiter.EventRdHUp and added the event to allEvents and a "list" of events that can get triggered on read in sys_read.go. The TCP endpoint now checks whether the connection is half-closed and sets the appropriate flag. Additionally, TCP shutdown function notifies when only read end of the endpoint connection is closed. I have updated the syscall tests to reflect this change.
It is possible that some things are missing here but I will gladly update the PR if something was overlooked.
Let me know what you think.
I have added the tests. Let me know what you think.
Hi @hbhasker, I understand that you might be busy. I have noticed that it is soon going to be a year since this PR was created. I was wondering whether you could find the time to review this PR?
Thank you in advance.
Hi @kevinGC, thanks for your review. I have changed the code per your suggestions.
Let me know what you think.
Been trying to merge this, but running into a failing test internally. Should be able to reproduce it by running this:
bazel test //test/syscalls:socket_inet_loopback_test_runsc_ptrace --runs_per_test=1000
POLLRDHUP isn't being returned all the time -- it's flaky.
I believe this is happening because the endpoint state e.state changes independently of e.connectionDirectionState. Because they aren't synchronized, it's possible for this sequence to occur:
e.stateis updatede.Readinessis called, and returns onlyPOLLIN/EventIn- The test
poll()returns onlyPOLLIN e.connectionDirectionStateis set toconnDirectionStateRcvClosed, but it's too late
We might have to synchronize these values, or take a look at how others handle the problem. Not 100% sure this is the issue, but it's my current theory.
Hi @kevinGC, thanks for letting me know!
I managed to reproduce the failing test. I made a small change to the commit and retested. My first test run finished successfully, but I am running a second one just in case. The --runs_per_test=1000 is very slow for me so I was hoping that we could quickly verify the second test run with copybara if possible.
Hopefully this change fixes the POLLRDHUP issue.
Hi @kevinGC, the second test run is finished and successful! The latest changes should be a complete fix for the POLLRDHUP issue.
Hi @kevinGC, any updates?