gvisor POLLRDHUP delivery fix

Hi,

I am proposing a change that should fix issue #6015.

As per my understanding POLLRDHUP triggers when there is a half-closed TCP connection (either remote shutdown(SHUT_WR) or local shutdown(SHUT_RD)), therefore I have added a new equivalent event to waiter - waiter.EventRdHUp and added the event to allEvents and a "list" of events that can get triggered on read in sys_read.go. The TCP endpoint now checks whether the connection is half-closed and sets the appropriate flag. Additionally, TCP shutdown function notifies when only read end of the endpoint connection is closed. I have updated the syscall tests to reflect this change.

It is possible that some things are missing here but I will gladly update the PR if something was overlooked.

Let me know what you think.

Mar 31 '22 20:03 turekt

I have added the tests. Let me know what you think.

May 01 '22 15:05 turekt

Hi @hbhasker, I understand that you might be busy. I have noticed that it is soon going to be a year since this PR was created. I was wondering whether you could find the time to review this PR?

Thank you in advance.

Jan 23 '23 21:01 turekt

Hi @kevinGC, thanks for your review. I have changed the code per your suggestions.

Let me know what you think.

Jan 25 '23 19:01 turekt

Been trying to merge this, but running into a failing test internally. Should be able to reproduce it by running this:

bazel test //test/syscalls:socket_inet_loopback_test_runsc_ptrace --runs_per_test=1000

POLLRDHUP isn't being returned all the time -- it's flaky.

I believe this is happening because the endpoint state e.state changes independently of e.connectionDirectionState. Because they aren't synchronized, it's possible for this sequence to occur:

e.state is updated
e.Readiness is called, and returns only POLLIN/EventIn
The test poll() returns only POLLIN
e.connectionDirectionState is set to connDirectionStateRcvClosed, but it's too late

We might have to synchronize these values, or take a look at how others handle the problem. Not 100% sure this is the issue, but it's my current theory.

Feb 22 '23 01:02 kevinGC

Hi @kevinGC, thanks for letting me know!

I managed to reproduce the failing test. I made a small change to the commit and retested. My first test run finished successfully, but I am running a second one just in case. The --runs_per_test=1000 is very slow for me so I was hoping that we could quickly verify the second test run with copybara if possible.

Hopefully this change fixes the POLLRDHUP issue.

Feb 28 '23 21:02 turekt

Hi @kevinGC, the second test run is finished and successful! The latest changes should be a complete fix for the POLLRDHUP issue.

Mar 06 '23 16:03 turekt

Hi @kevinGC, any updates?

Jul 09 '23 14:07 turekt