
Release waiting processes when destroying a socket (without ThreadedFFI dependency)

Open daniels220 opened this issue 1 year ago • 2 comments

Plus a couple related refactoring opportunities I spotted at the last moment.

daniels220 avatar Feb 15 '24 05:02 daniels220

There is a seemingly related test failure. Ideas?

Error
Assertion failed
Output
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest

guillep avatar Feb 15 '24 16:02 guillep

Hmm...when I ran the test in my image, it passed, but if I run it many times in a row (by script, not clicking the button—100 timesRepeat: [(SocketStreamTest selector: #testFlushOtherEndClosed) debug]), it usually fails the second time—sometimes it takes more runs, but it does always fail eventually. So it seems that something about...maybe the previous run's sockets having just recently closed?...makes it so that it hasn't "realized" the connection is closed by the time the second flush is executed. (I imagine in the CI build, having run a bunch of other sockets-related tests right before it had the same effect as running the one test many times.)

Since writing this, another issue led me to realize that waitForSendDone... is a misnomer—it's not waiting for sending to complete, just for the send buffer to have room to send more. As such, it's possible to send some data and have sendDone answer true immediately, so long as what you send isn't enough to fill the send buffer. So in fact, when running this test, waitForSendDone... never actually waits, and it's just a matter of whether the internal data structures of the socket are updated in time for the socket to be considered closed by the time the second send is initiated.

As far as making the test reliable, off the top of my head the only thing I can think of is to simply insert a wait in between the two flush calls. Even 15ms is enough for the test to pass 1000x in a row on my machine, but I see other tests using 100ms, and that might be safer on slower machines, other OSes (I'm testing this on macOS), etc. Went with that and pushed a commit.
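Concretely, the stabilizing change amounts to something like this (a sketch only; the stream and socket variable names here are illustrative, not necessarily the test's actual ones):

```smalltalk
"First flush succeeds; the peer then closes."
clientStream nextPutAll: 'data'; flush.
serverSocket closeAndDestroy.

"Give the socket machinery time to notice the peer closed before
 the second flush; 15ms sufficed locally, 100ms is safer on slow CI."
(Delay forMilliseconds: 100) wait.

self should: [ clientStream nextPutAll: 'more'; flush ] raise: ConnectionClosed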

I also created #16196—I think we should rename sendDone to something that better reflects its actual meaning—but that's a separate job.

daniels220 avatar Feb 17 '24 04:02 daniels220

I restarted the CI since it has been a while, and this can fix some failing tests.

jecisc avatar Mar 25 '24 20:03 jecisc

There is still a failing socket test:

osx-64 / Tests-osx-64 / testNextPutAllFlushOtherEndClosed – MacOSX64.Network.Tests.SocketStreamTest (6s)

Error
Assertion failed

Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testNextPutAllFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest

jecisc avatar Mar 25 '24 21:03 jecisc

Aha—it's the same issue as with #testFlushOtherEndClosed. I added the wait here as well and it should be good now.

daniels220 avatar Mar 25 '24 22:03 daniels220

Now I see this one:

unix-64 / Tests-unix-64 / testFlushLargeMessageOtherEndClosed – Unix64.Network.Tests.SocketStreamTest (10s)

Error
Assertion failed

Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed
SocketStreamTest(TestCase)>>performTest

jecisc avatar Mar 26 '24 10:03 jecisc

Interesting...I believe this is because of differing defaults for SO_SNDBUF on different platforms[^1], and in particular Linux (at least the Debian 11 VM I tested on) has such a high default that my 256KiB test message actually fits in the buffer, invalidating the test. Seems appropriate to explicitly set SO_SNDBUF...at which point I discovered that the same Debian refuses to set any value below 416KiB, so I'm also reading the value back and making sure the message is larger. This now passes reliably on my machine.
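In sketch form (illustrative names; the real test body may differ), the idea is to request a buffer size, read back what the OS actually granted, and size the message off that:

```smalltalk
"Request a small send buffer; the OS may clamp the value upward
 (Debian 11 refuses anything below ~416KiB), so read back the
 granted size rather than trusting the requested one."
socket setOption: 'SO_SNDBUF' value: 64 * 1024.
grantedSize := (socket getOption: 'SO_SNDBUF') second.

"Send strictly more than the granted buffer so the flush cannot
 complete in a single send() and must notice the closed connection."
message := String new: grantedSize * 2 withAll: $a.
```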

[^1]: I checked with socket getOption: 'SO_SNDBUF' (and 'SO_RCVBUF' for good measure):

    * Windows: 65KiB send, 65KiB receive
    * macOS: ~144KiB (146988) send, ~399KiB (408300) receive (interestingly these are both multiples of 1361...sounds like there's a story there)
    * Debian 11: ~2.5MiB (2565KiB exactly) send, 128KiB receive

daniels220 avatar Mar 26 '24 20:03 daniels220

@daniels220 As GitHub does not notify a pull request’s author about this as far as I know, in case you’re not aware: GitHub now indicates ‘this branch has conflicts that must be resolved’. That seems to be due to the merge of pull request #16376.

Rinzwind avatar Apr 02 '24 20:04 Rinzwind

@Rinzwind Thanks for the heads-up—easy fix.

daniels220 avatar Apr 02 '24 23:04 daniels220

Thanks! There’s one test failure in build 6 you may still need to look at though.

Rinzwind avatar Apr 03 '24 07:04 Rinzwind

> Thanks! There’s one test failure in build 6 you may still need to look at though.

Strangely, this only breaks on Windows, with a timeout. Maybe the issue is somewhere other than the code?

TestTookTooMuchTime
Process>>pvtSignal:list:
SocketStream>>checkFlush
SocketStream>>nextPutAll:
[
			clientStream
				nextPutAll: (String new: bufferSize * 2 withAll: $a);
				flush ] in SocketStreamTest>>testFlushLargeMessageOtherEndClosed
[
	  aBlock value.
	  false ] in SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
FullBlockClosure(BlockClosure)>>on:do:
SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed

guillep avatar Apr 04 '24 13:04 guillep

I restarted a build to see if this is random or not.

I don't know if we should merge this one or not because of this test. It already fixes a lot of our current failing tests.

jecisc avatar Apr 05 '24 11:04 jecisc

The same failure happened again, with the same stack trace.

jecisc avatar Apr 07 '24 13:04 jecisc

I'm hoping to look at this tomorrow (Monday). I have a sneaking suspicion what the problem might be, but it's not something I was ever able to reproduce/verify—it would be great if this were a reproducible test case, but also, I could be completely off base.

daniels220 avatar Apr 08 '24 04:04 daniels220

Okay wow, no, it's very much not what I suspected. So on Windows, when you try to set SO_SNDBUF to a reasonable value, it instead gets set to a distinctly unreasonable value—something absolutely massive. The test isn't really "taking too long", it's trying to allocate a multi-gigabyte array, and it's a tossup whether that times out or fails with OOM.

So I can work around that by only trying to set the buffer on not-Windows (or just giving up entirely and accepting that I'll need to send an 8MB message to make the test work on Linux)...at which point I find out that Windows will happily say, "yup, I sent all that" when you make a send() call that massively exceeds the send buffer, rather than saying "I could send 65KiB of that, try again later for the rest". So there's no opportunity to then detect "wait, the other end is closed" the next time around, because it only tries once. The only way to make the test work is to add a separate explicit send/flush...but that's precisely the point of the test: if you send enough data, it absolutely should fail without that!

So there's an arguable VM/plugin bug in play—though I think it's more like some nasty Windows errata that the VM might want to work around. Certainly looking at the plugin code myself, I don't see anything suspicious, it must be that send() is just agreeing to something that it really shouldn't. And yet it also seems like, if the other end doesn't close, the data does all get sent with just that one send() call...what in the world is Windows doing?

So. Effectively this test is expected-failure on Windows, in that I don't know how to make the behavior it's testing work on Windows. It's not a regression from earlier Pharo versions; the test itself is new (my doing). Thoughts on how to proceed?

daniels220 avatar Apr 08 '24 20:04 daniels220

There’s a bug when setting SO_SNDBUF to a value with more than 4 digits, try:

(9999 to: 10000) collect: [ :value |
	(Socket new
		setOption: 'SO_SNDBUF' value: value;
		getOption: 'SO_SNDBUF') second ]

I get #(9999 131072) on macOS, and #(9999 808464433) on Windows. This seems to be related to the condition on line 1539 of ‘SocketPluginImpl.c’ as of Pharo VM commit 32b2be55c1.

Rinzwind avatar Apr 09 '24 23:04 Rinzwind

@Rinzwind Ouch, yes, I actually looked at the plugin code and was suspicious of those lines, but couldn't understand them well enough to specifically point to a bug. Looking at it now, yeah, wow, that line is completely confusing "length of a string" with "byte size of a type". Most 32-bit integers are longer than four decimal digits! (Of course strtol returns long, which might be 64 bits, but the same applies to 8 decimal digits not covering the 64-bit int range—in fact 8 decimal digits doesn't even cover the 32-bit range!) As a stopgap measure, it might make sense to just remove that first condition and simply check that the string consists exclusively of digits, at any length. We would also technically need to check for over/underflow, as 32 characters can exceed even a 64-bit int.

Of course the right solution, as implied in the comments in the plugin code, is for the image to pass the values as their actual types, the primitive to convert them to appropriate encoding in a void* pointer, and pass that to the sqSocketSetOptions implementation which would use it pretty much as-is. Honestly this seems not especially hard? But I don't have the ability to compile the VM (I tried on a brand-new Debian 11 VM and the result segfaulted precisely when using sockets, go figure) and don't know quite enough C to go it alone.

But then this starts sounding an awful lot like what the FFI marshaling layer is already doing, which really brings me back around to a pet peeve of mine, that being, why is so much of the sockets implementation in primitives that are just thin wrappers over syscalls? I know we need a tiny bit of abstraction to handle different platforms, but that could be handled by switching to different helper classes or something in the image, with just a very very few primitives to handle the parts that absolutely cannot be done any other way (I'm thinking some of the stuff with aio in the name may qualify here, but I'd still want to keep as much as possible in the image).

In any case, we still have the weird behavior in Windows that we can have SO_SNDBUF at its default of 65536, send a multi-megabyte message, and Windows will happily accept the entire thing in one chunk (send(2) will return the full length you give it) and actually succeed in sending it, assuming the socket is not closed in the meantime. Where the hell is it putting all that data? Isn't this what the send buffer size is supposed to mean?

daniels220 avatar Apr 10 '24 03:04 daniels220

Another approach to solving the bug could be to change the condition on line 1539 to opt->isint with isint being a new field that indicates whether the option expects an int or not. The table ‘Socket-Level Options’ in section ‘2.10.16 Use of Options’ in the System Interfaces volume of POSIX.1-2017 has a column ‘Parameter Type’.

As for the Windows behavior, that’s not a newly introduced problem that might require reverting recent changes as far as I understand, so you may need to make #testFlushLargeMessageOtherEndClosed signal TestSkipped for now when running on Windows or change it to test the ‘current’ rather than ‘desired’ behavior with respect to the signaling of ConnectionClosed.

Rinzwind avatar Apr 10 '24 19:04 Rinzwind

Yes, explicitly specifying the type is clearly the correct behavior—that's more in line with what I was thinking of in terms of refactoring so the image passes the parameters as their correct type rather than strings. The declaration on the C side would still be needed for getsockopt(), in order to correctly interpret the return value, and we could use it for error-checking—if the image passes the incorrect type, fail the primitive rather than feeding garbage to setsockopt().

Makes sense re: skipping the test on Windows. Just did that.
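Presumably the guard is something along these lines (a sketch; Pharo's SUnit skip signals TestSkipped, matching Rinzwind's suggestion):

```smalltalk
testFlushLargeMessageOtherEndClosed
	"Windows accepts a send() far larger than SO_SNDBUF in one call,
	 so the blocked-flush path under test can't be provoked there."
	OSPlatform current isWindows ifTrue: [ ^ self skip ].
	"... rest of the test unchanged ..."
```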

daniels220 avatar Apr 10 '24 23:04 daniels220

The last build failed due to issue #16443. That seems to have been fixed so another build could be started but its Windows test run will probably fail now due to issue #16450.

Rinzwind avatar Apr 12 '24 19:04 Rinzwind

@jecisc No new build was started. I think you need to wait a bit longer for Jenkins to have processed the ‘REMOVED event’ before reopening.

I tried it in pull request #16449. I first closed and immediately reopened it. Then I closed it again and waited a bit longer before I reopened it. A new build was only started the second time. In the ‘Multibranch Pipeline Events’ log you can see how the events were processed, note how the processing of the first ‘REMOVED event’ started at 22:24:40 which is after the timestamp 22:24:38 of the first ‘UPDATED event’ so the pull request was already reopened:

[Fri Apr 12 22:24:40 CEST 2024] Received Pull request #16449 closed […] REMOVED event […] with timestamp Fri Apr 12 22:24:35 CEST 2024
[…]
  Getting remote pull request #16449...
[…]
No changes detected: PR-16449 […]
[…]
  1 pull requests were processed
[…]
[Fri Apr 12 22:24:42 CEST 2024] Pull request #16449 closed […] REMOVED event […] processed in 1.5 sec
[Fri Apr 12 22:24:43 CEST 2024] Received Pull request #16449 reopened […] UPDATED event […] with timestamp Fri Apr 12 22:24:38 CEST 2024
[…]
  Getting remote pull request #16449...
[…]
No changes detected: PR-16449 […]
[…]
  1 pull requests were processed
[…]
[Fri Apr 12 22:24:45 CEST 2024] Pull request #16449 reopened […] UPDATED event […] processed in 1.3 sec
[Fri Apr 12 22:25:26 CEST 2024] Received Pull request #16449 closed […] REMOVED event […] with timestamp Fri Apr 12 22:25:21 CEST 2024
[…]
  Getting remote pull request #16449...

  0 pull requests were processed
[…]
[Fri Apr 12 22:25:27 CEST 2024] Pull request #16449 closed […] REMOVED event […] processed in 0.83 sec
[Fri Apr 12 22:25:39 CEST 2024] Received Pull request #16449 reopened […] UPDATED event […] with timestamp Fri Apr 12 22:25:34 CEST 2024
[…]
  Getting remote pull request #16449...
[…]
Pull Request reopened: PR-16449 […]
Scheduled build for branch: PR-16449
[…]
  1 pull requests were processed
[…]
[Fri Apr 12 22:25:40 CEST 2024] Pull request #16449 reopened […] UPDATED event […] processed in 1.5 sec

Rinzwind avatar Apr 12 '24 21:04 Rinzwind

Tests passed! Except a known random failure.

Should we integrate?

jecisc avatar Apr 15 '24 22:04 jecisc

and now we have a conflict :P

estebanlm avatar Apr 16 '24 10:04 estebanlm

Right, I remember seeing the issue whose fix created that conflict. As it happens I have some further refinement, but in any case, conflict resolved.

daniels220 avatar Apr 16 '24 23:04 daniels220

It's green!

MarcusDenker avatar Apr 25 '24 14:04 MarcusDenker

There are no test results for Windows in the most recent build’s results though. The output of the ‘Allocate node’ step for the Windows test run in that build (build 14) shows ‘pharo-ci-jenkins2-win10’ was used (which doesn’t actually run the tests, see issue #15105):

Running on pharo-ci-jenkins2-win10 in C:\builds\workspace\est_and_branch_Pipeline_PR-16172

Rinzwind avatar Apr 25 '24 18:04 Rinzwind