Release waiting processes when destroying a socket (without ThreadedFFI dependency)
Plus a couple of related refactoring opportunities I spotted at the last moment.
There is a seemingly related test failure. Ideas?
Error
Assertion failed
Output
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest
Hmm...when I ran the test in my image, it passed, but if I run it many times in a row (by script, not clicking the button—100 timesRepeat: [(SocketStreamTest selector: #testFlushOtherEndClosed) debug]), it usually fails the second time—sometimes it takes more runs, but it does always fail eventually. So it seems that something...maybe the previous run's sockets having just recently closed?...makes it so that it hasn't "realized" the connection is closed by the time the second flush is executed. (I imagine that in the CI build, having run a bunch of other socket-related tests right beforehand had the same effect as running the one test many times.)
Since writing this, another issue led me to realize that waitForSendDone... is a misnomer—it's not waiting for sending to complete, just for the send buffer to have room to send more. As such, it's possible to send some data and have sendDone answer true immediately, so long as what you send isn't enough to fill the send buffer. So in fact, when running this test, waitForSendDone... never actually waits, and it's just a matter of whether the internal data structures of the socket are updated in time for the socket to be considered closed by the time the second send is initiated.
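For illustration, a minimal sketch (sendData: and sendDone are the existing Socket API; the host and port here are arbitrary placeholders):

| socket |
socket := Socket newTCP.
socket connectTo: (NetNameResolver addressForName: 'example.com') port: 80.
socket sendData: 'tiny payload'.
"Typically already true here: the payload merely entered the OS send buffer;
nothing has been confirmed sent, let alone received."
socket sendDone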
As far as making the test reliable, off the top of my head the only thing I can think of is to simply insert a wait in between the two flush calls. Even 15ms is enough for the test to pass 1000x in a row on my machine, but I see other tests using 100ms, and that might be safer on slower machines, other OSes (I'm testing this on macOS), etc. Went with that and pushed a commit.
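Roughly, the shape of the change (a sketch, not the exact committed diff; clientStream and ConnectionClosed are the names the test already uses):

clientStream flush.
"Give the other end's close time to propagate before the second flush;
100ms matches the delay other socket tests use."
100 milliSeconds wait.
self should: [ clientStream flush ] raise: ConnectionClosed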
I also created #16196—I think we should rename sendDone to something that better reflects its actual meaning—but that's a separate job.
I restarted the CI since it has been a while, and this may fix some failing tests.
There is still a failing socket test:
osx-64 / Tests-osx-64 / testNextPutAllFlushOtherEndClosed – MacOSX64.Network.Tests.SocketStreamTest (6s)
Error
Assertion failed
Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testNextPutAllFlushOtherEndClosed
SocketStreamTest(TestCase)>>performTest
Aha—it's the same issue as with #testFlushOtherEndClosed. I added the wait here as well and it should be good now.
Now I see this one:
unix-64 / Tests-unix-64 / testFlushLargeMessageOtherEndClosed – Unix64.Network.Tests.SocketStreamTest (10s)
Error
Assertion failed
Stacktrace
TestFailure
Assertion failed
SocketStreamTest(TestAsserter)>>assert:description:resumable:
SocketStreamTest(TestAsserter)>>assert:description:
SocketStreamTest(Object)>>assert:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed
SocketStreamTest(TestCase)>>performTest
Interesting...I believe this is because of differing defaults for SO_SNDBUF on different platforms[^1], and in particular Linux (at least the Debian 11 VM I tested on) has such a high default that my 256KiB test message actually fits in the buffer, invalidating the test. Seems appropriate to explicitly set SO_SNDBUF...at which point I discovered that the same Debian refuses to set any value below 416KiB, so I'm also reading the value back and making sure the message is larger. This now passes reliably on my machine.
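In sketch form (variable names here are illustrative; setOption:value: and getOption: are the existing Socket API):

| socket requested actual message |
socket := Socket newTCP.
requested := 256 * 1024.
socket setOption: 'SO_SNDBUF' value: requested.
"The OS may round the requested size up (this Debian 11 would not go below
~416KiB), so read back the effective size and build a message that still
overflows it."
actual := (socket getOption: 'SO_SNDBUF') second.
message := String new: actual * 2 withAll: $a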
[^1]: I checked with socket getOption: 'SO_SNDBUF' (and 'SO_RCVBUF' for good measure; a runnable form follows the list):
* Windows: 65KiB send, 65KiB receive
* macOS: ~144KiB (146988) send, ~399KiB (408300) receive (interestingly these are both multiples of 1361...sounds like there's a story there)
* Debian 11: ~2.5MiB (2565KiB exactly) send, 128KiB receive
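In runnable form (a sketch; assumes a freshly created socket):

| socket |
socket := Socket newTCP.
#('SO_SNDBUF' 'SO_RCVBUF') collect: [ :option |
	option -> (socket getOption: option) second ]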
@daniels220 As GitHub does not notify a pull request’s author about this as far as I know, in case you’re not aware: GitHub now indicates ‘this branch has conflicts that must be resolved’. That seems to be due to the merge of pull request #16376.
@Rinzwind Thanks for the heads-up—easy fix.
Thanks! There’s one test failure in build 6 you may still need to look at though.
Strangely, this only breaks on Windows, with a timeout. Maybe the issue is somewhere other than the code?
TestTookTooMuchTime
Process>>pvtSignal:list:
SocketStream>>checkFlush
SocketStream>>nextPutAll:
[ clientStream nextPutAll: (String new: bufferSize * 2 withAll: $a); flush ] in SocketStreamTest>>testFlushLargeMessageOtherEndClosed
[ aBlock value. false ] in SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
FullBlockClosure(BlockClosure)>>on:do:
SocketStreamTest(TestAsserter)>>executeShould:inScopeOf:
SocketStreamTest(TestAsserter)>>should:raise:
SocketStreamTest>>testFlushLargeMessageOtherEndClosed
I restarted a build to see if this is random or not.
I don't know if we should merge this one or not because of this test. It already fixes a lot of our current failing tests.
The same failure happened again with the same stack.
I'm hoping to look at this tomorrow (Monday). I have a sneaking suspicion what the problem might be, but it's not something I was ever able to reproduce/verify—it would be great if this were a reproducible test case, but also, I could be completely off base.
Okay wow, no, it's very much not what I suspected. So on Windows, when you try to set SO_SNDBUF to a reasonable value, it instead gets set to a distinctly unreasonable value—something absolutely massive. The test isn't really "taking too long"; it's trying to allocate a multi-gigabyte array, and it's a tossup whether that times out or fails with OOM.
So I can work around that by only trying to set the buffer on not-Windows (or just giving up entirely and accepting that I'll need to send an 8MB message to make the test work on Linux)...at which point I find out that Windows will happily say, "yup, I sent all that" when you make a send() call that massively exceeds the send buffer, rather than saying "I could send 65KiB of that, try again later for the rest". So there's no opportunity to then detect "wait, the other end is closed" the next time around, because it only tries once. The only way to make the test work would be to add a separate explicit send/flush...but that's precisely the point of the test: if you send enough data, it absolutely should fail without that!
So there's an arguable VM/plugin bug in play—though I think it's more like some nasty Windows errata that the VM might want to work around. Certainly, looking at the plugin code myself, I don't see anything suspicious; it must be that send() is just agreeing to something that it really shouldn't. And yet it also seems that, if the other end doesn't close, the data does all get sent with just that one send() call...what in the world is Windows doing?

So. Effectively this test is expected-failure on Windows, in that I don't know how to make the behavior it's testing work on Windows. It's not a regression from earlier Pharo versions; the test itself is new (my doing). Thoughts on how to proceed?
There’s a bug when setting SO_SNDBUF to a value with more than 4 digits; try:
(9999 to: 10000) collect: [ :value |
	(Socket new
		setOption: 'SO_SNDBUF' value: value;
		getOption: 'SO_SNDBUF') second ]
I get #(9999 131072) on macOS, and #(9999 808464433) on Windows. This seems to be related to the condition on line 1539 of ‘SocketPluginImpl.c’ as of Pharo VM commit 32b2be55c1.
@Rinzwind Ouch, yes, I actually looked at the plugin code and was suspicious of those lines, but couldn't understand them well enough to specifically point to a bug. Looking at it now—yeah, wow, that line is completely confusing "length of a string" with "byte size of a type". Most 32-bit integers are longer than four decimal digits! (Of course strtol returns long, which might be 64 bits, but the same applies to 8 decimal digits not covering the 64-bit int range—in fact, 8 decimal digits don't even cover 32 bits!) As a stopgap measure, it might make sense to just remove that first condition and simply check that the string is exclusively digits, at whatever length. We would also technically need to check for over/underflow, as 32 characters can exceed even a 64-bit int.
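(Side note—my own reading, not verified against the plugin source: the bogus Windows value above looks like the option string's own bytes. 808464433 is 16r30303031, i.e. the first four characters '1000' of the string '10000' read as a little-endian integer. A quick playground check:)

808464433 printString: 16 "=> '30303031': the characters $1 $0 $0 $0 in little-endian memory order"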
Of course the right solution, as implied in the comments in the plugin code, is for the image to pass the values as their actual types, for the primitive to convert them to an appropriate encoding in a void* pointer, and to pass that to the sqSocketSetOptions implementation, which would use it pretty much as-is. Honestly, this seems not especially hard? But I don't have the ability to compile the VM (I tried on a brand-new Debian 11 VM and the result segfaulted precisely when using sockets, go figure) and don't know quite enough C to go it alone.
But then this starts sounding an awful lot like what the FFI marshaling layer is already doing, which really brings me back around to a pet peeve of mine: why is so much of the sockets implementation in primitives that are just thin wrappers over syscalls? I know we need a tiny bit of abstraction to handle different platforms, but that could be handled by switching to different helper classes or something in the image, with just a very few primitives for the parts that absolutely cannot be done any other way (I'm thinking some of the stuff with aio in the name may qualify here, but I'd still want to keep as much as possible in the image).
In any case, we still have the weird behavior on Windows where we can have SO_SNDBUF at its default of 65536, send a multi-megabyte message, and Windows will happily accept the entire thing in one chunk (send(2) will return the full length you give it) and actually succeed in sending it, assuming the socket is not closed in the meantime. Where the hell is it putting all that data? Isn't this what the send buffer size is supposed to mean?
Another approach to solving the bug could be to change the condition on line 1539 to opt->isint, with isint being a new field that indicates whether the option expects an int or not. The table ‘Socket-Level Options’ in section ‘2.10.16 Use of Options’ in the System Interfaces volume of POSIX.1-2017 has a column ‘Parameter Type’.
As for the Windows behavior, that’s not a newly introduced problem that might require reverting recent changes as far as I understand, so you may need to make #testFlushLargeMessageOtherEndClosed signal TestSkipped for now when running on Windows, or change it to test the ‘current’ rather than ‘desired’ behavior with respect to the signaling of ConnectionClosed.
Yes, explicitly specifying the type is clearly the correct behavior—that's more in line with what I was thinking of in terms of refactoring so that the image passes the parameters as their correct types rather than as strings. The declaration on the C side would still be needed for getsockopt(), in order to correctly interpret the return value, and we could use it for error-checking—if the image passes the incorrect type, fail the primitive rather than feeding garbage to setsockopt().
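Image-side, that might look something like this (a purely hypothetical sketch—setOption:toInteger: and setOption:toBoolean: do not exist today):

"Hypothetical typed variants of setOption:value: (these selectors do not exist):
the primitive would receive a properly tagged value and fail on a type
mismatch instead of parsing strings."
socket setOption: 'SO_SNDBUF' toInteger: 65536.
socket setOption: 'TCP_NODELAY' toBoolean: true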
Makes sense re: skipping the test on Windows. Just did that.
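Concretely, the guard looks roughly like this (a sketch; self skip: and OSPlatform current isWindows are the standard Pharo idioms):

testFlushLargeMessageOtherEndClosed
	"send() on Windows reports the whole oversized message as accepted (see
	discussion above), so the ConnectionClosed behavior under test never shows."
	OSPlatform current isWindows
		ifTrue: [ ^ self skip: 'send() on Windows accepts writes far beyond SO_SNDBUF' ].
	"rest of the test unchanged"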
The last build failed due to issue #16443. That seems to have been fixed, so another build could be started, but its Windows test run will probably fail now due to issue #16450.
@jecisc No new build was started. I think you need to wait a bit longer for Jenkins to have processed the ‘REMOVED event’ before reopening.
I tried it in pull request #16449. I first closed and immediately reopened it. Then I closed it again and waited a bit longer before I reopened it. A new build was only started the second time. In the ‘Multibranch Pipeline Events’ log you can see how the events were processed; note how the processing of the first ‘REMOVED event’ started at 22:24:40, which is after the timestamp 22:24:38 of the first ‘UPDATED event’, so the pull request was already reopened:
[Fri Apr 12 22:24:40 CEST 2024] Received Pull request #16449 closed […] REMOVED event […] with timestamp Fri Apr 12 22:24:35 CEST 2024
[…]
Getting remote pull request #16449...
[…]
No changes detected: PR-16449 […]
[…]
1 pull requests were processed
[…]
[Fri Apr 12 22:24:42 CEST 2024] Pull request #16449 closed […] REMOVED event […] processed in 1.5 sec
[Fri Apr 12 22:24:43 CEST 2024] Received Pull request #16449 reopened […] UPDATED event […] with timestamp Fri Apr 12 22:24:38 CEST 2024
[…]
Getting remote pull request #16449...
[…]
No changes detected: PR-16449 […]
[…]
1 pull requests were processed
[…]
[Fri Apr 12 22:24:45 CEST 2024] Pull request #16449 reopened […] UPDATED event […] processed in 1.3 sec
[Fri Apr 12 22:25:26 CEST 2024] Received Pull request #16449 closed […] REMOVED event […] with timestamp Fri Apr 12 22:25:21 CEST 2024
[…]
Getting remote pull request #16449...
0 pull requests were processed
[…]
[Fri Apr 12 22:25:27 CEST 2024] Pull request #16449 closed […] REMOVED event […] processed in 0.83 sec
[Fri Apr 12 22:25:39 CEST 2024] Received Pull request #16449 reopened […] UPDATED event […] with timestamp Fri Apr 12 22:25:34 CEST 2024
[…]
Getting remote pull request #16449...
[…]
Pull Request reopened: PR-16449 […]
Scheduled build for branch: PR-16449
[…]
1 pull requests were processed
[…]
[Fri Apr 12 22:25:40 CEST 2024] Pull request #16449 reopened […] UPDATED event […] processed in 1.5 sec
Tests passed! Except a known random failure.
Should we integrate?
and now we have a conflict :P
Right, I remember seeing the issue whose fix created that conflict. As it happens I have some further refinement, but in any case, conflict resolved.
It's green!
There are no test results for Windows in the most recent build’s results though. The output of the ‘Allocate node’ step for the Windows test run in that build (build 14) shows ‘pharo-ci-jenkins2-win10’ was used (which doesn’t actually run the tests, see issue #15105):
Running on pharo-ci-jenkins2-win10 in C:\builds\workspace\est_and_branch_Pipeline_PR-16172