LibWeb: The Worker tests frequently time out on macOS
For example:
- worker-blob: https://github.com/LadybirdBrowser/ladybird/actions/runs/10653904533/job/29532039538
- worker-crypto: https://github.com/LadybirdBrowser/ladybird/actions/runs/11094612138/job/30822271449
- worker-location: https://github.com/LadybirdBrowser/ladybird/actions/runs/10744279351/job/29800858575
These tests were disabled in:
- https://github.com/LadybirdBrowser/ladybird/pull/1305
- https://github.com/LadybirdBrowser/ladybird/pull/1567
- https://github.com/LadybirdBrowser/ladybird/pull/1627
I ran into this while running the Text/input/HTML/DedicatedWorkerGlobalScope-instanceof.html test locally. I ended up doing a deep dive into what is happening over on the Discord. Here's what I found:
1. A `WebContent` thread receives a request to start a Worker
2. It makes a new `Worker` instance
3. That worker creates two `MessagePort`s
4. The `MessagePort`s are entangled and adopt fds 13 and 14 (they're always consistent when running one test)
5. A `WebWorker` process is started via an `IPC::Connection` (fd 15)
6. One of the `MessagePort`s (fd 13) is transferred to the `WebWorker` process via the `sendmsg` syscall. Here is an overview trace of the process:
   i. A new `HTML::TransferDataHolder` is created
   ii. `transfer_steps()` is called on the `MessagePort`
   iii. `transfer_steps()` moves the `fd` into an `IPC::File` in the `HTML::TransferDataHolder`
   iv. `MessagePort::send_message_on_socket` is then called, which in turn calls `IPC::Encoder::encode()`
   v. That ends up calling `ErrorOr<void> encode(Encoder& encoder, File const& file)` in `LibIPC/Encoder.cpp`
   vi. That `encode()` overload moves the file descriptor (fd 13) into the `Encoder`'s internal `m_buffer` (an instance of `MessageBuffer`) via `append_file_descriptor`
   vii. The call to `MessageBuffer::append_file_descriptor` wraps the fd (13) in an `AutoCloseFileDescriptor`
   viii. `MessageBuffer::transfer_message` is then called with the `Connection`'s fd (15)
   ix. That ends up calling `LocalSocket::send_message`
   x. Which then calls `System::sendmsg` with `cmsg_level = SOL_SOCKET` and `cmsg_type = SCM_RIGHTS`
7. While that data is in flight, the `AutoCloseFileDescriptor` created in step 6.vii falls out of scope, calling `close(13)`. This is the source of the bug.
8. The `WebWorker` then receives the sent fd via `recvmsg` (inside `LocalSocket::receive_message`) as fd 8
9. At this point, due to calling `close()` before `recvmsg`, fd 8 has a `SO_RCVLOWAT` value of `0` (normally it should be `1`). This value controls the minimum amount of data (in bytes) that must be present in the stream for it to be considered "readable". A value of `1` waits for some data, but a value of `0` means "always treat this stream as readable even if there is nothing in it". Additionally, `SO_RCVBUF` is also `0`.
10. fd 8 is then passed to a new `MessagePort` in the `WebWorker` via `MessagePort::transfer_receiving_steps`
11. Inside `MessagePort::transfer_receiving_steps` it calls `Core::LocalSocket::adopt_fd()`, which itself creates and registers a `Notifier` for fd 8 with `EventLoopManagerUnix`
12. `EventLoopManagerUnix` polls with no timeout (0), and gets `POLLIN` back on fd 8, so it notifies that there is content to read
13. The `MessagePort` tries to read
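For anyone who wants to poke at steps 6 to 8 outside of Ladybird, here is a minimal sketch of the same sequence using plain POSIX calls: pass a socket over another socket with `SCM_RIGHTS`, close the sender's copy before the receiver has called `recvmsg`, then use the received descriptor. The names `send_fd`, `receive_fd` and `round_trip_with_early_close` are made up for this sketch and are not LibIPC APIs. It only reproduces the ordering, not the macOS misbehaviour itself.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: sends `fd_to_send` over the Unix socket `channel`
// as SCM_RIGHTS ancillary data, like step 6.x.
bool send_fd(int channel, int fd_to_send)
{
    assert(channel >= 0 && fd_to_send >= 0);
    char payload = 'F'; // one byte of ordinary data carries the control message
    iovec iov { &payload, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};

    msghdr msg {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(channel, &msg, 0) == 1;
}

// Hypothetical helper: receives one fd from `channel`, or returns -1.
int receive_fd(int channel)
{
    char payload = 0;
    iovec iov { &payload, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};

    msghdr msg {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    if (recvmsg(channel, &msg, 0) != 1)
        return -1;
    for (cmsghdr* cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            int fd = -1;
            std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
            return fd;
        }
    }
    return -1;
}

// Mirrors the ordering in the trace: send, close the sender's copy, then receive.
bool round_trip_with_early_close()
{
    int channel[2]; // stands in for the IPC::Connection socket
    int ports[2];   // stands in for the two entangled MessagePort sockets
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, channel) < 0)
        return false;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, ports) < 0)
        return false;

    bool ok = send_fd(channel[0], ports[0]);
    close(ports[0]); // sender drops its copy while the message is still queued
    int received = ok ? receive_fd(channel[1]) : -1;
    ok = ok && received >= 0;

    // The received descriptor should still be connected to ports[1].
    char out = 'x';
    char in = 0;
    ok = ok && write(ports[1], &out, 1) == 1 && read(received, &in, 1) == 1 && in == out;

    if (received >= 0)
        close(received);
    close(ports[1]);
    close(channel[0]);
    close(channel[1]);
    return ok;
}
```

Calling `round_trip_with_early_close()` is expected to succeed on a system without the bug; on an affected macOS machine the interesting part would be inspecting the socket options of `received` before reading from it.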
If the parent thread (`WebContent`) managed to send a message by this point, then the test will (likely) pass. However, if it did not, then the `WebWorker` thread will read no data, thus triggering `PosixSocketHelper::did_reach_eof_on_read`, which disables the notifier. Now, even if the `WebContent` thread writes to the socket, the `WebWorker` is no longer listening, and we get a hang.
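To make the hang easier to reason about, here is a toy model of that poll-then-read logic. `PortReader`, `Tick` and the `spurious_readable` switch are all invented for this sketch; the switch just pretends `poll()` reported `POLLIN` on an empty socket, which is what the broken `SO_RCVLOWAT` seems to cause. Treating any empty read as EOF is a simplification of what the real notifier code does.

```cpp
#include <cassert>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

enum class Tick {
    Idle,                   // poll() says nothing to read
    GotData,                // poll() said readable and we read a byte
    DisabledAfterEmptyRead, // poll() said readable, read got nothing: treated as EOF
    NotListening,           // notifier already disabled, socket is ignored
};

// Hypothetical stand-in for a socket plus its read notifier.
struct PortReader {
    int fd { -1 };
    bool notifier_enabled { true };

    // `spurious_readable` is a made-up switch that mimics poll() reporting
    // POLLIN even though the socket is empty.
    Tick tick(bool spurious_readable = false)
    {
        assert(fd >= 0);
        if (!notifier_enabled)
            return Tick::NotListening;

        pollfd pfd { fd, POLLIN, 0 };
        bool readable = spurious_readable
            || (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN));
        if (!readable)
            return Tick::Idle;

        char byte = 0;
        ssize_t nread = recv(fd, &byte, 1, MSG_DONTWAIT);
        if (nread > 0)
            return Tick::GotData;

        // Nothing was read: assume EOF and stop listening for good.
        notifier_enabled = false;
        return Tick::DisabledAfterEmptyRead;
    }
};
```

With a healthy socket, `tick()` stays `Idle` until the peer writes. With `tick(true)` on an empty socket the reader disables itself, and every later `tick()` returns `NotListening` even after the peer writes, which is the hang in miniature.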
I have reported this to Apple as a bug, but it looks like they've known about this since 2011, so it seems unlikely to be fixed.
Solutions I have already tried that did not work:
- Set `SO_RCVLOWAT` and `SO_RCVBUF` back to their expected values before the call to `LocalSocket::adopt_fd`
- Set `SO_RCVLOWAT` and `SO_RCVBUF` and also do a fake `read()` and/or `poll()` to try to flush whatever bad cache is going on
- `dup()` the received `fd`
- Delaying the `close()` in the `AutoCloseFileDescriptor` destructor by wrapping it in `Web::Platform::EventLoopPlugin::the().deferred_invoke()` (couldn't get it to compile)
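For reference, the first attempt looks roughly like the sketch below: read the two socket options with `getsockopt`, then force them back with `setsockopt`. The helper names and the `ReceiveOptions` struct are made up here, and the buffer size to restore is left to the caller because the "expected" `SO_RCVBUF` value is platform-dependent. As noted above, this did not fix the hang on macOS; the sketch is only useful for inspecting the state.

```cpp
#include <cassert>
#include <sys/socket.h>

// Hypothetical snapshot of the two options discussed in this issue.
struct ReceiveOptions {
    int low_water_mark { -1 }; // SO_RCVLOWAT
    int buffer_size { -1 };    // SO_RCVBUF
};

// Reads SO_RCVLOWAT and SO_RCVBUF from `fd`. Returns false if either call fails.
bool read_receive_options(int fd, ReceiveOptions& out)
{
    assert(fd >= 0);
    socklen_t len = sizeof(int);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &out.low_water_mark, &len) < 0)
        return false;
    len = sizeof(int);
    return getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &out.buffer_size, &len) == 0;
}

// Tries to force both options back to caller-supplied values.
// The kernel may round or scale the buffer size it actually stores.
bool restore_receive_options(int fd, int low_water_mark, int buffer_size)
{
    assert(fd >= 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &low_water_mark, sizeof(int)) < 0)
        return false;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buffer_size, sizeof(int)) == 0;
}
```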
I think our only real remaining option is to change the fd sending logic to require an ACK of some kind before closing the socket, at least on macOS.
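A rough sketch of what an ACK-before-close handshake could look like with plain POSIX calls follows. None of this is existing LibIPC code: `kAckByte`, `send_fd_keep_open`, `receive_fd_and_ack` and `close_after_ack` are hypothetical names, and a real implementation would have to fold the ACK into the IPC message framing rather than writing a raw byte to the connection. The sender side is split into two phases so it can be driven from an event loop instead of blocking.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

constexpr char kAckByte = 'A'; // hypothetical wire value for the acknowledgement

// Sender, phase 1: queue the fd on the channel but keep our own copy open.
bool send_fd_keep_open(int channel, int fd_to_send)
{
    assert(channel >= 0 && fd_to_send >= 0);
    char payload = 'F';
    iovec iov { &payload, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};

    msghdr msg {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(channel, &msg, 0) == 1;
}

// Receiver: take the fd out of the channel, then acknowledge it.
int receive_fd_and_ack(int channel)
{
    char payload = 0;
    iovec iov { &payload, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};

    msghdr msg {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    if (recvmsg(channel, &msg, 0) != 1)
        return -1;

    int received = -1;
    for (cmsghdr* cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            std::memcpy(&received, CMSG_DATA(cmsg), sizeof(int));
            break;
        }
    }
    if (received >= 0 && write(channel, &kAckByte, 1) != 1) {
        close(received);
        return -1;
    }
    return received;
}

// Sender, phase 2: only close our copy once the receiver has acknowledged.
bool close_after_ack(int channel, int sent_fd)
{
    char ack = 0;
    if (read(channel, &ack, 1) != 1 || ack != kAckByte)
        return false;
    return close(sent_fd) == 0;
}
```

The point of the split is that `close()` on the sent descriptor can never happen before the peer's `recvmsg`, which is the ordering that appears to trigger the bug.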
Thanks to several other users over on the Discord including Agni, Tim Flynn, Andrew Kaster, cv01, CxByte, and Sam Atkins for helping debug this issue.
@awesomekling had the idea of using Mach ports on macOS instead of sending file descriptors around.
Disabled Text/input/HTML/MessagePort-MessageEvents-should-be-trusted.html in #2620. Seems to be the same issue (worker messages).
Disabled Text/input/wpt-import/hr-time/timeOrigin.html in #3381, as it seems to be timing out due to this issue.
Here's my latest on trying to implement Mach Port IPC:
https://github.com/LadybirdBrowser/ladybird/compare/master...ADKaster:ladybird-browser:ipc-mach?expand=1
I got a bit lost in the sauce trying to set up the Mach port handshake. Looking at how Mojo does it for Chromium, it seemed like having two initial states for Mach Transport was a good idea:
1. Created with both a receive right to self and a send right to the remote port
2. Created with only a receive right to self

If a port is created in state 1, it would first send a send right to self to the remote port before processing any caller messages.
If a port is created in state 2, it would first receive a send right to the remote port before processing any caller messages.
This handshake would allow a central arbiter like the UI process to create the "complete" connections, and the leaf processes to handle creating incomplete transport connections.
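To pin down the two-state idea without touching real Mach APIs, here is a toy in-memory model. Everything in it is hypothetical: `ToyPort` stands in for a Mach port (holding a pointer to one models holding a send right), and `ToyTransport` is not the class from the branch above. Real Mach ports deliver rights and payloads through the same message queue; the model keeps them in separate queues for brevity.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Toy stand-in for a Mach port. Owning the object models the receive right.
struct ToyPort {
    std::deque<std::string> messages;     // ordinary payloads
    std::deque<ToyPort*> incoming_rights; // send rights handed to us by a peer
};

class ToyTransport {
public:
    // State 1: receive right to self plus a send right to the remote port.
    static ToyTransport complete(ToyPort& self, ToyPort& remote) { return ToyTransport(&self, &remote); }
    // State 2: only a receive right to self.
    static ToyTransport incomplete(ToyPort& self) { return ToyTransport(&self, nullptr); }

    // Must succeed before any caller message is processed.
    bool handshake()
    {
        if (m_ready)
            return true;
        if (m_remote) {
            // State 1: give the peer a send right to our own port.
            m_remote->incoming_rights.push_back(m_self);
            m_ready = true;
        } else if (!m_self->incoming_rights.empty()) {
            // State 2: pick up the send right the peer gave us.
            m_remote = m_self->incoming_rights.front();
            m_self->incoming_rights.pop_front();
            m_ready = true;
        }
        return m_ready;
    }

    bool send(std::string message)
    {
        if (!handshake())
            return false;
        m_remote->messages.push_back(std::move(message));
        return true;
    }

    std::optional<std::string> receive()
    {
        if (!handshake() || m_self->messages.empty())
            return std::nullopt;
        std::string message = std::move(m_self->messages.front());
        m_self->messages.pop_front();
        return message;
    }

private:
    ToyTransport(ToyPort* self, ToyPort* remote)
        : m_self(self)
        , m_remote(remote)
    {
        assert(self);
    }

    ToyPort* m_self;
    ToyPort* m_remote;
    bool m_ready { false };
};
```

In this model a leaf created with `incomplete()` cannot send anything until the arbiter-side transport created with `complete()` has run its half of the handshake, which matches the ordering described above.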
But there are a lot of unknowns with this model:
- Can you really have multiple distinct send rights to a process? Or do we need a central router for IPC connections in each process with an "IPC connection ID" to route based on?
- How well does this handle the multi-hop IPC transport handle transfer model we've adopted using `connect_new_client` type IPC methods?
- ??
Anyway, if someone else (@trflynn89 ?) could be yak-baited into picking this up, I'd be grateful 😅