ladybird icon indicating copy to clipboard operation
ladybird copied to clipboard

LibWeb: The Worker tests frequently time out on macOS

Open trflynn89 opened this issue 1 year ago • 5 comments

For example:

worker-blob: https://github.com/LadybirdBrowser/ladybird/actions/runs/10653904533/job/29532039538 worker-crypto: https://github.com/LadybirdBrowser/ladybird/actions/runs/11094612138/job/30822271449 worker-location: https://github.com/LadybirdBrowser/ladybird/actions/runs/10744279351/job/29800858575

These tests were disabled in:

  • https://github.com/LadybirdBrowser/ladybird/pull/1305
  • https://github.com/LadybirdBrowser/ladybird/pull/1567
  • https://github.com/LadybirdBrowser/ladybird/pull/1627

trflynn89 avatar Sep 06 '24 20:09 trflynn89

I ran into this while running the Text/input/HTML/DedicatedWorkerGlobalScope-instanceof.html test locally. I ended up doing a deep dive into what is happening over on the Discord. Here's what I found:

  1. A WebContent thread receives a request to start a Worker
  2. It makes a new Worker instance
  3. That worker creates two MessagePorts
  4. The MessagePorts are entangled and adopt fds 13 and 14 (they're always consistent when running one test)
  5. A WebWorker process is started via an IPC::Connection (fd 15)
  6. One of the MessagePorts (fd 13) is transferred to the WebWorker process via the sendmsg syscall. Here is an overview trace of the process:
    1. A new HTML::TransferDataHolder is created
    2. transfer_steps() is called on the MessagePort
    3. transfer_steps() moves the fd into an IPC::File in the HTML::TransferDataHolder
    4. MessagePort::send_message_on_socket is then called, which in turn calls IPC::Encoder::encode()
    5. That ends up calling ErrorOr<void> encode(Encoder& encoder, File const& file) in LibIPC/Encoder.cpp
    6. That encode() overload moves the file descriptor (fd 13) into the Encoder's internal m_buffer via append_file_descriptor, which is an instance of MessageBuffer
    7. The call to MessageBuffer::append_file_descriptor wraps the fd (13) in an AutoCloseFileDescriptor
    8. MessageBuffer::transfer_message is then called with the Connection's fd (15)
    9. That ends up calling LocalSocket::send_message
    10. Which then calls System::sendmsg with cmsg_level = SOL_SOCKET and cmsg_type = SCM_RIGHTS
  7. While that data is in flight, the AutoCloseFileDescriptor created in step 6.vii falls out of scope, calling close(13). This is the source of the bug.
  8. The WebWorker then receives the sent fd via recvmsg (inside LocalSocket::receive_message) as fd 8
  9. At this point, due to calling close() before recvmsg, fd 8 has a SO_RCVLOWAT value of 0 (normally it should be 1). This value controls the minimum amount of data (in bytes) that must be present in the stream for it to be considered "readable". A value of 1 waits for some data, but a value of 0 means "always treat this stream as readable even if there is nothing it it". Additionally SO_RCVBUF is also 0.
  10. fd 8 is then passed to a new MessagePort in the WebWorker via MessagePort::transfer_receiving_steps
  11. Inside MessagePort::transfer_receiving_steps it calls Core::LocalSocket::adopt_fd() which itself creates and registers a Notifier for fd 8 with EventLoopManagerUnix
  12. EventLoopManagerUnix polls with no timeout (0), and gets POLLIN back on fd 8, so notifies that there is content to read.
  13. The MessagePort tries to read

If the parent thread (WebContent) managed to send a message by this point, then the test will (likely) pass. However if it did not, then the WebWorker thread will read no data, thus triggering PosixSocketHelper::did_reach_eof_on_read, which disables the notifier. Now even if the WebContent thread writes to the socket the WebWorker is no longer listening, and we get a hang.

I have reported this to Apple as a bug, but it looks like they've known about this since 2011, so seems unlikely it will be fixed.

Solutions I have already tried that did not work:

  • Set SO_RCVLOWAT and SO_RCVBUF back to their expected values before the call to LocalSocket::adopt_fd
  • Set SO_RCVLOWAT and SO_RCVBUF and also do a fake read() and/or poll() to try to flush whatever bad cache is going on
  • dup() the received fd
  • Delaying the close() in the AutoCloseFileDescriptor destructor by wrapping the it in Web::Platform::EventLoopPlugin::the().deferred_invoke() (couldn't get it to compile)

I think our only real remaining option is to change the fd sending logic to require an ACK of some kind before closing the socket, at least on macOS.


Thanks to several other users over on the Discord including Agni, Tim Flynn, Andrew Kaster, cv01, CxByte, and Sam Atkins for helping debug this issue.

thislooksfun avatar Oct 16 '24 04:10 thislooksfun

@awesomekling had the idea of using Mach ports on macOS instead of sending file descriptors around.

thislooksfun avatar Oct 16 '24 05:10 thislooksfun

Disabled Text/input/HTML/MessagePort-MessageEvents-should-be-trusted.html in #2620. Seems to be the same issue (worker messages).

AtkinsSJ avatar Nov 28 '24 15:11 AtkinsSJ

Disabled Text/input/wpt-import/hr-time/timeOrigin.html in #3381, as it seems to be timing out due to this issue.

tcl3 avatar Jan 30 '25 15:01 tcl3

Here's my latest on trying to implement Mach Port IPC:

https://github.com/LadybirdBrowser/ladybird/compare/master...ADKaster:ladybird-browser:ipc-mach?expand=1

I got a bit lost in the sauce trying to set up the mach port handshake. Looking at how mojo does it for chromium, it seemed like having two initial states for Mach Transport was a good idea:

  1. Created with both a receive right to self, and a send right to the remote port
  2. Created only with a receive right to self.

If a port is created in state 1, it would first send a send right to self to the remote port before processing any caller messages

If a port is created in state 2, it would first receive a send right to the remote port before processing any caller messages

This handshake would allow a central arbiter like the UI process to create the "complete" connections, and the leaf processes to handle creating incomplete transport connections.

But there's a lot of unknowns with this model:

  • Can you really have multiple distinct send rights to a process? Or do we need a central router for IPC connections in each process with an "IPC connection ID" to route based on?
  • How well does this handle the multi-hop IPC transport handle transfer model we've adopted using connect_new_client type IPC methods?
  • ??

Anyway, if someone else (@trflynn89 ?) could be yak-baited into picking this up, I'd be grateful 😅

ADKaster avatar Feb 18 '25 15:02 ADKaster