libusbmuxd iproxy (or socket_send) should better guard against EAGAIN

iproxy contains a loop which forwards data between host and device. All it does is reading data from one socket and writing into another. Writing is implemented like this:

int sent = 0;
while (sent < r) {
  int s = socket_send(cdata->sfd, buffer+sent, r-sent);
  if (s <= 0) {
    break;
  }
  sent += s;
}

Where socket_send is defined like this in libimobiledevice-glue:

int socket_send(int fd, void *data, size_t length)
{
	int flags = 0;
	int res = socket_check_fd(fd, FDM_WRITE, SEND_TIMEOUT);
	if (res <= 0) {
		return res;
	}
#ifdef MSG_NOSIGNAL
	flags |= MSG_NOSIGNAL;
#endif
	int s = (int)send(fd, data, length, flags);
	if (s < 0) {
#ifdef _WIN32
		errno = WSAError_to_errno(WSAGetLastError());
#endif
		return -errno;
	}
	return s;
}

Where socket_check_fd does poll or select on fd waiting for it to become writable.

The code is reasonable: it expects that socket_check_fd completing successfully implies that send will either complete successfully or fail with some non-transient error. It turns out that this expectation does not always hold: send might still occasionally return EAGAIN even if socket_check_fd returns successfully.

This happens on Mac OS X when socket_check_fd is using select (rather than poll): select might return that fd is writable but send will fail with EAGAIN. This leads to a very confusing data loss: socket_send fails to forward a chunk of data but the socket is still perfectly fine and subsequent writes and reads from this socket succeed just fine.

poll+send does not seem to have such a bug, at least I could not reproduce it in my testing - however it might still be a good idea to guard against EAGAIN somewhere, e.g. by retrying the whole socket_check_fd and send dance.

Mar 26 '25 12:03 mraleph

Interesting observation. I would say this should go directly into the socket_send implementation. Would it be enough to just call send again when EAGAIN occurred, or should socket_check_fd also be called again?

Mar 26 '25 22:03 nikias

Normal eagain handling is that you call send again in loop, until you either success or fail with different error, of course you should put some delay between those calls.

on socket in blocking mode, EAGAIN normally doesn't appear, because you should do select first before send/receive, which will tell you if you can use socket or not, but if you use socket in noblocking mode, then select only tells if socket is useable, it does't tell you, whether is busy or not. if it's busy, send/receive will throw EAGAIN, which prompts you to retry operation.

Mar 27 '25 07:03 mexmer

one more addon, on windows sockets, not using cygwin or any posix engine, but windows sockets, equivalent of this is error WSAEWOULDBLOCK, if you receive this, you should retry operation. i have following code in one of older apps (dont't comment on code quality)

	DWORD sleepTime = 40;
	bool doAgain;
	do {
		doAgain = false;
		INT Sent = ::send(GetSocket(), (CONST CHAR*)aBuffer, (INT)aLength, 0);
		if (Sent == SOCKET_ERROR) {
			Status = ::WSAGetLastError();
			if (Status == WSAEWOULDBLOCK && sleepTime < 60 * 1000) {
				_LOGI_("[0x%p] Send() returned WSAEWOULDBLOCK trying again\n", GetSocket());
				Sleep(sleepTime);
				sleepTime += sleepTime;
				doAgain = true;
				Status = ERROR_SUCCESS;
			} else {
				_LOGE_("[0x%p] Send() failed (%ld)\n", GetSocket(), Status);
			}
		} else {
			if ((size_t)Sent != aLength)
			{
				_LOGW_("[0x%p] Send() sent only %lu(0x%lx) bytes", GetSocket(), (size_t)Sent, (size_t)Sent);
			}
			aSent = (size_t)Sent;
		}
	} while (doAgain);

Mar 27 '25 07:03 mexmer

Interesting observation. I would say this should go directly into the socket_send implementation. Would it be enough to just call send again when EAGAIN occurred, or should socket_check_fd also be called again?

I would just put a loop around send with an exponential back off similar to what @mexmer has provided.

I think fundamentally it is some sort of a kernel bug[^1] - I think it is reasonable to expect that send does not fail with EAGAIN after select has indicated that fd is writable. I have written a pure C reproduction of this erratic behavior. See this gist you can compile it with USE_SELECT and USE_POLL and observe something like this:

$ clang -o sock-test -DUSE_POLL test.c && sock-test
accepting connection
sending bytes
connected, reading bytes
^C

$ clang -o sock-test -DUSE_SELECT test.c && sock-test                                                                                                                                                                                                                                                                              accepting connection
sending bytes
connected, reading bytes
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)
!!! should not happen, but got EAGAIN from send, retrying (attempt 1)

I think the other option (in addition or maybe even instead of retrying) is to harden the loop itself against socket_send errors.

I think if any of socket_send fail the outer loop should not continue - and instead you should break out of the outer while(1) and close both fd and sfd.

[^1]: I have spent some time reading XNU source - but all I could see is that select and kqueue (on top of which poll is implemented) seemingly use almost the same code for deciding whether AF_UNIX socket is writable. The former calls sowriteable and the later make decision in filt_sowrite_common which are almost identical as if copy&pasted, but there a subtle difference - which might explain why select exhibits strange behavior and poll does not. Anyway - hard to say without putting the whole thing into a kernel debugger and it is not very easy with XNU - so whatever.

Mar 27 '25 13:03 mraleph