illumos-joyent
illumos-joyent copied to clipboard
lx: setsockopt TCP_NODELAY can return EINVAL
On illumos attempting to set TCP_NODELAY
on a connection can very occasionally return EINVAL. This is unexpected behavior on Linux, where the call is expected to succeed. Unfortunately, I'm unable to reproduce this ondemand, calls to set TCP_NODELAY
work for several hours before one fails. I've found a similar report from the JVM: JDK-6378870.
In my case, I'm running Envoy, which aborts from a failed assertion in connection_impl.cc.
The only obvious way I see this function failing is if:
439 case TCP_NODELAY:
440 if (optlen != sizeof (int32_t))
441 return (EINVAL);
the optlen is wrong, and your link shows the code to be correct in what it passes for optlen. I see nothing immediately obvious in the LX path either.
You can run this DTrace script in the global zone (send output to a file, to tee(1), or at least to a many-lines-of-memory terminal) to catch LX instances running TCP_NODELAY. It will produce some chatter for every LX-brand-zone TCP_NODELAY call. Next time you see one, find the matching-timestamp (or one that doesn't return 0) and share it from entry to exit.
#!/usr/sbin/dtrace -FCs
lx_setsockopt_tcp:entry
{
printf("%Y\n", walltimestamp);
stack();
self->trace = 1;
}
lx_setsockopt_tcp:return
/self->trace == 1/
{
printf("Returns 0x%llx", arg1);
self->trace = 0;
}
fbt:ip::entry,fbt:lx_brand::entry,fbt:sockfs::entry
/self->trace == 1/
{
}
fbt:ip::return,fbt:lx_brand::return,fbt:sockfs::return
/self->trace == 1/
{
printf("Returns 0x%llx", arg1);
}
Thanks @danmcd. I'm running the probes now, but as I haven't yet deterministically reproduced it, it might take me a few hours to see it come up.
One other silly question: Is it POSSIBLE that your function in question got passed a socket that is not AF_UNIX (which returns) AND is not AF_INET or AF_INET6 (which are the two legitimate types for TCP_NODELAY)?
Code example, annotated:
// Don't set NODELAY for unix domain sockets
sockaddr addr;
socklen_t len = sizeof(addr);
int rc = getsockname(fd_, &addr, &len);
RELEASE_ASSERT(rc == 0);
if (addr.sa_family == AF_UNIX) { // KEBE ASKS -> do you know it's always AF_INET/INET6 ?!?
return;
}
I wouldn't expect sockets to make it to noDelay
without being AF_INET
, AF_INET6
, or AF_UNIX
. A Connection is created by a Listener, and Listener has a switch for those specific address families:
switch (ss.ss_family) {
case AF_INET: {
RELEASE_ASSERT(ss_len == 0 || ss_len == sizeof(sockaddr_in));
const struct sockaddr_in* sin = reinterpret_cast<const struct sockaddr_in*>(&ss);
ASSERT(AF_INET == sin->sin_family);
return std::make_shared<Address::Ipv4Instance>(sin);
}
case AF_INET6: {
RELEASE_ASSERT(ss_len == 0 || ss_len == sizeof(sockaddr_in6));
const struct sockaddr_in6* sin6 = reinterpret_cast<const struct sockaddr_in6*>(&ss);
ASSERT(AF_INET6 == sin6->sin6_family);
return std::make_shared<Address::Ipv6Instance>(*sin6);
}
case AF_UNIX: {
const struct sockaddr_un* sun = reinterpret_cast<const struct sockaddr_un*>(&ss);
ASSERT(AF_UNIX == sun->sun_family);
RELEASE_ASSERT(ss_len == 0 || ss_len >= offsetof(struct sockaddr_un, sun_path) + 1);
return std::make_shared<Address::PipeInstance>(sun);
}
default:
throw EnvoyException(fmt::format("Unexpected sockaddr family: {}", ss.ss_family));
}
Thanks for the context. Had to ask.
6 -> lx_setsockopt_tcp 2017 Aug 25 07:41:06
lx_brand`lx_setsockopt+0x315
lx_brand`lx_syscall_enter+0x16f
unix`sys_syscall+0x142
6 | lx_setsockopt_tcp:entry
6 -> lx_sockopt_lookup
6 <- lx_sockopt_lookup Returns 0x1
6 -> socket_setsockopt
6 -> so_setsockopt
6 <- so_setsockopt Returns 0x16
6 <- socket_setsockopt Returns 0x16
6 <- lx_setsockopt_tcp Returns 0x16
This is the only one that fails, and corresponds to the abort inside the zone.
It's noticeable, even to this untrained eye, that this never calls into tcp_setsockopt
from so_setsockopt
.
WONDERFUL. This confirms a phenomenon that @pfmooney brought to my attention. Look in so_setsockopt():
792 /* X/Open requires this check */
793 if (so->so_state & SS_CANTSENDMORE && !xnet_skip_checks) {
794 SO_UNBLOCK_FALLBACK(so);
795 if (xnet_check_print)
796 printf("sockfs: X/Open setsockopt check => EINVAL\n");
797 return (EINVAL);
798 }
So the question is this: How bug-for-bug compatible with Linux do we want to be in the face of this X/Open compliance?
After expanding the macros, that was the only section that stuck out to me as well.
I haven't confirmed the state of the socket, but it does seem a bit reasonable we'll be getting this bug in this case when we can't send more data to the peer.
As you've probably already noticed, in the Envoy source, there's a check for EINVAL
on macOS. I dug around in XNU, and found they also have SS_CANTSENDMORE
check as illumos, in sosetsockopt
.
if ((so->so_state & (SS_CANTRCVMORE | SS_CANTSENDMORE)) ==
(SS_CANTRCVMORE | SS_CANTSENDMORE) &&
(so->so_flags & SOF_NPX_SETOPTSHUT) == 0) {
/* the socket has been shutdown, no more sockopt's */
error = EINVAL;
goto out;
}
I don't see a similar check in Linux's sock_setsockopt.
I can modify the DTrace script to confirm that's precisely the problem, if you wish. Or you can use echo "xnet_check_print/W1 | mdb -k"
in the global zone to enable the kernel printf for this case (but it may be printing more noise than you want).
Otherwise, yeah, it becomes a how-bug-for-bug compatible do we wish to be? (Also, I take it you can't run this natively? :) )
Looking at BSD, I can't find a similar check in FreeBSD or NetBSD. But it looks like it exists in OpenBSD, though It's moved inside the SO_SNDBUF
and SO_RCVBUF
option cases.
I can modify the DTrace script to confirm that's the problem, if you wish.
I don't think that's necessary, at least not right now.
Also, I take it you can't run this natively?
I'd like to, but first I'd have to teach the build tool Bazel about other Unixen. Earlier this week I started down this road, but opted to run in an LX zone instead because I thought it would take too long to finish. :grimacing:
I opened OS-6312 to track this.
@danmcd For what its worth; Like @terinjokes we also spend a couple of days at running Envoy native on SmartOS / Illumos and also ended up with LX. Envoy is basically a really nice gRPC reverse proxy.
@jperkin I can't really say how big the community interest is but there seems to be atleast some gRPC community buzz around Envoy (the official gRPC docs also reference Envoy a couple of times); Might be worth looking into packaging for pkgsrc / SmartOS.
@siepkes I haven't tried to upstream them, because I don't think they'll be accepted, but I have some patches to make Envoy work in LX zones for me. See https://git.terinstock.com/#/q/status:open+project:envoy
As for packaging it for native zones, Envoy uses Bazel, which has no concept of Solaris, illumos, or SmartOS, so it's probably a bit more work than a downstream packaging maintainer wants to bite off for now. (I checked out Bazel locally and been trying to create patches for adding illumos, but I don't fully understand the internals yet).