illumos-joyent icon indicating copy to clipboard operation
illumos-joyent copied to clipboard

lx: setsockopt TCP_NODELAY can return EINVAL

Open terinjokes opened this issue 7 years ago • 15 comments

On illumos attempting to set TCP_NODELAY on a connection can very occasionally return EINVAL. This is unexpected behavior on Linux, where the call is expected to succeed. Unfortunately, I'm unable to reproduce this ondemand, calls to set TCP_NODELAY work for several hours before one fails. I've found a similar report from the JVM: JDK-6378870.

In my case, I'm running Envoy, which aborts from a failed assertion in connection_impl.cc.

terinjokes avatar Aug 24 '17 19:08 terinjokes

The only obvious way I see this function failing is if:

439  		case TCP_NODELAY:
440  			if (optlen != sizeof (int32_t))
441  				return (EINVAL);

the optlen is wrong, and your link shows the code to be correct in what it passes for optlen. I see nothing immediately obvious in the LX path either.

danmcd avatar Aug 24 '17 19:08 danmcd

You can run this DTrace script in the global zone (send output to a file, to tee(1), or at least to a many-lines-of-memory terminal) to catch LX instances running TCP_NODELAY. It will produce some chatter for every LX-brand-zone TCP_NODELAY call. Next time you see one, find the matching-timestamp (or one that doesn't return 0) and share it from entry to exit.

#!/usr/sbin/dtrace -FCs

lx_setsockopt_tcp:entry
{
	printf("%Y\n", walltimestamp);
	stack();
	self->trace = 1;
}

lx_setsockopt_tcp:return
/self->trace == 1/
{
	printf("Returns 0x%llx", arg1);
	self->trace = 0;
}

fbt:ip::entry,fbt:lx_brand::entry,fbt:sockfs::entry
/self->trace == 1/
{
}

fbt:ip::return,fbt:lx_brand::return,fbt:sockfs::return
/self->trace == 1/
{
	printf("Returns 0x%llx", arg1);
}

danmcd avatar Aug 24 '17 19:08 danmcd

Thanks @danmcd. I'm running the probes now, but as I haven't yet deterministically reproduced it, it might take me a few hours to see it come up.

terinjokes avatar Aug 24 '17 20:08 terinjokes

One other silly question: Is it POSSIBLE that your function in question got passed a socket that is not AF_UNIX (which returns) AND is not AF_INET or AF_INET6 (which are the two legitimate types for TCP_NODELAY)?

danmcd avatar Aug 24 '17 20:08 danmcd

Code example, annotated:

// Don't set NODELAY for unix domain sockets
  sockaddr addr;
  socklen_t len = sizeof(addr);
  int rc = getsockname(fd_, &addr, &len);
  RELEASE_ASSERT(rc == 0);

  if (addr.sa_family == AF_UNIX) { // KEBE ASKS -> do you know it's always AF_INET/INET6 ?!?
    return;
}

danmcd avatar Aug 24 '17 20:08 danmcd

I wouldn't expect sockets to make it to noDelay without being AF_INET, AF_INET6, or AF_UNIX. A Connection is created by a Listener, and Listener has a switch for those specific address families:

  switch (ss.ss_family) {
  case AF_INET: {
    RELEASE_ASSERT(ss_len == 0 || ss_len == sizeof(sockaddr_in));
    const struct sockaddr_in* sin = reinterpret_cast<const struct sockaddr_in*>(&ss);
    ASSERT(AF_INET == sin->sin_family);
    return std::make_shared<Address::Ipv4Instance>(sin);
  }
  case AF_INET6: {
    RELEASE_ASSERT(ss_len == 0 || ss_len == sizeof(sockaddr_in6));
    const struct sockaddr_in6* sin6 = reinterpret_cast<const struct sockaddr_in6*>(&ss);
    ASSERT(AF_INET6 == sin6->sin6_family);
    return std::make_shared<Address::Ipv6Instance>(*sin6);
  }
  case AF_UNIX: {
    const struct sockaddr_un* sun = reinterpret_cast<const struct sockaddr_un*>(&ss);
    ASSERT(AF_UNIX == sun->sun_family);
    RELEASE_ASSERT(ss_len == 0 || ss_len >= offsetof(struct sockaddr_un, sun_path) + 1);
    return std::make_shared<Address::PipeInstance>(sun);
  }
  default:
    throw EnvoyException(fmt::format("Unexpected sockaddr family: {}", ss.ss_family));
  }

terinjokes avatar Aug 24 '17 21:08 terinjokes

Thanks for the context. Had to ask.

danmcd avatar Aug 24 '17 21:08 danmcd

  6    -> lx_setsockopt_tcp 2017 Aug 25 07:41:06

              lx_brand`lx_setsockopt+0x315
              lx_brand`lx_syscall_enter+0x16f
              unix`sys_syscall+0x142

  6    | lx_setsockopt_tcp:entry 
  6        -> lx_sockopt_lookup 
  6        <- lx_sockopt_lookup Returns 0x1
  6        -> socket_setsockopt 
  6            -> so_setsockopt 
  6            <- so_setsockopt Returns 0x16
  6        <- socket_setsockopt Returns 0x16
  6    <- lx_setsockopt_tcp Returns 0x16

This is the only one that fails, and corresponds to the abort inside the zone.

It's noticeable, even to this untrained eye, that this never calls into tcp_setsockopt from so_setsockopt.

terinjokes avatar Aug 25 '17 17:08 terinjokes

WONDERFUL. This confirms a phenomenon that @pfmooney brought to my attention. Look in so_setsockopt():

792  	/* X/Open requires this check */
793  	if (so->so_state & SS_CANTSENDMORE && !xnet_skip_checks) {
794  		SO_UNBLOCK_FALLBACK(so);
795  		if (xnet_check_print)
796  			printf("sockfs: X/Open setsockopt check => EINVAL\n");
797  		return (EINVAL);
798  	}

So the question is this: How bug-for-bug compatible with Linux do we want to be in the face of this X/Open compliance?

danmcd avatar Aug 25 '17 17:08 danmcd

After expanding the macros, that was the only section that stuck out to me as well.

I haven't confirmed the state of the socket, but it does seem a bit reasonable we'll be getting this bug in this case when we can't send more data to the peer.

As you've probably already noticed, in the Envoy source, there's a check for EINVAL on macOS. I dug around in XNU, and found they also have SS_CANTSENDMORE check as illumos, in sosetsockopt.

if ((so->so_state & (SS_CANTRCVMORE | SS_CANTSENDMORE)) ==
    (SS_CANTRCVMORE | SS_CANTSENDMORE) &&
    (so->so_flags & SOF_NPX_SETOPTSHUT) == 0) {
    /* the socket has been shutdown, no more sockopt's */
    error = EINVAL;
    goto out;
}

I don't see a similar check in Linux's sock_setsockopt.

terinjokes avatar Aug 25 '17 18:08 terinjokes

I can modify the DTrace script to confirm that's precisely the problem, if you wish. Or you can use echo "xnet_check_print/W1 | mdb -k" in the global zone to enable the kernel printf for this case (but it may be printing more noise than you want).

Otherwise, yeah, it becomes a how-bug-for-bug compatible do we wish to be? (Also, I take it you can't run this natively? :) )

danmcd avatar Aug 25 '17 18:08 danmcd

Looking at BSD, I can't find a similar check in FreeBSD or NetBSD. But it looks like it exists in OpenBSD, though It's moved inside the SO_SNDBUF and SO_RCVBUF option cases.

I can modify the DTrace script to confirm that's the problem, if you wish.

I don't think that's necessary, at least not right now.

Also, I take it you can't run this natively?

I'd like to, but first I'd have to teach the build tool Bazel about other Unixen. Earlier this week I started down this road, but opted to run in an LX zone instead because I thought it would take too long to finish. :grimacing:

terinjokes avatar Aug 25 '17 19:08 terinjokes

I opened OS-6312 to track this.

jjelinek avatar Aug 25 '17 19:08 jjelinek

@danmcd For what its worth; Like @terinjokes we also spend a couple of days at running Envoy native on SmartOS / Illumos and also ended up with LX. Envoy is basically a really nice gRPC reverse proxy.

@jperkin I can't really say how big the community interest is but there seems to be atleast some gRPC community buzz around Envoy (the official gRPC docs also reference Envoy a couple of times); Might be worth looking into packaging for pkgsrc / SmartOS.

siepkes avatar Oct 16 '17 21:10 siepkes

@siepkes I haven't tried to upstream them, because I don't think they'll be accepted, but I have some patches to make Envoy work in LX zones for me. See https://git.terinstock.com/#/q/status:open+project:envoy

As for packaging it for native zones, Envoy uses Bazel, which has no concept of Solaris, illumos, or SmartOS, so it's probably a bit more work than a downstream packaging maintainer wants to bite off for now. (I checked out Bazel locally and been trying to create patches for adding illumos, but I don't fully understand the internals yet).

terinjokes avatar Oct 16 '17 21:10 terinjokes