UDP interconnect: packet lost when sending EOS causes "ERROR: interconnect encountered a network error"

Open wuyuhao28 opened this issue 3 years ago • 34 comments

Bug Report

We encounter this bug when the Greenplum cluster is large and the network is busy, so it is hard to reproduce. When the problem happened, we debugged the sender slice on one segment and found its stack stuck in SendEosUDPIFC(), waiting for ACKs from the receivers; it eventually reports an ERROR after the timeout. The error message is: "Failed to send packet (seq 1) to ip:50505 (pid 2126628 cid 7) after 3566 retries in 3600 seconds". While the sender slice was waiting for ACKs, the receiver slice had already finished its work and its state had turned to 'idle'.

I think this issue is caused by different UDP send behaviors:

  1. In sendOnce(), we check the sendto() return value and retry the send if necessary; see ic_udpifc.c:4552:
xmit_retry:
	n = sendto(pEntry->txfd, buf->pkt, buf->pkt->len, 0,
			   (struct sockaddr *) &conn->peer, conn->peer_len);
	if (n < 0)
	{
		if (errno == EINTR)
			goto xmit_retry;

		if (errno == EAGAIN)	/* no space ? not an error. */
			return;

		/* ... */
	}
  2. In sendControlMessage(), a sendto() failure is not handled; see ic_udpifc.c:1778:
static inline void
sendControlMessage(icpkthdr *pkt, int fd, struct sockaddr *addr, socklen_t peerLen)
{
	int			n;

#ifdef USE_ASSERT_CHECKING
	if (testmode_inject_fault(gp_udpic_dropacks_percent))
	{
#ifdef AMS_VERBOSE_LOGGING
		write_log("THROW CONTROL MESSAGE with seq %d extraSeq %d srcpid %d despid %d", pkt->seq, pkt->extraSeq, pkt->srcPid, pkt->dstPid);
#endif
		return;
	}
#endif

	/* Add CRC for the control message. */
	if (gp_interconnect_full_crc)
		addCRC(pkt);

	n = sendto(fd, (const char *) pkt, pkt->len, 0, addr, peerLen);

	/*
	 * No need to handle EAGAIN here: no-space just means that we dropped the
	 * packet: our ordinary retransmit mechanism will handle that case
	 */

	if (n < pkt->len)
		write_log("sendcontrolmessage: got error %d errno %d seq %d", n, errno, pkt->seq);
}

The receiver slice receives the sender slice's EOS (sent via sendOnce()) and sends an ACK back to the sender slice (via sendControlMessage()) without checking the sendto() result. When the network is unreliable, this can lead to a situation where the receiver has definitely received the EOS and quit the Motion, but the sender slice never receives the ACK, so it polls for ACKs endlessly and cannot quit the Motion.
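To make the suspected sequence concrete, here is a toy model of the handshake. It is not Greenplum code: the function name and both flags are made up for illustration, and it assumes the EOS itself always arrives while only the ACK can be lost. The `reack_after_exit` flag models a hypothetical policy in which a receiver that has already left the Motion still answers a duplicate EOS.

```c
#include <stdbool.h>

/*
 * Toy model, not Greenplum code. Returns the number of EOS transmissions
 * the sender needed before it saw an ACK, or -1 if it gave up after
 * max_retries attempts.
 *
 * drop_first_ack:   the receiver's single ACK is lost (for example, its
 *                   sendto() failed and nothing resent it).
 * reack_after_exit: hypothetical policy where a receiver that has already
 *                   left the Motion still answers a duplicate EOS.
 */
static int
simulate_eos_handshake(bool drop_first_ack, bool reack_after_exit,
					   int max_retries)
{
	bool		receiver_in_motion = true;

	for (int attempt = 1; attempt <= max_retries; attempt++)
	{
		bool		ack_delivered = false;

		/* Sender (re)transmits EOS; assume the EOS itself always arrives. */
		if (receiver_in_motion)
		{
			/* First EOS: the receiver acks once, then leaves the Motion. */
			ack_delivered = !drop_first_ack;
			receiver_in_motion = false;
		}
		else if (reack_after_exit)
		{
			/* Duplicate EOS is answered even though the receiver is idle. */
			ack_delivered = true;
		}

		if (ack_delivered)
			return attempt;
	}

	return -1;					/* sender timed out waiting for the ACK */
}
```

In this model, a lost first ACK with no re-ACK makes the sender exhaust all of its retries, which matches the symptom described above. Whether the real receiver behaves this way after leaving the Motion is exactly what this issue is asking about.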

So why does sendControlMessage() not check the sendto() return value and retry? Can this bug be avoided?
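As a sketch of what such a check might look like, the helper below retries sendto() on EINTR and tells the caller whether the datagram was sent, dropped for lack of buffer space, or failed outright. The names `send_control_with_retry` and `ctrl_send_result` and the `max_eintr_retries` parameter are hypothetical and do not exist in ic_udpifc.c. It is only an illustration of the idea, not a proposed patch.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical result codes; not part of ic_udpifc.c. */
typedef enum
{
	CTRL_SENT = 0,				/* whole datagram handed to the kernel */
	CTRL_DROPPED_NO_SPACE,		/* EAGAIN/EWOULDBLOCK: caller may resend later */
	CTRL_FAILED					/* any other error, or a short send */
} ctrl_send_result;

/*
 * Send one control datagram, retrying only when sendto() is interrupted
 * by a signal. The outcome is returned so the caller can decide whether
 * to schedule another attempt.
 */
static ctrl_send_result
send_control_with_retry(int fd, const void *pkt, size_t len,
						const struct sockaddr *addr, socklen_t addr_len,
						int max_eintr_retries)
{
	int			attempts = 0;

	for (;;)
	{
		ssize_t		n = sendto(fd, pkt, len, 0, addr, addr_len);

		if (n == (ssize_t) len)
			return CTRL_SENT;

		/* Interrupted by a signal: try again, up to the given limit. */
		if (n < 0 && errno == EINTR && attempts++ < max_eintr_retries)
			continue;

		/* Socket buffer full: the packet was not sent. */
		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			return CTRL_DROPPED_NO_SPACE;

		return CTRL_FAILED;
	}
}
```

A caller could log and requeue the ACK on CTRL_DROPPED_NO_SPACE instead of relying on the peer to retransmit. Note that a successful sendto() only means the kernel accepted the datagram; UDP can still lose it in the network, so this would reduce the problem rather than eliminate it.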

Greenplum version or build

6x_stable

OS version and uname -a

CentOS 6

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

Actual behavior

Step to reproduce the behavior

wuyuhao28 • Dec 28 '21 13:12