go-coap icon indicating copy to clipboard operation
go-coap copied to clipboard

Could not hear from anything from one client after server running for a long time

Open swechencheng opened this issue 4 years ago • 5 comments

Hi,

I am working on a project which lets our CoAP server (based on your go-coap) talk with some (24) portable devices with tinydtls via a tunnel. We have observed very strange problem as the title says. The server became deaf on one particular device that is retrying to request for new DTLS connection to the server when the device realise that there is no response from the server. Here is the log: https://klistra.in/1joTmxz6 The client is called D00124B0023767B6E with address [fd00:200b:a013:0:212:4b00:2376:7b6e]:61616. We have a mux handler called handlePing which is the last thing the server hears and reacted on the client. Then the server became deaf on this client.

Our server is based on a slightly modified go-coap module: https://github.com/rtechencheng/go-coap It's 2 commits ahead of your master branch. One commit is just to add stack printing for debugging purpose. Another commits is to drop the clientConn when s.connection.ReadWithContext(s.Context(), readBuf) fails. The reason why we drop it is that we can see your code is close the session but not close the client connection. Maybe it is safer also drop the client connection like what you do in client connection Process() while writeMessage() fails. The server handlePing is:

func HandlePing(w mux.ResponseWriter, r *mux.Message) {
	deviceId := getDeviceId(w.Client().RemoteAddr())
	logging.LogDebugf("handlePing set response to %s", deviceId)
	err := w.SetResponse(codes.Created, message.AppCBOR, nil)
	if err != nil {
		logging.LogErrorf("cannot set ping response to %s: %v", deviceId, err)
	}
}

Our server configuration:

// Middleware function, which will be called for each request.
func loggingMiddleware(next mux.Handler) mux.Handler {
	return mux.HandlerFunc(func(w mux.ResponseWriter, r *mux.Message) {
		addr := w.Client().RemoteAddr()
		logging.LogInfof(
			"Incoming request from %s: %v, %s",
			getDeviceId(addr),
			addr,
			r.String(),
		)
		next.ServeCOAP(w, r)
	})
}

// Create another main coap serving function than the default coap.ListenAndServeDTLS
func listenAndServeDTLS(network string, addr string, config *piondtls.Config, handler mux.Handler) error {
	l, err := net.NewDTLSListener(network, addr, config)
	if err != nil {
		return err
	}
	defer l.Close()
	var checkCloseWg sync.WaitGroup
	defer checkCloseWg.Wait()
	s := dtls.NewServer(
		dtls.WithMux(handler),
		dtls.WithOnNewClientConn(func(cc *client.ClientConn, tlscon *piondtls.Conn) {
			checkCloseWg.Add(1)
			cc.AddOnClose(func() {
				checkCloseWg.Done()
			})
		}),
		dtls.WithInactivityMonitor(2*time.Hour, func(cc inactivity.ClientConn) { // TODO: make timeout configurable
			cc.Close()
		}),
		dtls.WithErrors(func(err error) {
			logging.LogErrorf(err.Error())
		}),
		dtls.WithTransmission(2*time.Second, 10*time.Second, 3), // Slows down the server a bit
	)
	return s.Serve(l)
}

func StartCoapServer(config ServerConfig, handler ForwardingHandler) error {
	dtlsPskByteArray, err := hex.DecodeString(config.DtlsPsk)
	if err != nil {
		logging.LogErrorf("unable to convert hex to byte: %s", err)
	}

	m := mux.NewRouter()
	m.Use(loggingMiddleware)
	m.DefaultHandle(handler)
	m.Handle("ping", mux.HandlerFunc(HandlePing))
	m.Handle("p", mux.HandlerFunc(HandlePing))

	return listenAndServeDTLS("udp", ":61616", &piondtls.Config{
		PSK: func(hint []byte) ([]byte, error) {
			fmt.Printf("Client's hint: %s \n", hint)
			if config.DtlsIdentity != string(hint) {
				// Fail the handshake if identity is wrong
				return []byte{}, nil
			}
			return dtlsPskByteArray, nil
		},
		// Don't provide any PSKIdentityHint
		CipherSuites: []piondtls.CipherSuiteID{piondtls.TLS_PSK_WITH_AES_128_CCM_8},
	}, m)
}

We need to run for hours to catch this bug. We don't know what is the status of the server while it becomes deaf on the client. We did also try based on your v2.4.0 and that is worse because we can see multiple devices dropped (not able to be heard) from the server. So this server is based on your master with improvement in ctx and message pool handling.

Do you have any idea how I can further debug this? Thanks in advance!

Regards, Chencheng

swechencheng avatar Sep 22 '21 14:09 swechencheng

Also a memory map on the machine regarding the server process (not sure if it is helpful):

# cat /proc/715/maps
4000000000-4004000000 rw-p 00000000 00:00 0
558b560000-558b6a4000 r-xp 00000000 00:13 144                            /usr/bin/coap-server
558b6bd000-558b750000 r--p 0014d000 00:13 144                            /usr/bin/coap-server
558b750000-558b752000 rw-p 001e0000 00:13 144                            /usr/bin/coap-server
558b752000-558b753000 rw-p 00000000 00:00 0
55c8119000-55c813a000 rw-p 00000000 00:00 0                              [heap]
7f50000000-7f50021000 rw-p 00000000 00:00 0
7f50021000-7f54000000 ---p 00000000 00:00 0
7f58000000-7f58021000 rw-p 00000000 00:00 0
7f58021000-7f5c000000 ---p 00000000 00:00 0
7f5c000000-7f5c021000 rw-p 00000000 00:00 0
7f5c021000-7f60000000 ---p 00000000 00:00 0
7f60000000-7f60021000 rw-p 00000000 00:00 0
7f60021000-7f64000000 ---p 00000000 00:00 0
7f64000000-7f64021000 rw-p 00000000 00:00 0
7f64021000-7f68000000 ---p 00000000 00:00 0
7f68000000-7f68021000 rw-p 00000000 00:00 0
7f68021000-7f6c000000 ---p 00000000 00:00 0
7f6cd00000-7f6cd10000 ---p 00000000 00:00 0
7f6cd10000-7f6d7b0000 rw-p 00000000 00:00 0
7f6d7b0000-7f6d7c0000 ---p 00000000 00:00 0
7f6d7c0000-7f6dfc0000 rw-p 00000000 00:00 0
7f6dfc0000-7f6dfd0000 ---p 00000000 00:00 0
7f6dfd0000-7f6e7d0000 rw-p 00000000 00:00 0
7f6e7d0000-7f6e7e0000 ---p 00000000 00:00 0
7f6e7e0000-7f6efe0000 rw-p 00000000 00:00 0
7f6efe0000-7f6eff0000 ---p 00000000 00:00 0
7f6eff0000-7f6f7f0000 rw-p 00000000 00:00 0
7f6f7f0000-7f6f800000 ---p 00000000 00:00 0
7f6f800000-7f70000000 rw-p 00000000 00:00 0
7f70000000-7f70021000 rw-p 00000000 00:00 0
7f70021000-7f74000000 ---p 00000000 00:00 0
7f74018000-7f740b8000 rw-p 00000000 00:00 0
7f740b8000-7f740c8000 ---p 00000000 00:00 0
7f740c8000-7f76bf9000 rw-p 00000000 00:00 0
7f76bf9000-7f76c79000 ---p 00000000 00:00 0
7f76c79000-7f76c7a000 rw-p 00000000 00:00 0
7f76c7a000-7f96c09000 ---p 00000000 00:00 0
7f96c09000-7f96c0a000 rw-p 00000000 00:00 0
7f96c0a000-7f9abfb000 ---p 00000000 00:00 0
7f9abfb000-7f9abfc000 rw-p 00000000 00:00 0
7f9abfc000-7f9b3f9000 ---p 00000000 00:00 0
7f9b3f9000-7f9b3fa000 rw-p 00000000 00:00 0
7f9b3fa000-7f9b4f9000 ---p 00000000 00:00 0
7f9b4f9000-7f9b559000 rw-p 00000000 00:00 0
7f9b559000-7f9b561000 r-xp 00000000 b3:02 656                            /lib/libcap.so.2.48
7f9b561000-7f9b570000 ---p 00008000 b3:02 656                            /lib/libcap.so.2.48
7f9b570000-7f9b571000 r--p 00007000 b3:02 656                            /lib/libcap.so.2.48
7f9b571000-7f9b572000 rw-p 00008000 b3:02 656                            /lib/libcap.so.2.48
7f9b572000-7f9b596000 r-xp 00000000 b3:02 3997                           /usr/lib/liblzma.so.5.2.5
7f9b596000-7f9b5a5000 ---p 00024000 b3:02 3997                           /usr/lib/liblzma.so.5.2.5
7f9b5a5000-7f9b5a6000 r--p 00023000 b3:02 3997                           /usr/lib/liblzma.so.5.2.5
7f9b5a6000-7f9b5a7000 rw-p 00024000 b3:02 3997                           /usr/lib/liblzma.so.5.2.5
7f9b5a7000-7f9b5ae000 r-xp 00000000 b3:02 687                            /lib/librt-2.33.so
7f9b5ae000-7f9b5bd000 ---p 00007000 b3:02 687                            /lib/librt-2.33.so
7f9b5bd000-7f9b5be000 r--p 00006000 b3:02 687                            /lib/librt-2.33.so
7f9b5be000-7f9b5bf000 rw-p 00007000 b3:02 687                            /lib/librt-2.33.so
7f9b5bf000-7f9b5d9000 r-xp 00000000 b3:02 683                            /lib/libpthread-2.33.so
7f9b5d9000-7f9b5e8000 ---p 0001a000 b3:02 683                            /lib/libpthread-2.33.so
7f9b5e8000-7f9b5e9000 r--p 00019000 b3:02 683                            /lib/libpthread-2.33.so
7f9b5e9000-7f9b5ea000 rw-p 0001a000 b3:02 683                            /lib/libpthread-2.33.so
7f9b5ea000-7f9b5ee000 rw-p 00000000 00:00 0
7f9b5ee000-7f9b5f1000 r-xp 00000000 b3:02 659                            /lib/libdl-2.33.so
7f9b5f1000-7f9b600000 ---p 00003000 b3:02 659                            /lib/libdl-2.33.so
7f9b600000-7f9b601000 r--p 00002000 b3:02 659                            /lib/libdl-2.33.so
7f9b601000-7f9b602000 rw-p 00003000 b3:02 659                            /lib/libdl-2.33.so
7f9b602000-7f9b759000 r-xp 00000000 b3:02 651                            /lib/libc-2.33.so
7f9b759000-7f9b769000 ---p 00157000 b3:02 651                            /lib/libc-2.33.so
7f9b769000-7f9b76c000 r--p 00157000 b3:02 651                            /lib/libc-2.33.so
7f9b76c000-7f9b76f000 rw-p 0015a000 b3:02 651                            /lib/libc-2.33.so
7f9b76f000-7f9b777000 rw-p 00000000 00:00 0
7f9b777000-7f9b83c000 r-xp 00000000 b3:02 696                            /lib/libsystemd.so.0.30.0
7f9b83c000-7f9b84b000 ---p 000c5000 b3:02 696                            /lib/libsystemd.so.0.30.0
7f9b84b000-7f9b84f000 r--p 000c4000 b3:02 696                            /lib/libsystemd.so.0.30.0
7f9b84f000-7f9b850000 rw-p 000c8000 b3:02 696                            /lib/libsystemd.so.0.30.0
7f9b850000-7f9b851000 rw-p 00000000 00:00 0
7f9b851000-7f9ccbf000 r-xp 00000000 00:13 121                            /usr/lib/go/pkg/linux_arm64_dynlink/libstd.so
7f9ccbf000-7f9ccd6000 ---p 0146e000 00:13 121                            /usr/lib/go/pkg/linux_arm64_dynlink/libstd.so
7f9ccd6000-7f9d541000 r--p 01475000 00:13 121                            /usr/lib/go/pkg/linux_arm64_dynlink/libstd.so
7f9d541000-7f9d5c1000 rw-p 01ce0000 00:13 121                            /usr/lib/go/pkg/linux_arm64_dynlink/libstd.so
7f9d5c1000-7f9e051000 rw-p 00000000 00:00 0
7f9e051000-7f9e075000 r-xp 00000000 b3:02 643                            /lib/ld-2.33.so
7f9e079000-7f9e07d000 rw-p 00000000 00:00 0
7f9e07f000-7f9e081000 rw-p 00000000 00:00 0
7f9e081000-7f9e083000 r--p 00000000 00:00 0                              [vvar]
7f9e083000-7f9e084000 r-xp 00000000 00:00 0                              [vdso]
7f9e084000-7f9e086000 r--p 00023000 b3:02 643                            /lib/ld-2.33.so
7f9e086000-7f9e088000 rw-p 00025000 b3:02 643                            /lib/ld-2.33.so
7ff0002000-7ff0023000 rw-p 00000000 00:00 0                              [stack]

swechencheng avatar Sep 22 '21 14:09 swechencheng

As the time flies, there are more clients that could not reach server now: D00124B0023767B6E D00124B0023767951 D00124B002376813A Server log is here: https://klistra.in/ADdFM3qn

swechencheng avatar Sep 22 '21 20:09 swechencheng

@rtechencheng The second commit with dropping connection is not needed because the connection will be closed by deferred function in session.Run().

From my point of view, it seems that the issue is in pions/dtls package. Could you replicate the issue with logger (trace level) at the piondtls.Config? something like:

logger := logging.NewDefaultLoggerFactory()
logger.DefaultLogLevel = logging.LogLevelTrace
listenAndServeDTLS("udp", ":61616", &piondtls.Config{
		PSK: func(hint []byte) ([]byte, error) {
			fmt.Printf("Client's hint: %s \n", hint)
			if config.DtlsIdentity != string(hint) {
				// Fail the handshake if identity is wrong
				return []byte{}, nil
			}
			return dtlsPskByteArray, nil
		},
		// Don't provide any PSKIdentityHint
		CipherSuites: []piondtls.CipherSuiteID{piondtls.TLS_PSK_WITH_AES_128_CCM_8},
		LoggerFactory: logger,
	}, m)

And then pls provide the log file. Thx

jkralik avatar Sep 28 '21 18:09 jkralik

@jkralik Thanks! I have made a replication with your suggestions. I have some modification on your go-coap as well as pion/dtls due to the limitation of our client: https://github.com/rtechencheng/dtls/tree/develop (There is a removal of replay detection which reverts this commit) https://github.com/rtechencheng/go-coap/tree/develop (There is a change on UDP response. Though there is another issue that does not reply with payload, which needs to be fixed.)

The device of interest (lost) is D00124B0023767B67 (fd00:200b:a013::212:4b00:2376:7b67) Time of interest is 05:11:07 Log from server is here: https://klistra.in/SE9XW1kq Log from our tunnel regarding all IP packets is here: https://klistra.in/EwlIrRXD So you can see 7b67 is still trying to handshake with server however get no response. Also a snapshot of a wireshark capture: coap-wireshark-0929

BR/Chencheng

swechencheng avatar Sep 29 '21 07:09 swechencheng

Just to provide one more sample. Devices of interest (lost) are: D00124B00237686E2 D00124B0023767815 D00124B0023766899 D00124B00237672E3

Log of server: green-coap-server-09292200.log Log of tunnel traffic: green-br-log-09292200.log

swechencheng avatar Sep 30 '21 11:09 swechencheng

Feel free reopen if you have issue with version v3.

jkralik avatar Aug 29 '23 20:08 jkralik