TCP: better handling of server-closed connections
In https://github.com/evcc-io/evcc/discussions/12852#discussioncomment-12606558 we're running into problems with some Modbus devices on ModbusTCP:
- device is resource-limited
- it quickly closes idle connections (device timeout 4s)
- library will only notice the closed connection on the next send
What we notice is that it seems to take very long before the transaction is retried:
```
[grid1 ] TRACE 2025/03/24 16:59:06 modbus: send 00 0f 00 00 00 06 ff 03 04 36 00 01
[grid1 ] TRACE 2025/03/24 16:59:06 modbus: close connection and retry, because of EOF
[grid1 ] TRACE 2025/03/24 16:59:21 modbus: send 00 0f 00 00 00 06 ff 03 04 36 00 01
```
Does anyone have similar issues or experience with ModbusTCP against low-spec embedded devices? Are there best practices for (timeout) settings to improve handling of dropped connections?
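For reference, these are roughly the knobs in question. The sketch below assumes the goburrow-style handler API that `github.com/grid-x/modbus` inherited (`NewTCPClientHandler`, `NewClient`); the function name and the values are placeholders, not a recommendation:

```go
// newClient wires up a handler with the settings discussed in this thread.
// Hypothetical helper; values are illustrative only.
func newClient(addr string) modbus.Client {
	handler := modbus.NewTCPClientHandler(addr)
	handler.Timeout = 3 * time.Second     // per-transaction read/write deadline
	handler.IdleTimeout = 2 * time.Second // close our side before the device's 4s idle timeout
	handler.LinkRecoveryTimeout = 0       // skip the internal link-recovery sleep on dropped connections
	return modbus.NewClient(handler)
}
```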
I've added a bit more logging and found that, when the server closes the connection:
- the send on the (already closed) connection seems to succeed (not sure why this happens, TCP FIN/ACK/RST magic?)
- the response read fails in `readResponse` with `EOF`
- this causes `readResponse` to return `readResultCloseRetry`, which leads to `time.Sleep(mb.LinkRecoveryTimeout)`
Imho, in the case of `EOF`, `readResponse` should return `readResultRetry` and the connection should be retried immediately.
As a workaround, I've solved this by setting `mb.LinkRecoveryTimeout` to 0.
To be more specific: since `readResultRetry` applies to retrying without closing the connection, we should probably do something like:
mb.logf("modbus: close connection and retry, because of %v", err)
mb.close()
if err != io.EOF {
// make timeout conditional
time.Sleep(mb.LinkRecoveryTimeout)
}
I think that'd be a good idea. If the other end has closed the connection, we should probably retry connecting ASAP.
I remember observing writes succeeding on raw TCP connections too, even after the connection was already closed on the other end. I found a lengthy explanation about it on StackOverflow as well as in the golang-nuts group, but haven't had time to dig deeper.
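For anyone curious, the effect is easy to reproduce outside the library. A minimal standalone sketch (the immediate server-side `Close` stands in for the device's idle timeout):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Local stand-in for the device: accept one connection and close it
	// right away, like a server-side idle timeout.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		if conn, err := ln.Accept(); err == nil {
			conn.Close()
		}
	}()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond) // give the FIN time to arrive

	// The write is usually accepted by the local kernel even though the
	// peer has already closed; the error only surfaces later.
	_, werr := client.Write([]byte{0x00, 0x01, 0x02, 0x03})
	fmt.Println("write:", werr) // typically <nil>

	// The failure shows up on the read: io.EOF, or "connection reset by
	// peer" if the RST triggered by the write arrives first.
	buf := make([]byte, 16)
	_, rerr := client.Read(buf)
	fmt.Println("read:", rerr)
}
```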
I've just hit another instance, the infamous `connection reset by peer`/`broken pipe`: https://gosamples.dev/connection-reset-by-peer/.
When this happens, reading the response will fail in `readResponse`:
```go
if _, err = io.ReadFull(mb.conn, data[:tcpHeaderSize]); err == nil {
	...
} else if (err != io.EOF && err != io.ErrUnexpectedEOF) ||
	mb.LinkRecoveryTimeout == 0 || time.Until(recoveryDeadline) < 0 {
	return
}
```
Since this returns `readResultDone`, we'll just return the error and do nothing else, in particular not close the now-broken connection:
```go
switch res {
case readResultDone:
	if err == nil {
		mb.lastSuccessfulTransactionID = binary.BigEndian.Uint16(aduResponse)
	}
	return
case readResultRetry:
	continue
}
```
In turn, the following transaction will fail too (this time with a timeout) since it operates on a dead connection.
My feeling at this stage is that the entire TCP connection handling is faulty, based on false assumptions, and adds more complexity than value. I'll prototype simply closing the connection on any read/write error and keeping retries only for logical errors (maybe not even that...).
Here's the trimmed version I'll try validating: https://github.com/evcc-io/modbus/pull/14
```go
// Send sends data to server and ensures response length is greater than header length.
func (mb *tcpTransporter) Send(aduRequest []byte) ([]byte, error) {
	mb.mu.Lock()
	defer mb.mu.Unlock()

	var data [tcpMaxLength]byte

	// Establish a new connection if not connected
	if err := mb.connect(); err != nil {
		return nil, err
	}

	// Set timer to close when idle
	mb.lastActivity = time.Now()
	mb.startCloseTimer()

	// Set write and read timeout
	if mb.Timeout > 0 {
		if err := mb.conn.SetDeadline(mb.lastActivity.Add(mb.Timeout)); err != nil {
			return nil, err
		}
	}

	close := func(err error) {
		mb.logf("modbus: close connection: %v", err)
		mb.close()
	}

	// Send data
	mb.logf("modbus: send % x", aduRequest)
	if _, err := mb.conn.Write(aduRequest); err != nil {
		close(err)
		return nil, err
	}

	res, err := mb.readResponse(aduRequest, data[:])
	if err != nil {
		close(err)
	}

	return res, err
}

func (mb *tcpTransporter) readResponse(aduRequest []byte, data []byte) ([]byte, error) {
	_, err := io.ReadFull(mb.conn, data[:tcpHeaderSize])
	if err != nil {
		return nil, err
	}

	aduResponse, err := mb.processResponse(data[:])
	if err != nil {
		return nil, err
	}

	err = verify(aduRequest, aduResponse)
	if err == nil {
		mb.logf("modbus: recv % x\n", aduResponse)
	}

	return aduResponse, err
}
```
Maybe one of the maintainers could comment on why the TCP implementation originally contained this complexity at all? I'm halfway confident that a simple (external) backoff-based retry will perform just as well.
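By "external" I mean something at the caller level rather than inside the transporter, roughly like this (illustrative only; `withBackoff` and the attempt/backoff numbers are not part of the library):

```go
// withBackoff retries a single Modbus transaction with exponential backoff.
// tx stands for whatever one read/write the integration performs.
func withBackoff(tx func() ([]byte, error), attempts int) ([]byte, error) {
	var (
		res     []byte
		err     error
		backoff = 100 * time.Millisecond
	)
	for i := 0; i < attempts; i++ {
		if res, err = tx(); err == nil {
			return res, nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, err
}
```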
I'm afraid none of the original authors is still involved with this project; however, a few of us in the grid-x organization will take the time to review those changes if you want to submit them.
I unfortunately lost the ability to push PRs when we added the capability to share the connection between multiple clients for evcc. The change above is breaking but straightforward. If there's a desire to include it, I can do a PR manually.
First user feedback is positive: the simple external retry works much better, and especially faster, than the current fine-tuned logic.
(Sorry I pressed the wrong button, didn't mean to close.)
A lot of the complexity you see in this client was built upon real issues seen in the field, particularly with inverters, one patch after another:
- https://github.com/grid-x/modbus/commit/07ce43b97109f5ae583a6f661b47e592a0324b34
- https://github.com/grid-x/modbus/commit/33425482d645cbbecb598febbfe979fe609c3486
- https://github.com/grid-x/modbus/pull/49
- https://github.com/grid-x/modbus/pull/51
If we were to revert all of that like you did, multiple integrations would suddenly break at gridX. Ideally we would just refuse to support devices with poorly implemented Modbus interfaces and be able to keep this code much simpler, but we just cannot.
I would be supportive of simplifying/improving the handling of network errors, but reverting everything previous maintainers built to support edge cases will not happen in this codebase.
Hey EVCC dev team!
Thank you very much for updating the handling of TCP/IP connections and the error messages.
Just wanted to confirm that this is an issue with "cheap" wallboxes. I, for example, have a Webasto NEXT (now Ampure Next): a quite cheap wallbox with a very slow processor. The web interface is super slow, hardly usable. Ok, it does work.
I had a lot of Modbus problems and invested a huge amount into debugging, weeks and weeks. Non-deterministic connection drops are hard to tackle. In the end I found out that it's not even Modbus, but plainly and simply the TCP/IP stack of the wallbox (on the Modbus port 502) that breaks down when called "too often", which already means more than once per second:
```sh
for i in {1..10}; do echo "Try $i"; nc -zv <IP_OF_YOUR_WALLBOX> 502; sleep 1; done
```
```
Try 1
Connection to 192.168.178.24 502 port [tcp/*] succeeded!
Try 2
Connection to 192.168.178.24 502 port [tcp/*] succeeded!
Try 3
<===== crash / timeout
```
Then the wallbox needs to be restarted.
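What helped in the end was simply never talking to the box faster than about once per second, e.g. something like this on the polling side (a sketch only; `readOnce` is a placeholder and the interval is just what my device tolerates, not a general rule):

```go
// pollSlowly issues at most one transaction per second so the wallbox's
// TCP stack is never hit faster than it can cope with. readOnce stands in
// for whatever single Modbus read the integration performs.
func pollSlowly(readOnce func() ([]byte, error), stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if _, err := readOnce(); err != nil {
				log.Println("poll failed:", err)
			}
		}
	}
}
```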
Think you've commented in the wrong repo ☝🏻
@antoineco thanks for the shared context, especially https://github.com/grid-x/modbus/commit/07ce43b97109f5ae583a6f661b47e592a0324b34. It seems that the simplistic "close the connection on error" approach I'm trying to take solves each of the linked issues, albeit at the cost of losing and needing to retry a request. The SolarEdge issue sounds like it might not be solved by it, but at evcc we haven't seen it in the wild yet.
That said, I'd be happy with any other, less invasive solution to the "server-side closed connection" problem. Everything I've tried so far failed in one way or another and was generally much slower to recover, if it recovered at all.
> If we were to revert all of that like you did, multiple integrations would suddenly break at gridX.
I wouldn't expect that, of course. I still think there's a missing piece of better handling here. A good example, btw, is the Varta battery, which closes the connection after 4s. That should be something one could also build a good test case for?
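Roughly along these lines, maybe (just a sketch, nothing of this exists in the repo; the fake server mirrors the Varta-style 4s idle close):

```go
func TestServerClosesIdleConnection(t *testing.T) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	defer ln.Close()

	// Fake device: answers requests but drops the connection after 4s idle.
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 260)
				for {
					c.SetReadDeadline(time.Now().Add(4 * time.Second))
					n, err := c.Read(buf)
					if err != nil {
						return // idle timeout hit: close like the Varta does
					}
					// Echo the request; a real fake would build a proper
					// Modbus TCP response ADU instead.
					c.Write(buf[:n])
				}
			}(conn)
		}
	}()

	// The actual test would point the transporter at ln.Addr().String(),
	// run one transaction, sleep past the 4s window, and assert that the
	// next transaction still succeeds because the client reconnects.
}
```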