TCP: better handling of server-closed connections
In https://github.com/evcc-io/evcc/discussions/12852#discussioncomment-12606558 we're running into problems with some Modbus devices on ModbusTCP:
- device is resource-limited
- it quickly closes idle connections (device timeout 4s)
- library will only notice the closed connection on the next send
What we notice is that it seems to take very long before the transaction is retried:
```
[grid1 ] TRACE 2025/03/24 16:59:06 modbus: send 00 0f 00 00 00 06 ff 03 04 36 00 01
[grid1 ] TRACE 2025/03/24 16:59:06 modbus: close connection and retry, because of EOF
[grid1 ] TRACE 2025/03/24 16:59:21 modbus: send 00 0f 00 00 00 06 ff 03 04 36 00 01
```
Does anyone have similar issues or experience with ModbusTCP against low-spec embedded devices? Are there best practices for (timeout) settings to improve handling of dropped connections?
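For reference, these are roughly the knobs in question. The sketch below assumes the goburrow-style handler API that `github.com/grid-x/modbus` inherited (`NewTCPClientHandler`, `NewClient`); the function name and the values are placeholders, not a recommendation:

```go
// newClient wires up a handler with the settings discussed in this thread.
// Hypothetical helper; values are illustrative only.
func newClient(addr string) modbus.Client {
	handler := modbus.NewTCPClientHandler(addr)
	handler.Timeout = 3 * time.Second     // per-transaction read/write deadline
	handler.IdleTimeout = 2 * time.Second // close our side before the device's 4s idle timeout
	handler.LinkRecoveryTimeout = 0       // skip the internal link-recovery sleep on dropped connections
	return modbus.NewClient(handler)
}
```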
I've added a bit more logging and found that, when the server closes the connection:
- the send on the (already closed) connection seems to succeed (not sure why this happens, TCP FIN/ACK/RST magic?)
- the response read fails in `readResponse` with `EOF`
- this causes `readResponse` to return `readResultCloseRetry`, which leads to `time.Sleep(mb.LinkRecoveryTimeout)`
Imho, in the case of `EOF`, `readResponse` should return `readResultRetry` and the connection should be retried immediately.
As a workaround, I've solved this by setting `mb.LinkRecoveryTimeout` to 0.
To be more specific: since `readResultRetry` applies to retrying without closing the connection, we should probably do something like:
mb.logf("modbus: close connection and retry, because of %v", err)
mb.close()
if err != io.EOF {
// make timeout conditional
time.Sleep(mb.LinkRecoveryTimeout)
}
I think that'd be a good idea. If the other end has closed the connection, we should probably retry connecting ASAP.
I remember observing writes succeeding on raw TCP connections too, even after the connection was already closed on the other end. I found a lengthy explanation about it on StackOverflow as well as in the golang-nuts group, but haven't had time to dig deeper.
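For anyone curious, the effect is easy to reproduce outside the library. A minimal standalone sketch (the immediate server-side `Close` stands in for the device's idle timeout):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Local stand-in for the device: accept one connection and close it
	// right away, like a server-side idle timeout.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		if conn, err := ln.Accept(); err == nil {
			conn.Close()
		}
	}()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	time.Sleep(100 * time.Millisecond) // give the FIN time to arrive

	// The write is usually accepted by the local kernel even though the
	// peer has already closed; the error only surfaces later.
	_, werr := client.Write([]byte{0x00, 0x01, 0x02, 0x03})
	fmt.Println("write:", werr) // typically <nil>

	// The failure shows up on the read: io.EOF, or "connection reset by
	// peer" if the RST triggered by the write arrives first.
	buf := make([]byte, 16)
	_, rerr := client.Read(buf)
	fmt.Println("read:", rerr)
}
```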
I've just hit another instance, the infamous `connection reset by peer`/`broken pipe`: https://gosamples.dev/connection-reset-by-peer/.
When this happens, reading the response will fail in `readResponse`:
```go
if _, err = io.ReadFull(mb.conn, data[:tcpHeaderSize]); err == nil {
	...
} else if (err != io.EOF && err != io.ErrUnexpectedEOF) ||
	mb.LinkRecoveryTimeout == 0 || time.Until(recoveryDeadline) < 0 {
	return
}
```
Since this returns `readResultDone`, we'll just return the error and do nothing else, in particular not close the now-broken connection:
```go
switch res {
case readResultDone:
	if err == nil {
		mb.lastSuccessfulTransactionID = binary.BigEndian.Uint16(aduResponse)
	}
	return
case readResultRetry:
	continue
}
```
In turn, the following transaction will fail too (this time with a timeout) since it operates on a dead connection.
My feeling at this stage is that the entire TCP connection handling is faulty, based on false assumptions, and adds more complexity than value. I'll prototype simply closing the connection on any read/write error and keeping retries only for logical errors (maybe not even that...).
Here's the trimmed version I'll try validating: https://github.com/evcc-io/modbus/pull/14
```go
// Send sends data to server and ensures response length is greater than header length.
func (mb *tcpTransporter) Send(aduRequest []byte) ([]byte, error) {
	mb.mu.Lock()
	defer mb.mu.Unlock()

	var data [tcpMaxLength]byte

	// Establish a new connection if not connected
	if err := mb.connect(); err != nil {
		return nil, err
	}

	// Set timer to close when idle
	mb.lastActivity = time.Now()
	mb.startCloseTimer()

	// Set write and read timeout
	if mb.Timeout > 0 {
		if err := mb.conn.SetDeadline(mb.lastActivity.Add(mb.Timeout)); err != nil {
			return nil, err
		}
	}

	close := func(err error) {
		mb.logf("modbus: close connection: %v", err)
		mb.close()
	}

	// Send data
	mb.logf("modbus: send % x", aduRequest)
	if _, err := mb.conn.Write(aduRequest); err != nil {
		close(err)
		return nil, err
	}

	res, err := mb.readResponse(aduRequest, data[:])
	if err != nil {
		close(err)
	}

	return res, err
}

func (mb *tcpTransporter) readResponse(aduRequest []byte, data []byte) ([]byte, error) {
	_, err := io.ReadFull(mb.conn, data[:tcpHeaderSize])
	if err != nil {
		return nil, err
	}

	aduResponse, err := mb.processResponse(data[:])
	if err != nil {
		return nil, err
	}

	err = verify(aduRequest, aduResponse)
	if err == nil {
		mb.logf("modbus: recv % x\n", aduResponse)
	}

	return aduResponse, err
}
```
Maybe one of the maintainers could comment on why the TCP implementation originally contained this complexity at all? I'm halfway confident that a simple (external) backoff-based retry will perform just as well.
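By "external" I mean something at the caller level rather than inside the transporter, roughly like this (illustrative only; `withBackoff` and the attempt/backoff numbers are not part of the library):

```go
// withBackoff retries a single Modbus transaction with exponential backoff.
// tx stands for whatever one read/write the integration performs.
func withBackoff(tx func() ([]byte, error), attempts int) ([]byte, error) {
	var (
		res     []byte
		err     error
		backoff = 100 * time.Millisecond
	)
	for i := 0; i < attempts; i++ {
		if res, err = tx(); err == nil {
			return res, nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, err
}
```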
I'm afraid none of the original authors is still involved with this project; however, a few of us in the grid-x organization will take the time to review those changes if you want to submit them.
I unfortunately lost the ability to push PRs when we added the capability to share the connection between multiple clients for evcc. The change above is breaking but straightforward. If there's a desire to include it, I can do a PR manually.
First user feedback is positive: the simple external retry works much better, and especially faster, than the current fine-tuned logic.
(Sorry I pressed the wrong button, didn't mean to close.)
A lot of the complexity you see in this client was built upon real issues seen in the field, particularly with inverters, one patch after another:
- https://github.com/grid-x/modbus/commit/07ce43b97109f5ae583a6f661b47e592a0324b34
- https://github.com/grid-x/modbus/commit/33425482d645cbbecb598febbfe979fe609c3486
- https://github.com/grid-x/modbus/pull/49
- https://github.com/grid-x/modbus/pull/51
If we were to revert all of that like you did, multiple integrations would suddenly break at gridX. Ideally we would just refuse to support devices with poorly implemented Modbus interfaces and be able to keep this code much simpler, but we just cannot.
I would be supportive of simplifying/improving the handling of network errors, but reverting everything previous maintainers built to support edge cases will not happen in this codebase.
Hey EVCC dev team!
Thank you very much for updating the handling of TCP/IP connections and the error messages.
Just wanted to confirm that this is an issue with "cheap" wallboxes. I, for example, have a Webasto NEXT (now Ampure Next): a quite cheap wallbox with a very slow processor. The web interface is super slow, hardly usable. Ok, it does work.
I had a lot of Modbus problems and invested a huge amount into debugging, weeks and weeks. Non-deterministic connection drops are hard to tackle. In the end I found out that it's not even Modbus, but plainly and simply the TCP/IP stack of the wallbox (on the Modbus port 502) that breaks down when called "too often", which already means more than once per second:
```sh
for i in {1..10}; do echo "Try $i"; nc -zv <IP_OF_YOUR_WALLBOX> 502; sleep 1; done
```
```
Try 1
Connection to 192.168.178.24 502 port [tcp/*] succeeded!
Try 2
Connection to 192.168.178.24 502 port [tcp/*] succeeded!
Try 3
<===== crash / timeout
```
Then the wallbox needs to be restarted.
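What helped in the end was simply never talking to the box faster than about once per second, e.g. something like this on the polling side (a sketch only; `readOnce` is a placeholder and the interval is just what my device tolerates, not a general rule):

```go
// pollSlowly issues at most one transaction per second so the wallbox's
// TCP stack is never hit faster than it can cope with. readOnce stands in
// for whatever single Modbus read the integration performs.
func pollSlowly(readOnce func() ([]byte, error), stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if _, err := readOnce(); err != nil {
				log.Println("poll failed:", err)
			}
		}
	}
}
```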
Think you've commented in the wrong repo ☝🏻
@antoineco thanks for the shared context, especially https://github.com/grid-x/modbus/commit/07ce43b97109f5ae583a6f661b47e592a0324b34. It seems that the simplistic "close the connection on error" approach I'm trying to take solves each of the linked issues, albeit at the cost of losing and needing to retry a request. The SolarEdge issue sounds like it might not be solved by it, but at evcc we haven't seen it in the wild yet.
That said, I'd be happy with any other, less invasive solution to the "server-side closed connection" problem. Everything I've tried so far failed in one way or another and was generally much slower to recover, if it recovered at all.
> If we were to revert all of that like you did, multiple integrations would suddenly break at gridX.
I wouldn't expect that, of course. I still think there's a missing piece of better handling here. A good example, btw, is the Varta battery, which closes the connection after 4s. That should be something one could also build a good test case for?
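Roughly along these lines, maybe (just a sketch, nothing of this exists in the repo; the fake server mirrors the Varta-style 4s idle close):

```go
func TestServerClosesIdleConnection(t *testing.T) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	defer ln.Close()

	// Fake device: answers requests but drops the connection after 4s idle.
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				buf := make([]byte, 260)
				for {
					c.SetReadDeadline(time.Now().Add(4 * time.Second))
					n, err := c.Read(buf)
					if err != nil {
						return // idle timeout hit: close like the Varta does
					}
					// Echo the request; a real fake would build a proper
					// Modbus TCP response ADU instead.
					c.Write(buf[:n])
				}
			}(conn)
		}
	}()

	// The actual test would point the transporter at ln.Addr().String(),
	// run one transaction, sleep past the 4s window, and assert that the
	// next transaction still succeeds because the client reconnects.
}
```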