domain icon indicating copy to clipboard operation
domain copied to clipboard

Improved handling of errors while sending TCP responses.

Open ximon18 opened this issue 9 months ago • 1 comments

This PR contains a couple of fixes for TCP connection error handling:

  • Abort a TCP connection if a fatal I/O error is detected when writing to the client, such as the client disconnected.
  • Stop processing a response stream if there is no space left in the response queue.

It also adds a metric for the number of aborted writes that occur.

These issues were discovered when dig disconnected because a large AXFR transfer was slow to start and the default dig 5 second timeout expired.

Interestingly these issues were easier to fix in the WIP xfr branch that I found them on as that is based on the service-layering branch which simplifies the response handling.

ximon18 avatar May 01 '24 10:05 ximon18

I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably treated as fatal – it means the socket has tried re-sending the data a couple times and then has waited.

Retrying when the configured response_write_timeout feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.

Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.

partim avatar May 06 '24 13:05 partim

I think we should a clear description what happens and how to reproduce. I don't think a tcp write is supposed to fail. So we have to know why it fails.

Philip-NLnetLabs avatar Oct 03 '24 14:10 Philip-NLnetLabs

I’m not sure if considering non-fatal errors with TCP is necessary. Both write_all and read_exact will already cover ErrorKind::Interrupted. ErrorKind::TimedOut should probably treated as fatal – it means the socket has tried re-sending the data a couple times and then has waited.

Retrying when the configured response_write_timeout feels a bit odd, given that this means the actual timeout is response_write_timeout * (response_write_retries + 1), which is a bit unexpected.

Looking up ErrorKind::TimedOut, I now even wonder if having your own write timeout makes sense or whether it is better to just rely on the TCP stack and its configuration.

Perhaps the response write timeout code should be removed, I don't know. It wasn't added in this PR however, so for now I leave that to think about, I haven't removed it.

ximon18 avatar Oct 03 '24 17:10 ximon18

I need to review this PR as I want to be sure that I didn't just accidentally throw out this fix:

Abort a TCP connection if a fatal I/O error is detected when writing to the client, such as the client disconnected.

Update: Yes that was accidentally removed. Naively adding it back in helps when CTRL-C'ing a large dig AXFR transfer preventing a large number of Write error: Broken pipe (os error 32) errors in the log, but instead there are then lots of Unable to queue message for sending: server is shutting down.` errors. That error should probably only be shown once, and I suspect the server isn't shutting down but the connection handler is so the message is also incorrect.

ximon18 avatar Oct 03 '24 17:10 ximon18

The connection is now terminated on error, and the logs of an aborted large XFR transfer look much better:

2024-10-03T19:12:46.301579Z ERROR ThreadId(10) domain::net::server::connection: Write error: Connection reset by peer (os error 104)
2024-10-03T19:12:46.301595Z  WARN ThreadId(10) domain::net::server::connection: Error while shutting down the write stream: Transport endpoint is not connected (os error 107)
2024-10-03T19:12:46.301767Z ERROR ThreadId(30) domain::net::server::connection: Unable to queue message for sending: connection is shutting down.
2024-10-03T19:12:46.302240Z ERROR ThreadId(13) domain::net::server::middleware::xfr::util: Failed to send DNS message to the internal response stream
2024-10-03T19:12:54.977167Z ERROR ThreadId(35) domain::net::server::middleware::xfr::axfr: Internal error: Failed to send final AXFR SOA to batcher: channel closed

ximon18 avatar Oct 03 '24 19:10 ximon18