
Ability to override default timeouts

Open weili-jiang opened this issue 5 years ago • 1 comments

There are a number of timeouts that are currently defined as constants: https://aiocoap.readthedocs.io/en/latest/module/aiocoap.numbers.constants.html#aiocoap.numbers.constants.REQUEST_TIMEOUT

Some of those are not in the RFC at all. Even if they are, it would be nice to be able to set them on a per request or context level.

The motivation is that in a specific application network known to have relatively low latency (but packet loss), it may be desirable to have faster retries and custom timeouts.
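For context, the retransmission timing that these constants control can be sketched as follows. This is a minimal sketch based on the back-off rules in RFC 7252 (sections 4.2 and 4.8), not aiocoap's actual implementation; the helper names `retransmission_schedule` and `max_transmit_span` are made up for illustration.

```python
import random

# Default values from RFC 7252, section 4.8.
ACK_TIMEOUT = 2.0
ACK_RANDOM_FACTOR = 1.5
MAX_RETRANSMIT = 4


def retransmission_schedule(ack_timeout=ACK_TIMEOUT,
                            ack_random_factor=ACK_RANDOM_FACTOR,
                            max_retransmit=MAX_RETRANSMIT,
                            rng=random):
    """Return the offsets (seconds after the first transmission) at which
    a confirmable message would be retransmitted.

    Hypothetical helper for illustration only.
    """
    # The initial timeout is picked at random between ACK_TIMEOUT and
    # ACK_TIMEOUT * ACK_RANDOM_FACTOR.
    timeout = rng.uniform(ack_timeout, ack_timeout * ack_random_factor)
    offsets = []
    elapsed = 0.0
    for _ in range(max_retransmit):
        elapsed += timeout
        offsets.append(elapsed)
        timeout *= 2  # binary exponential back-off
    return offsets


def max_transmit_span(ack_timeout=ACK_TIMEOUT,
                      ack_random_factor=ACK_RANDOM_FACTOR,
                      max_retransmit=MAX_RETRANSMIT):
    """Worst-case time from the first transmission to the last retransmission."""
    return ack_timeout * (2 ** max_retransmit - 1) * ack_random_factor


if __name__ == "__main__":
    print(retransmission_schedule())
    print(max_transmit_span())
    print(max_transmit_span(ack_timeout=0.5))
```

With the defaults, the first retransmission happens 2 to 3 seconds after the initial transmission. Lowering `ack_timeout` shortens every step of the schedule, which is the kind of per-request or per-context override asked for here.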

weili-jiang avatar Oct 17 '19 19:10 weili-jiang

I concur. In my scenario I don't really need to change them per request (I'm not sure that would even make sense); I was thinking of a way to change those values globally for my scenario (IoT with sleepy end devices and huge latencies).

A very straightforward way of achieving this is to create a Constants class holding all of these values, i.e.:

class Constants:
    ACK_TIMEOUT = 10.0
    MAX_RETRANSMIT = 6
    ...

This would allow the application to change those values. However, it implies a heavy refactoring: inconsequential, but ubiquitous.

Is this something desirable for this library?

alexbarcelo avatar Jul 20 '21 11:07 alexbarcelo

I think we're starting to hit this for HA + homekit_controller & sleepy Thread + HAP accessories. The average ping time for one individual is ~2600ms, over the 2.0s ACK_TIMEOUT. Worse, the retransmits are shot down somewhere, maybe at the Border Router, killing the connection attempt entirely.

roysjosh avatar Nov 10 '22 18:11 roysjosh

On Thu, Nov 10, 2022 at 10:50:23AM -0800, Joshua Roys wrote:

I think we're starting to hit this for HA + homekit_controller & sleepy Thread + HAP accessories. The average ping time for one individual is ~2600ms, over the 2.0s ACK_TIMEOUT. Worse, the retransmits are shot down somewhere, maybe at the Border Router, killing the connection attempt entirely.

I'll have a look at it next time I get my hands on aiocoap (which might be some time given I'm a bit swamped ATM). But one thing in advance:

If retransmits are swallowed by a router, chances are you'll run into trouble no matter the timeout. (If the BR were acting as a proper intercepting proxy, it'd send an ACK and manage retransmits -- a behavior I'd even encourage if it were explicit and not intercepting). So even when this becomes configurable, please still look at what swallows the messages, or how the BR behaves.

chrysn avatar Nov 10 '22 18:11 chrysn


Thread border routers are physical, link, network, and transport layer devices, with IPv6 as the native Thread network layer (BRs can provide NAT to integrate with IPv4-only LANs) and otherwise being transparent for end-to-end IP communication. So there is no expectation that a BR would understand CoAP and provide proxy services for it.

That being said, Thread 1.3 added mandatory support for DNS service discovery and registration proxying (to avoid the costs of multicast mDNS on the Thread mesh network and to allow sleepy Thread devices to, well, sleep), so it may be the case that in the future other application protocols or network services will be specifically handled by BRs.

I don't have data proving or disproving that retransmits are shot down, but based on the Thread specification a well-behaved BR should not be filtering packets in such a manner (no deep packet inspection allowing application protocol-specific rate limiting or filtering). Of course, with UDP, there is no delivery guarantee, and that seems like a likely-enough explanation.

jfroy avatar Nov 15 '22 22:11 jfroy

I'm not sure whether it is the border router or perhaps a Thread router, but something is sending ICMPv6 "no route to host" errors back to HA. It appears to match up with the retransmit to a sleepy device, but I haven't been able to reproduce this with my small network of FTD nodes. I'm leaning towards a Thread router trying to indicate that it can't reach a child node...

2022-11-09 19:13:39.127 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:cfc3:554d:edca:32d4]:5683/2
2022-11-09 19:13:39.139 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:17d8:ee7d:c2ad:6eb6]:5683/2
2022-11-09 19:13:39.147 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:7935:fc6e:69a2:fe5f]:5683/2
2022-11-09 19:13:39.156 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:4b1b:ed70:dc3a:7438]:5683/2
2022-11-09 19:13:39.159 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:d350:15d9:7a39:fc10]:5683/2
2022-11-09 19:13:39.170 DEBUG (MainThread) [aiohomekit.controller.coap.connection] Pair verify uri=coap://[fdd8:9c7d:c2d1:0:ba37:3c9b:7899:87a5]:5683/2
2022-11-09 19:13:41.566 INFO (MainThread) [coap-server] Retransmission, Message ID: 52364.
2022-11-09 19:13:41.793 INFO (MainThread) [coap-server] Retransmission, Message ID: 33951.
2022-11-09 19:13:41.800 INFO (MainThread) [coap-server] Retransmission, Message ID: 50357.
2022-11-09 19:13:41.894 INFO (MainThread) [coap-server] Retransmission, Message ID: 10841.
2022-11-09 19:13:42.063 INFO (MainThread) [coap-server] Retransmission, Message ID: 42904.
2022-11-09 19:13:42.119 INFO (MainThread) [coap-server] Retransmission, Message ID: 56879.
2022-11-09 19:13:42.211 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable
2022-11-09 19:13:42.274 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable
2022-11-09 19:13:42.277 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable
2022-11-09 19:13:42.280 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable
2022-11-09 19:13:42.282 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable
2022-11-09 19:13:42.284 ERROR (MainThread) [coap-server] Error received and ignored in this codepath: [Errno 113] Host is unreachable

roysjosh avatar Nov 17 '22 23:11 roysjosh

Thread devices are required to implement Destination Unreachable (type 1) ICMPv6 messages (specifically RFC 4443 section 3.1), so any node may have sent that back. As you speculate, it is likely the border router, though I did not find code implementing that behavior in OpenThread.

Destination Unreachable (type 1) ICMPv6 messages with code 0 (No route to destination) are sent by FTDs when their EID (endpoint identifier)-to-RLOC (routing locator) cache contains an invalid entry. Both EIDs and RLOCs are IPv6 addresses, but EIDs are visible to applications and do not change for a given device even if the mesh topology changes. RLOCs are private IPv6 addresses used to actually deliver datagrams and do change when the mesh topology changes. However, I don't think you'd ever see those messages forwarded outside of the Thread mesh.

jfroy avatar Nov 18 '22 00:11 jfroy

Exploring how to fix this: The high-level messages aiocoap usually handles and the nitty-gritty details of transports are quite decoupled.

I'm leaning towards having a bunch of parameters in an object, similar to how @alexbarcelo suggested. These would take the current (module based) constants as defaults.

I'm not sure how to guide the selection of that object. How would you prefer to configure it, or how would you know which parameters to choose? Would it work to have these as hints on the message, so that the client sets these hints like it sets whether it would rather have the message sent as CON or NON? Would that work for the response as well? Would it be more practical to have a per-context configurable decision function that looks at the address (say, looks up whether the address is in a network known to be a Thread managed one) and decides which set of defaults to use?
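To make the last option concrete, a decision function of that kind could look roughly like the sketch below. Everything in it is an assumption for illustration: the `TuningParameters` object, the `select_tuning` function and the 6-second timeout are made up, and the network prefix is simply the one visible in the log excerpt earlier in this thread. It does not reflect any existing aiocoap API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningParameters:
    """Hypothetical parameter object; defaults are the RFC 7252 values."""
    ack_timeout: float = 2.0
    ack_random_factor: float = 1.5
    max_retransmit: int = 4


DEFAULT_TUNING = TuningParameters()
# Assumed value for sleepy devices; in practice it should come from measurements.
SLEEPY_TUNING = TuningParameters(ack_timeout=6.0)

# Assumption: the Thread mesh is reachable under this prefix.
THREAD_NETWORKS = [ipaddress.ip_network("fdd8:9c7d:c2d1::/64")]


def select_tuning(host: str) -> TuningParameters:
    """Pick a parameter set based on the remote address."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. a host name): fall back to the defaults.
        return DEFAULT_TUNING
    if any(address in network for network in THREAD_NETWORKS):
        return SLEEPY_TUNING
    return DEFAULT_TUNING
```

A context could call such a function once per outgoing message, so that application code sending the request would not need to know about the network layout.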

chrysn avatar Nov 21 '22 15:11 chrysn

Taking things up from another thread / @jc2k:

The exact sleep interval is available over the HAP protocol through characteristic 0000023A-0000-1000-8000-0026BB765291. For all my battery-powered Thread devices, it's 5s. But that's probably not representative.

Not being familiar with details of Thread I'll assume that the sleeping device is a server, and has been discovered and possibly been probed for that characteristic. Would you, then, consider it practical to pass a parameter object in with each request sent to a peer of which a sleep value is known?

chrysn avatar Nov 21 '22 16:11 chrysn

From an API usability POV, I'd probably want to defer to @roysjosh here as he did the hard work on this, it looks like we could work with that...

Jc2k avatar Nov 21 '22 16:11 Jc2k

Please have a look at https://github.com/chrysn/aiocoap/pull/294 to see whether that'd help with your use case.

The idea is that you'd subclass TransportTuning (e.g. to a form that takes ACK_TIMEOUT as an instance parameter) and then pass that into every message you send to a known-sleep node as the transport_tuning parameter.
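As a rough sketch of that idea: the `TransportTuning` class below is only a stand-in written for this example, on the assumption that the class from the pull request exposes the transmission parameters as attributes with the RFC 7252 defaults. The `SleepyTuning` subclass, its `margin` argument and the way the timeout is derived from the sleep interval are hypothetical as well.

```python
class TransportTuning:
    """Stand-in for the base class from the pull request.

    Assumption: parameters are plain attributes defaulting to the
    RFC 7252 values. The real class may look different.
    """
    ACK_TIMEOUT = 2.0
    ACK_RANDOM_FACTOR = 1.5
    MAX_RETRANSMIT = 4


class SleepyTuning(TransportTuning):
    """Hypothetical subclass that takes ACK_TIMEOUT as an instance parameter,
    derived from a peer's known sleep interval."""

    def __init__(self, sleep_interval, margin=1.0):
        # Wait at least one full sleep interval plus a margin before the
        # first retransmission, but never less than the default.
        self.ACK_TIMEOUT = max(TransportTuning.ACK_TIMEOUT,
                               sleep_interval + margin)


# Example: a device that reports a 5 s sleep interval.
tuning = SleepyTuning(sleep_interval=5.0)
print(tuning.ACK_TIMEOUT, tuning.MAX_RETRANSMIT)

# The object would then be handed over with each message sent to that
# peer as the transport_tuning parameter described above.
```

Only the overridden value lives on the instance; everything else falls through to the class-level defaults, so a subclass needs to name just the parameters it changes.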

chrysn avatar Nov 21 '22 18:11 chrysn

This was accidentally closed when I entered the wrong issue number into 96eeb675bd9a35fe04634c890b42a851b94145e8 -- that should have closed #288.

chrysn avatar Nov 21 '22 18:11 chrysn