aiocoap notification sending fails

I am using aiocoap to develop a CoAP server. Some resources are observable. When the link is temporarily down and a notification is sent to a client, I get an exception like this in the logs:

Jan 18 17:45:11 klk972 coap-server An exception occurred while rendering a resource: OSError(9, 'Bad file descriptor')
Traceback (most recent call last):
  File "/home/cro/utilc/keros/aiocoap/aiocoap/protocol.py", line 368, in _render_to_plumbing_request
    await self._render_to_plumbing_request_inner(plumbing_request,
  File "/home/cro/utilc/keros/aiocoap/aiocoap/protocol.py", line 557, in _render_to_plumbing_request_inner
    response = await self.serversite.render(request)
  File "/home/cro/utilc/keros/aiocoap/aiocoap/resource.py", line 362, in render
    child, subrequest = self._find_child_and_pathstripped_message(request)
  File "/home/cro/utilc/keros/aiocoap/aiocoap/resource.py", line 335, in _find_child_and_pathstripped_message
    request.get_request_uri(local_is_server=True))
  File "/home/cro/utilc/keros/aiocoap/aiocoap/message.py", line 443, in get_request_uri
    netloc = refmsg.remote.hostinfo_local
  File "/home/cro/utilc/keros/aiocoap/aiocoap/transports/tcp.py", line 268, in hostinfo_local
    host, port, *_ = self._transport.get_extra_info('socket').getsockname()
  File "/usr/lib/python3.8/asyncio/trsock.py", line 88, in getsockname
    return self._sock.getsockname()
OSError: [Errno 9] Bad file descriptor

This is expected, since the underlying socket is dead. But the notification is definitely lost, and the application (on top of the aiocoap library) is not aware that the notification was not sent successfully. When the link comes back, this notification is not re-sent.

I see two ways to handle this problem:

When calling updated_state, the application passes a coroutine. The aiocoap library calls it if sending the notification fails. This may be difficult if there is more than one client to notify. That's not my case, but I don't know what should happen if only some of the clients are notified.
The aiocoap library handles retransmission of notifications by itself. In this case, the questions are: when should retransmissions be done, and when do we stop trying to retransmit?

Am I right in saying that something is missing in the aiocoap library to do that, or is there an existing way I didn't see? If something has to be done in aiocoap, what is the best way to do it: a callback, retransmissions handled in the library, or something else?

Best Regards,

Christophe Ronco

Jan 19 '21 10:01 christopheronco

On the practical level, there's nothing the server can do about this (given you're using TCP) -- it can't go back and reconnect. Nor can it re-send the notification when the link comes back: when the client reconnects, it's on a different endpoint than before (for illustrative purposes, consider that the client has a different port), and the notification would not be valid there. Whenever the client reconnects, it has to issue all its observations anew.
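To illustrate why a pending notification has nowhere to go after a reconnect, here is a toy model (not aiocoap's actual internals) in which observations are keyed by the remote endpoint. The names `Endpoint` and `ObservationRegistry`, and the example addresses and token, are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical identity of one TCP connection's remote side."""
    host: str
    port: int


class ObservationRegistry:
    """Toy registry: observations live and die with the endpoint they came from."""

    def __init__(self):
        self._tokens_by_endpoint = {}

    def observe(self, endpoint, token):
        # Called when a GET with the Observe option arrives from `endpoint`.
        self._tokens_by_endpoint.setdefault(endpoint, set()).add(token)

    def connection_lost(self, endpoint):
        # All observations registered over that connection are gone with it.
        self._tokens_by_endpoint.pop(endpoint, None)

    def targets(self):
        # Where a notification could be sent right now.
        return sorted(
            (ep.host, ep.port, token)
            for ep, tokens in self._tokens_by_endpoint.items()
            for token in tokens
        )


registry = ObservationRegistry()
old = Endpoint("192.0.2.1", 40001)
registry.observe(old, b"\x01")

registry.connection_lost(old)          # link goes down
new = Endpoint("192.0.2.1", 40002)     # same peer, new source port
print(registry.targets())              # nothing to notify until the peer observes again

registry.observe(new, b"\x01")
print(registry.targets())
```

In this model the only way to get notifications flowing again is the step at the end: the peer re-issues its observation over the new connection.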

On the architectural level, it's not the server's job to keep trying: If the network connection fails, at some point the client will notice, and then all its running observations will raise.

On a slightly higher abstraction level than aiocoap currently provides, this would be possible -- a user could ask the library to keep giving it fresh representations of a resource, and the library would automatically reconnect while that interest is there. Such an abstraction level does not (yet) exist in aiocoap.

Does your application have meaningful Max-Age values sent with the notifications? If so, it'd be an option to trigger the observation reconnection once the max-age runs out. This is the generally applicable mode of operation that also applies to non-TCP connections. (Doing that in an automated fashion would be a step towards that higher abstraction mentioned before, but, alas, is not implemented yet either -- let me know if it'd help, though, maybe I can bubble it up on the list).
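A minimal sketch of what "re-observe once Max-Age runs out" could look like on the observing side, written against plain asyncio rather than any specific aiocoap API. Everything named here is an assumption for illustration: `start_observation` is a hypothetical callable that sets up a fresh observation and returns an async generator of `(payload, max_age)` pairs, and `max_restarts` only exists to keep the example bounded.

```python
import asyncio
from typing import AsyncGenerator, Callable, Tuple

# Hypothetical shape of one notification: (payload, Max-Age in seconds).
Notification = Tuple[bytes, float]


async def _next(gen):
    """Await the next item of an async generator (usable with wait_for)."""
    return await gen.__anext__()


async def observe_with_refresh(
    start_observation: Callable[[], AsyncGenerator[Notification, None]],
    on_update: Callable[[bytes], None],
    max_restarts: int = 3,
    initial_max_age: float = 60.0,  # CoAP's default Max-Age
) -> int:
    """Re-establish the observation whenever the last representation goes stale.

    Returns the number of times the observation was restarted.
    """
    restarts = 0
    while True:
        notifications = start_observation()
        max_age = initial_max_age
        try:
            while True:
                # Wait at most as long as the last representation stays fresh.
                payload, max_age = await asyncio.wait_for(
                    _next(notifications), timeout=max_age
                )
                on_update(payload)
        except (asyncio.TimeoutError, StopAsyncIteration):
            # Stale representation or ended observation: start over.
            pass
        finally:
            await notifications.aclose()
        if restarts >= max_restarts:
            return restarts
        restarts += 1
```

With aiocoap, `start_observation` would presumably wrap a GET request sent with the Observe option; how to map that onto this interface is left open here.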

If it does not (IMO it should, but AFAIK there are different schools of thought on this), and your client doesn't have any other traffic with the server that would flush out the resets anyway, you'd have to exchange some sort of keepalive. Right now, the easiest thing to do is to send GET empty requests; the more efficient ways (TCP keepalives or CoAP PING) are not implemented yet in aiocoap, but should be straightforward to hack in.
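A rough sketch of such an application-level keepalive, again in plain asyncio. `probe` is a hypothetical coroutine function that performs one cheap request and raises if the connection is dead; with aiocoap it could be something along the lines of awaiting `ctx.request(Message(code=GET, uri=...)).response`. Which exceptions a dead connection produces is an assumption here (`OSError` as in the traceback above, plus a timeout).

```python
import asyncio
from typing import Awaitable, Callable


async def keepalive_loop(
    probe: Callable[[], Awaitable[None]],
    interval: float,
    timeout: float,
) -> BaseException:
    """Call `probe` every `interval` seconds until it fails.

    Returns the exception that ended the loop, so the caller can decide
    whether to reconnect, re-register or re-observe.
    """
    while True:
        try:
            # A probe that hangs is treated like a failed one.
            await asyncio.wait_for(probe(), timeout)
        except (OSError, asyncio.TimeoutError) as exc:
            return exc
        await asyncio.sleep(interval)
```

The caller would run this alongside its observations and treat the returned exception as the signal to set everything up again.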

Jan 19 '21 10:01 chrysn

My application is a Lightweight M2M client (over TCP).

In this protocol, the LwM2M client registers with the server and then updates its registration from time to time (CoAP POST requests from LwM2M client to LwM2M server). Once registration is done, the server accesses client resources using CoAP GET, PUT, POST, ... methods (CoAP requests from LwM2M server to LwM2M client). That's why I said my application is a CoAP server, even though it is the one that initiates the connection.

After a registration from the client, the server reads all client resources once and observes some of them. Notifications from the LwM2M client then let the server know that something has changed in the client resources.

So I could use update-registration messages as keepalives. That is what they are in the LwM2M protocol anyway: if the server does not receive such a message within a configurable amount of time, it forces the client to register again (and reads all resources once again).

When connected over cellular in busy network environments, we often have short network interruptions. I am looking for a way to avoid reading all client resources again after each network loss.

If no update-registration or notification has to be sent while the link is down, it works. This scenario is OK:

t0) update-registration (POST from LwM2M client)
t1) net down (Ethernet cable unplugged)
t2) net up
t3) update-registration (POST from LwM2M client)

If I have to send a notification after a disconnection and reconnection, it works. This scenario is OK even if the notification is the first message after the network comes back up:

t0) update-registration
t1) observe from LwM2M server
t2) net down (Ethernet cable unplugged)
t3) net up
t4) updated_state -> notification sent
t5) update-registration

After what you replied, I thought this would not work. But it does, even though I assume that the socket used for the observe and the socket used to send the notification are not the same. The notification is correctly received by the LwM2M server and taken into account. I don't know what happens at socket level; I assume aiocoap recreates a socket to send the notification. Do you think this is normal?

If I have to send a notification during a network disconnection, the notification is lost and never retried, and the application is not aware of that:

t0) update-registration
t1) observe from LwM2M server
t2) net down (Ethernet cable unplugged)
t3) updated_state -> notification sent, failed
t4) net up
t5) update-registration

After the update-registration, the link to the server is back. I would like to resend the lost notification at that time (or maybe earlier, if I am able to detect the net-up event).

Maybe I should not use the updated_state method and instead send the notification myself, so that I can handle error cases as I want (but I don't know how to send a CoAP notify message). Is it possible to send a notification from the application using aiocoap?

Jan 19 '21 16:01 christopheronco

I see and will get back to you. In short (from mobile): This is one of the "rare role reversal" situations. The planned resolution is still not to do anything on notify failures, but to add something after the registration that will indicate the shutdown there.

Jan 19 '21 16:01 chrysn

Coming back at this, here's how I plan to accommodate this use case, and it'd be helpful if you could tell me whether you could work with that:

Notifications go away as they used to.
Messages already have a .remote object that represents the peer the message is to be exchanged with.
The .remote of some transports (that is, the stateful ones) will gain a .keepalive() coroutine. That coroutine never terminates, but ensures that the connection is not closed from our side (which currently doesn't happen anyway and might happen in future if there are no pending requests or observations on it).
Running .keepalive() raises an exception if the connection goes away for some reason (be it that the peer shut it down or the network was interrupted in some way).

In the end, all this should allow the LwM2M client to run its main loop about like this:

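while True:
    # (Re-)register with the LwM2M server
    registration_message = Message(code=POST, uri=server_url)
    response = await ctx.request(registration_message).response_raising
    try:
        # Planned API: runs until the connection goes away, then raises
        await response.remote.keepalive()
    except aiocoap.ConnectionLost:
        # Connection is gone: loop around and register again
        continue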

Would this work for you?

(For those interested in the details: For DTLS, the connection pool was recently enhanced for easier cleanup, and connections can now get terminated early unless the addresses are kept around, which usually happens in Python. Applying this to TCP is pending, and might even result in reduced functionality for this given case if the server does not establish the observations in a timely manner).

Jan 23 '21 09:01 chrysn

Small change of plans / note to self: it might rather be

while True:
    ...
    try:
        async with response.remote.keepalive():
            await asyncio.sleep(timeout)
    except aiocoap.ConnectionLost:
        continue

because having a coroutine that has an effect while it runs is strange, whereas that's just what contexts do. (Also because LwM2M is based on the resource directory, and unless that implements some very specific extension, the RD endpoint is supposed to renew its registration after whichever timeout it picked initially).
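For illustration only, here is how a context manager with those semantics can be modelled in plain asyncio. This is a stand-alone sketch, not the planned aiocoap implementation: `ConnectionLost` is a local stand-in class, and the `lost` event is a hypothetical hook that a transport would set when the connection goes away.

```python
import asyncio
import contextlib


class ConnectionLost(Exception):
    """Local stand-in for the exception named in the plan above."""


@contextlib.asynccontextmanager
async def keepalive(lost: asyncio.Event):
    """Interrupt the enclosed block with ConnectionLost once `lost` is set."""
    body = asyncio.current_task()

    async def watch():
        await lost.wait()
        body.cancel()  # wake the block up wherever it is currently waiting

    watcher = asyncio.create_task(watch())
    try:
        yield
    except asyncio.CancelledError:
        if not lost.is_set():
            raise  # cancelled for an unrelated reason: pass it on
        if hasattr(body, "uncancel"):  # Python 3.11+
            body.uncancel()
        raise ConnectionLost("connection went away") from None
    finally:
        watcher.cancel()


async def demo(loss_after: float, timeout: float) -> str:
    lost = asyncio.Event()
    asyncio.get_running_loop().call_later(loss_after, lost.set)
    try:
        async with keepalive(lost):
            await asyncio.sleep(timeout)
    except ConnectionLost:
        return "connection lost, register again"
    return "timeout reached, renew registration"


if __name__ == "__main__":
    print(asyncio.run(demo(loss_after=0.01, timeout=1.0)))
    print(asyncio.run(demo(loss_after=1.0, timeout=0.01)))
```

Both exits from the block lead back to the top of the registration loop; the context manager only determines how early the loop notices that the connection is gone.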

Jan 23 '21 18:01 chrysn