python-can icon indicating copy to clipboard operation
python-can copied to clipboard

Socketcan uses incorrect flag for direction

Open chemicstry opened this issue 3 years ago • 13 comments

The socketcan implementation is using MSG_DONTROUTE flag instead of MSG_CONFIRM flag to set the is_rx message field. This breaks applications where virtual can interface is used, because all messages originate from the same host and thus is_rx is always false. IMO the correct behavior should be to use MSG_CONFIRM flag instead, but this could break some existing applications if they use two instances of the same socket simultaneously. Opinions?

MSG_DONTROUTE and MSG_CONFIRM docs (section 4.1.7): https://www.kernel.org/doc/Documentation/networking/can.txt

chemicstry avatar Mar 04 '22 00:03 chemicstry

The socketcan implementation is using MSG_DONTROUTE flag instead of MSG_CONFIRM flag to set the is_rx message field. This breaks applications where virtual can interface is used, because all messages originate from the same host and thus is_rx is always false.

But this is the correct status. The difference is

  1. is the CAN frame created in that host
  2. is the CAN frame received from the outside (e.g. received by a real CAN interface)

As vcan traffic can only occur when it is created on the host, vcan traffic is always "TX" type traffic.

IMO the correct behavior should be to use MSG_CONFIRM flag instead, but this could break some existing applications if they use two instances of the same socket simultaneously. Opinions?

You correctly point out the problem. MSG_CONFIRM is set when the tx traffic is received by exactly the socket it has been sent on. When you have a second socket (e.g. for candump) reading that traffic the MSG_CONFIRM flag is never set on this 'receiving-only' socket.

Kernel code:

        *pflags = 0;
        if (oskb->sk) // <= only sbks generated by a socket have this sk value set
                *pflags |= MSG_DONTROUTE; // => this is local generated traffic
        if (oskb->sk == sk) // <= check if it was exactly me who generated that skb
                *pflags |= MSG_CONFIRM; // => yes. My outgoing skb has been echo'ed to me

MSG_DONTROUTE and MSG_CONFIRM docs (section 4.1.7): https://www.kernel.org/doc/Documentation/networking/can.txt

Edit: To sum it up: There is no bug IMO.

hartkopp avatar Mar 04 '22 06:03 hartkopp

I agree with what you are saying, but I can't think of a use case where the current behavior is desired.

Most of the applications use the is_rx = false messages to verify that the message was transmitted successfully and/or to get a precise transmission timestamp. So it makes sense that is_rx would only be false for messages transmitted on the current socket. I don't think anyone has encountered this problem yet, because when using a single socket with a single CAN adapter (the common case) there is no difference between MSG_CONFIRM and MSG_DONTROUTE. However, there is a problem when virtual can interface is used for unit tests as it is impossible to simulate outside traffic, all messages are is_rx = false.

I don't see any use case where the current behavior can be useful, but please tell me if there is something I don't know.

chemicstry avatar Mar 07 '22 15:03 chemicstry

Any traffic which comes from a virtual CAN interface has to be sent to this virtual CAN interface. And this can only be done from the local host.

The problem is, that the sk value (see code in my last post) is used as an identifier for this and removing this tag would break almost everything.

But if having this tag is_rx = true is that important for your unit test, I can probably provide a solution using the can-gw:

Run cangw -A -s vcan0 -d vcan1 -e (as root)

When running cansend vcan0 123#1122 in a separate terminal you can see this with candump -x -td any:

$ candump -x -td any
 (000.000000)  vcan0  TX - -  123   [2]  11 22
 (000.000070)  vcan1  RX - -  123   [2]  11 22
 (000.588690)  vcan0  TX - -  123   [2]  11 22
 (000.000026)  vcan1  RX - -  123   [2]  11 22
 (000.647955)  vcan0  TX - -  123   [2]  11 22
 (000.000026)  vcan1  RX - -  123   [2]  11 22
 (000.499139)  vcan0  TX - -  123   [2]  11 22
 (000.000073)  vcan1  RX - -  123   [2]  11 22

hartkopp avatar Mar 07 '22 16:03 hartkopp

When you then want to 'replay' a logfile to the virtual CAN that creates the desired flags:

  • send TX frames to vcan1
  • send RX frames to vcan0

hartkopp avatar Mar 07 '22 16:03 hartkopp

Thanks for the workaround, I will keep that as an option for unit tests.

However, I think that we still have some kind of miscommunication. What I'm proposing is: change is_rx = not bool(msg_flags & socket.MSG_DONTROUTE) to is_rx = not bool(msg_flags & socket.MSG_CONFIRM) at https://github.com/hardbyte/python-can/blob/3b47d42f38c7ad5adea5c78bc482df9f6c106dac/can/interfaces/socketcan/socketcan.py#L565

The current behavior: is_rx indicates that message arrived from a different machine. The suggested behavior: is_rx indicates that message arrived from a different machine or socket (application).

Or in other words, is_rx = false should be a confirmation of correct transmission for the application, but instead it is currently some weird information about kernel-level message routing.

The problem is, that the sk value (see code in my last post) is used as an identifier for this and removing this tag would break almost everything.

Could you elaborate? I know that my suggestion is a breaking change, but so far I haven't seen a use case for current behavior. Moreover, the API has a receive_own_messages parameter for socketcan bus, but there is no way to distinguish these messages, because MSG_CONFIRM flag is not exposed. It would be an option to add additional is_confirm field to Message object, but if the current MSG_DONTROUTE flag has no use, then changing is_rx behavior would keep the API more concise.

And this is not only a problem with vcan. If two applications are using one physical CAN interface, they cannot distinguish if is_rx = false message was sent by this or the other application. Sure you could check if messages are equal, but this is not robust, because they do not have any unique ID and collisions are possible.

chemicstry avatar Mar 07 '22 23:03 chemicstry

However, I think that we still have some kind of miscommunication. What I'm proposing is: change is_rx = not bool(msg_flags & socket.MSG_DONTROUTE) to is_rx = not bool(msg_flags & socket.MSG_CONFIRM) at

https://github.com/hardbyte/python-can/blob/3b47d42f38c7ad5adea5c78bc482df9f6c106dac/can/interfaces/socketcan/socketcan.py#L565

The current behavior: is_rx indicates that message arrived from a different machine. The suggested behavior: is_rx indicates that message arrived from a different machine or socket (application).

Ah, now I got it. So that every CAN frame that is not sent by your own socket/application get's an "RX" tagging, right?

So this would move the rx/tx status from the host perspective to the application perspective - as long as you use exactly one socket in your application. This change should be feasible but I'm not sure about the expectations of other python-can users on this topic.

Or in other words, is_rx = false should be a confirmation of correct transmission for the application, but instead it is currently some weird information about kernel-level message routing.

The question is whether we should introduce something new that makes it clear, that we have an confirmation frame (aka echo'ed frame) here.

The problem is, that the sk value (see code in my last post) is used as an identifier for this and removing this tag would break almost everything.

Could you elaborate? I know that my suggestion is a breaking change, but so far I haven't seen a use case for current behavior.

Maybe there's a miscommunication in the other direction too.

The reason for the message flags is described here: https://www.kernel.org/doc/html/latest/networking/can.html#raw-socket-option-can-raw-recv-own-msgs

And there were good reasons and use-cases to do it this way - including weird use-cases from people that control cranes with it. The kernel API is written into stone.

Moreover, the API has a receive_own_messages parameter for socketcan bus, but there is no way to distinguish these messages, because MSG_CONFIRM flag is not exposed. It would be an option to add additional is_confirm field to Message object, but if the current MSG_DONTROUTE flag has no use, then changing is_rx behavior would keep the API more concise.

Yes, as suggested above adding something like is_confirm could make sense - especially due to the fact that the phython-can API provides such a receive_own_messages parameter.

hartkopp avatar Mar 08 '22 08:03 hartkopp

Yes, as suggested above adding something like is_confirm could make sense - especially due to the fact that the phython-can API provides such a receive_own_messages parameter.

@chemicstry would you like to provide a patch for such an improvement?

hartkopp avatar Mar 10 '22 19:03 hartkopp

I could do that.

However, since python-can is supposed to be interface-agnostic, it would be wise to firstly define what is_rx and is_confirm mean at the API level for all interfaces. Since is_rx and is_confirm have inverted logic, it is easier to define is_tx = !is_rx and specify them as:

  • is_tx: Message was sent by any instance (socket)
  • is_confirm: Message was sent by this instance (socket)

For interfaces that do not support multiple simultaneous access (most physical adapters without socketcan), it is always true that is_tx == is_confirm. Note that I'm using is_tx just to simplify the definition, we can leave is_rx as is in the API.

If this seems OK, I will make a PR, hopefully sometime next week.

chemicstry avatar Mar 12 '22 12:03 chemicstry

For interfaces that do not support multiple simultaneous access (most physical adapters without socketcan), it is always true that is_tx == is_confirm.

I don't think so.

The feature whether a CAN interfaces 'echoes' the sent CAN frame or not does not necessarily stick on socketcan.

E.g. you are currently working on the GS_USB interface - and IIRC that interface supports the 'echo' of sent CAN frames on the CAN interface itself.

So the receive_own_messages switch is not limited to socketcan IMO.

And only when receive_own_messages is switched on, the is_confirm CAN frames might appear, right?

hartkopp avatar Mar 12 '22 15:03 hartkopp

Yeah, so if loopback/echo frames are enabled, is interface-specific. Some interfaces don't support it at all, some have it enabled by default (gs_usb), some have parameters to enable it (socketcan).

However, regardless if they are enabled, for non-shared interfaces is_confirm == !is_rx invariant always holds. For example, gs_usb interface would have is_confirm = frame.echo_id != GS_USB_NONE_ECHO_ID and is_rx = not is_confirm. If you disable echo frames (functionality not exposed for gs_usb yet), then is_confirm is always false and is_rx is always true.

Only shared interfaces (i.e. socketcan) could have is_rx == false and is_confirm == false, when on the same machine frame is sent from one socket and observed on the other.

EDIT: To complete the example, socketcan would have:

  • is_rx = not bool(msg_flags & socket.MSG_DONTROUTE)
  • is_confirm = bool(msg_flags & socket.MSG_CONFIRM)

chemicstry avatar Mar 12 '22 15:03 chemicstry

Yeah, so if loopback/echo frames are enabled, is interface-specific. Some interfaces don't support it at all, some have it enabled by default (gs_usb), some have parameters to enable it (socketcan).

Is this interface to switch the receive_own_messages setting commonly known?

So that it is fixed off for most CAN interfaces, fixed on for GS_USB and configurable for socketcan? IMO this has to be clear to the user as they could get different content.

However, regardless if they are enabled, for non-shared interfaces is_confirm == !is_rx invariant always holds.

Really? is_confirm == !is_rx means is_confirm == is_tx so MSG_DONTROUTE == MSG_CONFIRM ?

The term 'non-shared interfaces' is misleading IMO. We should define the interface capabilities by how receive_own_messages is controlled or fixed.

EDIT: To complete the example, socketcan would have:

* `is_rx = not bool(msg_flags & socket.MSG_DONTROUTE)`

* `is_confirm = bool(msg_flags & socket.MSG_CONFIRM)`

Yes - this is what I would expect as implementation in socketcan.py. :-)

hartkopp avatar Mar 12 '22 16:03 hartkopp

Hey, sorry for the long silence.

I tried making a PR for this, but there's just too much inconsistency between interfaces and fixing that would be a major breaking change. So firstly, I'm not sure if such changes would be welcome and secondly, I don't feel qualified enough to drive the API changes of the whole library.

As for the issue, I'm not sure if it's better to keep this open, in case someone takes up the job, or just close it.

chemicstry avatar Jun 01 '22 08:06 chemicstry

no need to open "receive_own_messages ", change bus.send ,haha...

old_send = bus.send
def new_send(_self, msg: Message, timeout: Optional[float] = None) -> None:
     msg.timestamp = datetime.datetime.now().timestamp()
     old_send(msg=msg, timeout=timeout)
     self.logger(msg)
bus.send = types.MethodType(new_send, bus)

luojiaaoo avatar May 20 '23 16:05 luojiaaoo