[BUG]: vsomeip slow to establish communication with lots of EventGroups
vSomeip Version
v3.4.10
Boost Version
1.82
Environment
Android and QNX
Describe the bug
My automotive system has a *.fidl file with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute to its own EventGroup.
Any time the network connection is established, or broken and re-established, I get an avalanche of ~3500 subscribes, followed by ~3500 acknowledgements, transmitted one per frame. The entire sequence does not fit inside a 2-second Service Discovery interval. When the work does not complete within that interval, routingmanager issues StopSubscribe and SubscribeNack. The system retries, but recovery takes a long time, at least a couple of Service Discovery intervals.
The train logic is supposed to aggregate these messages, sending a train only when it is full or 5 ms have elapsed, but there are several places in the code that prevent this.
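To make the intended behaviour concrete, here is a minimal sketch of a "depart when full or after a timeout" buffer. It is not vsomeip's actual train implementation: the `Train` class, its method names, and the idea of passing the current time in explicitly are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration only; not the vsomeip train class.
class Train {
public:
    using Clock = std::chrono::steady_clock;
    using Bytes = std::vector<std::uint8_t>;

    Train(std::size_t max_bytes, std::chrono::milliseconds max_wait)
        : max_bytes_(max_bytes), max_wait_(max_wait) {}

    // Queue one message. If it would overflow the current frame, the frame
    // built so far is returned for sending and a new frame is started.
    // Returns an empty vector when nothing has to be sent yet.
    Bytes add(const Bytes& msg, Clock::time_point now) {
        Bytes departed;
        if (!buffer_.empty() && buffer_.size() + msg.size() > max_bytes_) {
            departed = std::move(buffer_);
            buffer_.clear();
        }
        if (buffer_.empty()) {
            first_queued_ = now;  // the wait starts with the first queued message
        }
        buffer_.insert(buffer_.end(), msg.begin(), msg.end());
        return departed;
    }

    // Called from a timer: returns the pending frame once the oldest queued
    // message has waited at least max_wait, otherwise an empty vector.
    Bytes poll(Clock::time_point now) {
        if (!buffer_.empty() && now - first_queued_ >= max_wait_) {
            Bytes departed = std::move(buffer_);
            buffer_.clear();
            return departed;
        }
        return {};
    }

private:
    std::size_t max_bytes_;
    std::chrono::milliseconds max_wait_;
    Bytes buffer_;
    Clock::time_point first_queued_{};
};
```

With a buffer like this, a burst of Subscribe messages queued within the same 5 ms window would leave as a few large frames rather than one frame per message.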
Reproduction Steps
This behavior is easily reproduced when the system has a *.fidl with thousands of attributes and the *.fdepl puts each one into a unique EventGroup.
Subscribe to all ~3500 attributes, run ifconfig down; sleep 10; ifconfig up to break and re-establish the network connection, then inspect the tcpdump capture to observe the network behavior.
Expected behaviour
The train logic should do a "pretty good job" of aggregating many SUBSCRIBE and many SUBSCRIBEACK entries into each Service Discovery packet.
Logs and Screenshots
With the existing code you should see thousands of back-to-back SUBSCRIBE messages like:
5039 9.333908 10.6.0.3 10.6.0.10 SOME/IP-SD 86 SOME/IP Service Discovery Protocol [SubscribeNack]
5040 9.334271 10.6.0.10 10.6.0.3 SOME/IP-SD 104 SOME/IP Service Discovery Protocol [Subscribe]
5041 9.335307 10.6.0.10 10.6.0.3 SOME/IP-SD 98 SOME/IP Service Discovery Protocol [Subscribe]
5042 9.335710 10.6.0.10 10.6.0.3 SOME/IP-SD 114 SOME/IP Service Discovery Protocol [Subscribe]
5043 9.336492 10.6.0.10 10.6.0.3 SOME/IP-SD 98 SOME/IP Service Discovery Protocol [Subscribe]
5044 9.336762 10.6.0.10 10.6.0.3 TCP 66 36651 → 30510 [FIN, ACK] Seq=142 Ack=1 Win=64256 Len=0 TSval=269564273 TSecr=2
Each is ~98 bytes and sent as a separate packet, with nothing or almost nothing aggregated. In this region we see a SubscribeNack and a socket close because the entire sequence exceeded the 2 s Service Discovery timeout interval.
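As a rough sanity check on the timing, the sketch below takes the four Subscribe timestamps from the capture excerpt above and extrapolates to ~3500 messages. Four packets is a tiny sample, so treat the result as an order-of-magnitude estimate only; the function names are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Mean gap in seconds between consecutive capture timestamps.
double mean_gap(const std::vector<double>& timestamps) {
    if (timestamps.size() < 2) {
        return 0.0;
    }
    return (timestamps.back() - timestamps.front()) /
           static_cast<double>(timestamps.size() - 1);
}

// Estimated time to send `count` messages at the observed mean spacing.
double estimated_burst_seconds(const std::vector<double>& timestamps,
                               std::size_t count) {
    return mean_gap(timestamps) * static_cast<double>(count);
}
```

Fed with 9.334271, 9.335307, 9.335710 and 9.336492, the mean gap is roughly 0.74 ms, which extrapolates to about 2.6 s for 3500 Subscribes alone, before any acknowledgements are counted. That is consistent with the sequence overrunning the 2 s interval.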
I've opened draft pull requests:
- #671
- #670
with the code changes that I've applied locally to address this issue. I would appreciate any feedback on the approach.
I've updated the pull request for 3.4.x (but not 3.1.x) with an additional commit for a problem discovered in testing. I was getting this warning:
Received an unreliable vSomeIP SD message with too short length field local: 10.6.0.10:30490 remote: 10.6.0.3:30490
and the root-cause was here: https://github.com/COVESA/vsomeip/blob/6c0e9db200fbcfd37879c4b2ff0c8523a29d8eb5/implementation/endpoints/src/udp_server_endpoint_impl.cpp#L682-L690
on_message_received supports multiple messages in a single UDP frame but only processes a message:
- if the message is not SOMEIP-SD
- else if the message is SOMEIP-SD and there’s no subsequent message in the frame
After changing the train logic to aggregate multiple SOMEIP-SD messages into a single UDP frame, we want it to process all messages found in the frame, regardless of whether they are SOMEIP or SOMEIP-SD.
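The desired behaviour can be sketched as a loop that walks every message in the datagram and hands each one on, whether or not it is Service Discovery. This is not the code from the pull request or from `udp_server_endpoint_impl.cpp`; `MessageView` and `split_frame` are hypothetical names. The header layout it relies on comes from the SOME/IP specification: a 16-byte header, a big-endian length field at offset 4 that counts the bytes from the Request ID onward, and message ID 0xFFFF8100 for SD.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view onto one SOME/IP message inside a received UDP frame.
struct MessageView {
    const std::uint8_t* data;
    std::size_t size;
    bool is_sd;
};

constexpr std::size_t kHeaderSize = 16;            // full SOME/IP header
constexpr std::size_t kLengthOffset = 4;           // length field position
constexpr std::uint32_t kBytesNotInLength = 8;     // message ID + length field
constexpr std::uint32_t kSdMessageId = 0xFFFF8100; // service 0xFFFF, method 0x8100

inline std::uint32_t read_be32(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Returns every complete message in the frame, SD or not. Parsing stops at
// the first header whose length field is implausible or overruns the frame.
std::vector<MessageView> split_frame(const std::uint8_t* frame,
                                     std::size_t frame_size) {
    std::vector<MessageView> messages;
    std::size_t offset = 0;
    while (frame_size - offset >= kHeaderSize) {
        const std::uint8_t* msg = frame + offset;
        const std::uint32_t length = read_be32(msg + kLengthOffset);
        if (length < kHeaderSize - kBytesNotInLength) {
            break;  // length must at least cover the rest of the header
        }
        const std::uint64_t total =
            static_cast<std::uint64_t>(length) + kBytesNotInLength;
        if (total > frame_size - offset) {
            break;  // message claims more bytes than the frame holds
        }
        messages.push_back({msg, static_cast<std::size_t>(total),
                            read_be32(msg) == kSdMessageId});
        offset += static_cast<std::size_t>(total);
    }
    return messages;
}
```

The caller would then dispatch each returned view in order, so an SD message followed by further SD messages in the same frame is no longer dropped.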
Hi @joeyoravec, I have been trying to reproduce your problem in my environment so that we can validate the fix, but I am having some problems. I used one of the CommonAPI examples (link) to achieve this, with the following configurations: example_configs.zip
Can you check whether these make sense? Or provide the ones you used so that I can check them.
Thanks!