
[BUG]: vsomeip slow to restart with lots of EventGroups

Open joeyoravec opened this issue 1 year ago • 2 comments

vSomeip Version

v3.4.10

Boost Version

1.82

Environment

Android and QNX

Describe the bug

My automotive system has a *.fidl with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute to a unique EventGroup.

Especially when resuming from suspend-to-ram it is possible that UDP SOMEIP-SD is operational while the TCP socket is broken. This leads to a tce (tcp_client_endpoint) restart(), and during this window every Subscribe receives a SubscribeNack in response:

4191	105.781314	10.6.0.10	10.6.0.3	SOME/IP-SD	1408	SOME/IP Service Discovery Protocol [Subscribe]
4192	105.790868	10.6.0.3	10.6.0.10	SOME/IP-SD	1396	SOME/IP Service Discovery Protocol [SubscribeNack]
4193	105.792094	10.6.0.10	10.6.0.3	SOME/IP-SD	1410	SOME/IP Service Discovery Protocol [Subscribe]
4194	105.801525	10.6.0.10	10.6.0.3	SOME/IP-SD	1410	SOME/IP Service Discovery Protocol [Subscribe]
4195	105.802118	10.6.0.3	10.6.0.10	SOME/IP-SD	1398	SOME/IP Service Discovery Protocol [SubscribeNack]
4196	105.819610	10.6.0.3	10.6.0.10	SOME/IP-SD	1398	SOME/IP Service Discovery Protocol [SubscribeNack]

As the number of EventGroups scales into the thousands, this becomes catastrophic for performance.

In service_discovery_impl::handle_eventgroup_subscription_nack() each EventGroup calls restart(): https://github.com/COVESA/vsomeip/blob/cf497232adf84f55947f7a24e1b64e04b49f1f38/implementation/service_discovery/src/service_discovery_impl.cpp#L2517-L2521
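
A toy illustration of that amplification (this is not vsomeip code; `endpoint`, `restart_calls`, and the 3500 figure are stand-ins matching the description above): one burst of SubscribeNacks after resume ends in one restart() call per configured EventGroup on the same shared TCP endpoint.

```cpp
#include <cstddef>
#include <memory>

// Toy stand-in for the shared reliable (TCP) client endpoint.
struct endpoint {
    void restart() { ++restart_calls; }
    std::size_t restart_calls{0};
};

int main() {
    auto reliable = std::make_shared<endpoint>();
    const std::size_t num_eventgroups = 3500;   // roughly one per CAN signal, as above

    // What the nack handling effectively does while the TCP socket is down:
    // one SubscribeNack per EventGroup, each ending in a restart() call.
    for (std::size_t eg = 0; eg < num_eventgroups; ++eg) {
        reliable->restart();
    }
    // reliable->restart_calls is now 3500: one endpoint restart attempt per nack.
}
```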

In tcp_client_endpoint_impl::restart(), while the endpoint is ::CONNECTING the code will "early terminate" for at most 5 restart calls: https://github.com/COVESA/vsomeip/blob/cf497232adf84f55947f7a24e1b64e04b49f1f38/implementation/endpoints/src/tcp_client_endpoint_impl.cpp#L77-L85

Thereafter the code falls through, calling shutdown_and_close_socket_unlocked() and performing a full restart even while a connection attempt is still in progress.
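
A minimal sketch of that guard pattern as I read it (paraphrased, not the actual implementation; the class, member names, and the helper bodies are placeholders): restart() is cheap only for the first few calls while a connect is pending, then every further call pays for a full socket teardown and reconnect.

```cpp
#include <cstdint>

class tcp_client_endpoint_sketch {
public:
    void restart() {
        if (state_ == cei_state_e::CONNECTING) {
            // Cheap path, but only for the first few calls while a connect is pending.
            if (++aborted_restart_calls_ <= max_aborted_restart_calls_) {
                return;
            }
            // Threshold exceeded: fall through to a full restart even though
            // the previous connection attempt is still in flight.
        }
        aborted_restart_calls_ = 0;
        state_ = cei_state_e::CONNECTING;
        shutdown_and_close_socket();   // expensive: tear down and reopen the socket
        connect_async();
    }

private:
    enum class cei_state_e { CLOSED, CONNECTING, CONNECTED };

    void shutdown_and_close_socket() { /* close(fd), cancel timers, ... */ }
    void connect_async()             { /* boost::asio async_connect(...) */ }

    cei_state_e state_{cei_state_e::CLOSED};
    std::uint32_t aborted_restart_calls_{0};
    static constexpr std::uint32_t max_aborted_restart_calls_{5};   // the "maximum 5 restarts" above
};
```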

As the system continues processing thousands of SubscribeNacks, this turns into a tight loop at 100% CPU load that takes multiple seconds to plow through the workload. That can easily exceed a 2 s ServiceDiscovery interval and cascade into further problems.
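
For a rough sense of scale (illustrative numbers, not measurements): if each fall-through restart costs on the order of 1 ms of socket teardown and reconnect setup, ~3500 NACKed EventGroups already amount to roughly 3.5 s of serialized work on the endpoint, which alone overruns a 2 s ServiceDiscovery interval before any retransmitted Subscribes are counted.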

Reproduction Steps

My reproduction was:

  • start with fully-established communication between tse and tce
  • tce enters suspend-to-ram with the TCP socket established
  • allow tse to continue running, exceed the TCP keepalive timeout, and close the TCP socket
  • tce resumes from suspend-to-ram believing the TCP socket is still established, then discovers it is closed

but any use case where tse closes the TCP socket while UDP remains functional should be sufficient.

Expected behaviour

Performance should be better: a burst of SubscribeNacks should not trigger one full TCP endpoint restart per EventGroup, and recovery should complete well within the ServiceDiscovery interval.

Logs and Screenshots

No response

joeyoravec commented May 04 '24 01:05

We came up with three possible solutions:

  1. eliminate the tce restart() call from service_discovery_impl::handle_eventgroup_subscription_nack(). It's not clear why this call is required or how it helps
  2. modify tce restart() to "early terminate" better, perhaps an unlimited number of times within the 5 second timeout (see the sketch after this list)
  3. ensure that SOMEIP-SD is inhibited around any event, like suspend-to-ram, where network communication will be lost; try to prevent Subscribe until the TCP socket is re-established
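
For option 2, a sketch of what "early terminate better" could look like (an assumption about one possible shape of the fix, not a patch; names and the 5 s value are placeholders): key the early return off the age of the in-flight connect attempt instead of a call counter, so an arbitrarily large burst of nacks costs only a timestamp comparison each.

```cpp
#include <chrono>

class tcp_client_endpoint_sketch {
public:
    void restart() {
        const auto now = std::chrono::steady_clock::now();
        if (state_ == cei_state_e::CONNECTING &&
            now - connect_started_ < connect_timeout_) {
            return;   // a connect is already in flight and not yet timed out: do nothing
        }
        state_ = cei_state_e::CONNECTING;
        connect_started_ = now;
        shutdown_and_close_socket();   // full restart only once per connect window
        connect_async();
    }

private:
    enum class cei_state_e { CLOSED, CONNECTING, CONNECTED };

    void shutdown_and_close_socket() { /* close(fd), cancel timers, ... */ }
    void connect_async()             { /* boost::asio async_connect(...) */ }

    cei_state_e state_{cei_state_e::CLOSED};
    std::chrono::steady_clock::time_point connect_started_{};
    std::chrono::seconds connect_timeout_{5};   // assumed value, matching the 5 s mentioned above
};
```

With this shape, the per-EventGroup restart() call in handle_eventgroup_subscription_nack() could stay as-is, since all but the first call per connect window become no-ops.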

Interested in feedback on what would be most effective.

joeyoravec commented May 04 '24 01:05