zephyr
zephyr copied to clipboard
Reordered net context connect operation #69145
Moved socket offload connect ahead of non offload connect implementation to fix issue with multiple network interfaces See #69145
Hello @matt-wood-ct, and thank you very much for your first pull request to the Zephyr project! Our Continuous Integration pipeline will execute a series of checks on your Pull Request commit messages and code, and you are expected to address any failures by updating the PR. Please take a look at our commit message guidelines to find out how to format your commit messages, and at our contribution workflow to understand how to update your Pull Request. If you haven't already, please make sure to review the project's Contributor Expectations and update (by amending and force-pushing the commits) your pull request if necessary. If you are stuck or need help please join us on Discord and ask your question there. Additionally, you can escalate the review when applicable. 😊
Renamed commit to follow guidelines
Forgot to mention earlier, you should add Fixes #69145
line to commit message body so that the issue gets linked to the commit (helps when generating release notes).
Overall, I am not convinced about the change. The checks in the original code were done before the offloaded call in order to check the addresses etc. Now we are not doing those checks for offloaded interface. Where/why exactly the crash happens, could we fix things there instead of this shortcut?
Agreed, plus when it comes to NET_OFFLOAD and mutliple interfaces it's not really supported well, there's a more serious issue on the table: https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/net/ip/net_context.c#L456 So any crashes in such setup could be as well a consequence of that.
such setup
Thanks for the feedback, I think you are both right. I've been doing more testing and my "fix" does indeed just make other things worse, I am surprised it worked at all. I will continue my testing and see if I can find a better fix. I had not spotted that oddity and you may be right that might be the root cause of the issues I am encountering.
Either way I must find a workable solution as the product I am working on does have all 3 interfaces and we do need to switch between the in runtime 😢
Either way I must find a workable solution as the product I am working on does have all 3 interfaces and we do need to switch between the in runtime 😢
Check this comment where the user describes how they handle this limitation - as a workaround, they call net_if_set_default()
before creating socket. I have to admit it's not perfect, but might help.
Either way I must find a workable solution as the product I am working on does have all 3 interfaces and we do need to switch between the in runtime 😢
Check this comment where the user describes how they handle this limitation - as a workaround, they call
net_if_set_default()
before creating socket. I have to admit it's not perfect, but might help.
Hi Robert, Unfortunately we are doing that and we still get crashes from inside the stack, the scenario current works as follows:
- Wifi comes up (socket offload driver)
- sockets are working fine
- GSM comes up (PPP driver)
- wifi set as default
- socket connect causes bus fault
So currently we are getting crashes with the default configured correctly, any suggestions? (I am still intently debugging of course)
So currently we are getting crashes with the default configured correctly, any suggestions? (I am still intently debugging of course)
It would be really helpful if you could pin point the exact line where the crash happens, without this info it is difficult to suggest anything.
Network ifaces (to show interface numbering):
uart:~$ net iface
Interface eth0 (0x80092260) (Ethernet) [1]
===================================
Interface is down.
Interface net0 (0x800922e0) (IP Offload) [2]
=====================================
Interface is down.
Interface ppp0 (0x80092360) (PPP) [3]
==============================
Interface is down.
Log when wifi is enabled and comes up first
[00:00:16.162,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:16.262,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:16.500,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:00:21.742,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:21.842,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:22.128,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:00:25.459,000] <inf> lib_sim7070: Bringing interfaces up! Successfully started PPP
[00:00:25.471,000] <inf> lib_sim7070: AT channel 2 connected to GSM_2
[00:00:27.370,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:27.470,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:27.571,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:36382 iface 3 (0x80092360)
So in the above log the wifi is up and tested (testing consists of creating a socket, opening it then closing it to 1.1.1.1).
Before each test net_if_set_default
is called with the relevant iface. But you can see it attempted to bind using the PPP interface even though the wifi was set as default...
Hope this log helps to explain the series of events more clearly
Edit:
Here is the code for the testInterface
function:
bool testInterface(supportedNetworkInterface_t *interface)
{
bool retVal = false;
int sock;
int err;
struct sockaddr_in addr4;
struct net_if* originalDefault = net_if_get_default();
LOG_DBG("Test %s (%p)", interface->name, interface->interface);
net_if_set_default(interface->interface);
k_msleep(100);
addr4.sin_family = AF_INET;
addr4.sin_port = htons(NET_TEST_IPV4_PEER_PORT);
inet_pton(AF_INET, NET_TEST_IPV4_PEER,
&addr4.sin_addr);
sock = socket( addr4.sin_family,
SOCK_STREAM, IPPROTO_TCP );
LOG_DBG("Create socket %d", sock);
k_msleep(100);
if (0 <= sock)
{
err = connect( sock, (struct sockaddr*) &addr4, sizeof( addr4 ) );
LOG_DBG("Connect socket %d", err);
close( sock );
if (err >= 0)
{
/* socket connected, connection is OK */
retVal = true;
}
}
net_if_set_default(originalDefault);
return retVal;
}
So currently we are getting crashes with the default configured correctly, any suggestions? (I am still intently debugging of course)
It would be really helpful if you could pin point the exact line where the crash happens, without this info it is difficult to suggest anything.
Sorry just saw your question, the actual crash is here:
conn going into
tcp_in
is null and it tries to access a member of conn
which I believe causes the usage fault
conn going into
tcp_in
is null and it tries to access a member ofconn
which I believe causes the usage fault
Hmm, but we access the data behind conn pointer in previous line and it does not crash there?
conn going into
tcp_in
is null and it tries to access a member ofconn
which I believe causes the usage faultHmm, but we access the data behind conn pointer in previous line and it does not crash there?
I don't think it does, that's just how the PC is rendered by the debugger, I interpret that to mean it crashed on the line before.
As a confirmation I added in a null trap and it did indeed stop the crash (but of course the network didn't work)
Ok, after another hour or so of bashing my head against this I've discovered the context inside net_context_connect
literally changes to iface 3 (GSM PPP) after running the function bind_default
, see the log below:
[00:00:29.274,000] <dbg> net_ctx: net_context_connect: CONNECT IFACE: 2
[00:00:29.274,000] <dbg> net_ctx: net_context_connect: 1200 CONNECT IFACE: 2
[00:00:29.274,000] <dbg> net_ctx: net_context_connect: 1282 CONNECT IFACE: 2
[00:00:29.275,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:61425 iface 3 (0x80092360)
[00:00:29.275,000] <dbg> net_ctx: net_context_connect: 1287 CONNECT IFACE: 3
[00:00:29.275,000] <dbg> net_ctx: net_context_connect: 1299 CONNECT IFACE: 3
snippet of code from the function to show where those prints are:
rport = addr4->sin_port;
LOG_DBG("1282 CONNECT IFACE: %d", context->iface);
ret = bind_default(context);
if (ret) {
goto unlock;
}
LOG_DBG("1287 CONNECT IFACE: %d", context->iface);
net_sin_ptr(&context->local)->sin_family = AF_INET;
so net_context_connect
prepare the wifi interface for this connection then near the end binds it to the PPP connection which of course since its partually configured does not work (and crashes if I remove my NULL trap)
Ok, after another hour or so of bashing my head against this I've discovered the context inside net_context_connect literally changes to iface 3 (GSM PPP) after running the function bind_default, see the log below:
I've been investigating the case and was going to ask about this, it seems that with current implementation if net_context_bind()
for whatever reason decides that non-offloaded interface is a better choice to communicate with the remote, it'd overwrite the interface set on the context. And that is bad, as we don't get the TCP context allocated then.
Could you check if the following patch helps?
diff --git a/subsys/net/ip/net_context.c b/subsys/net/ip/net_context.c
index ae7a4a4e848..fb6e1a51237 100644
--- a/subsys/net/ip/net_context.c
+++ b/subsys/net/ip/net_context.c
@@ -542,7 +542,7 @@ int net_context_get(sa_family_t family, enum net_sock_type type, uint16_t proto,
return ret;
}
- net_context_set_iface(*context, net_if_get_default());
+ net_context_bind_iface(*context, net_if_get_default());
}
return 0;
The change just makes sure that the current net_context is permanently bound to the interface, which should be the case of offloaded contexts.
Ok, after another hour or so of bashing my head against this I've discovered the context inside net_context_connect literally changes to iface 3 (GSM PPP) after running the function bind_default, see the log below:
I've been investigating the case and was going to ask about this, it seems that with current implementation if
net_context_bind()
for whatever reason decides that non-offloaded interface is a better choice to communicate with the remote, it'd overwrite the interface set on the context. And that is bad, as we don't get the TCP context allocated then.Could you check if the following patch helps?
diff --git a/subsys/net/ip/net_context.c b/subsys/net/ip/net_context.c index ae7a4a4e848..fb6e1a51237 100644 --- a/subsys/net/ip/net_context.c +++ b/subsys/net/ip/net_context.c @@ -542,7 +542,7 @@ int net_context_get(sa_family_t family, enum net_sock_type type, uint16_t proto, return ret; } - net_context_set_iface(*context, net_if_get_default()); + net_context_bind_iface(*context, net_if_get_default()); } return 0;
The change just makes sure that the current net_context is permanently bound to the interface, which should be the case of offloaded contexts.
Hi Robert,
You are a true legend, this fix does appear to work!!!!
See log below
[00:00:57.028,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:00:57.128,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:00:57.426,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:02.668,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:02.768,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:03.066,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:08.308,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:08.408,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:08.706,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:09.506,000] <inf> lib_sim7070: Bringing interfaces up! Successfully started PPP
[00:01:09.515,000] <inf> lib_sim7070: AT channel 2 connected to GSM_2
[00:01:13.948,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:14.048,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:14.346,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:14.588,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:01:14.688,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:14.788,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:45924 iface 3 (0x80092360)
[00:01:15.819,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:15.832,000] <inf> COR_NET_MAIN: Interface 'Cellular' has started working
[00:01:17.908,000] <dbg> net_ctx: net_context_unref: (tcp_work): Context 0x80025480 released
[00:01:20.832,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:01:20.932,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:21.258,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:21.500,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:01:21.600,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:01:21.700,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:45845 iface 3 (0x80092360)
[00:01:22.635,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:01:24.490,000] <dbg> net_ctx: net_context_unref: (tcp_work): Context 0x80025480 released
However I did just test that patch with ethernet and plugging that in stops PPP from working and ethernet does not work:
[00:07:37.071,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:07:37.172,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:37.428,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:07:37.670,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:07:37.770,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:37.870,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:34145 iface 3 (0x80092360)
[00:07:38.646,000] <inf> eth_mcux: ETH_0 enabled 10M full-duplex mode.
[00:07:38.649,000] <dbg> net_ctx: net_context_unref: (rx_q[0]): Context 0x800253e0 released
[00:07:38.649,000] <dbg> net_ctx: net_context_bind: (rx_q[0]): Context 0x800253e0 binding to UDP 0.0.0.0:42058 iface 3 (0x80092360)
[00:07:38.651,000] <inf> net_dhcpv4: Received: 10.0.4.99
[00:07:38.652,000] <inf> net_config: IPv4 address: 10.0.4.99
[00:07:38.652,000] <inf> net_config: Lease time: 172800 seconds
[00:07:38.652,000] <inf> net_config: Subnet: 255.255.255.0
[00:07:38.652,000] <inf> net_config: Router: 10.0.4.254
[00:07:40.469,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:40.469,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:40.469,000] <wrn> COR_NET_MAIN: Interface 'Cellular' is up but has stopped working
[00:07:45.469,000] <dbg> COR_NET_MAIN: testInterface: Test Ethernet (0x80092260)
[00:07:45.570,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:45.670,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:34096 iface 3 (0x80092360)
[00:07:48.107,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:48.108,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:48.108,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:07:48.208,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:48.456,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:07:48.698,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:07:48.798,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:48.898,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:58356 iface 3 (0x80092360)
[00:07:51.766,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:51.766,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:56.766,000] <dbg> COR_NET_MAIN: testInterface: Test Ethernet (0x80092260)
[00:07:56.867,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:56.967,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:56685 iface 3 (0x80092360)
[00:07:59.494,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
[00:07:59.495,000] <dbg> net_ctx: net_context_unref: (cor_net_main_initializeThread): Context 0x80025480 released
[00:07:59.495,000] <dbg> COR_NET_MAIN: testInterface: Test Wi-Fi (0x800922e0)
[00:07:59.595,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:07:59.964,000] <dbg> COR_NET_MAIN: testInterface: Connect socket 0
[00:08:00.205,000] <dbg> COR_NET_MAIN: testInterface: Test Cellular (0x80092360)
[00:08:00.306,000] <dbg> COR_NET_MAIN: testInterface: Create socket 0
[00:08:00.406,000] <dbg> net_ctx: net_context_bind: (cor_net_main_initializeThread): Context 0x80025480 binding to TCP 0.0.0.0:62564 iface 3 (0x80092360)
[00:08:03.333,000] <dbg> COR_NET_MAIN: testInterface: Connect socket -1
Looks like it is trying to bind PPP when it tries to use ETH, so similar issue but slightly different
Looks like it is trying to bind PPP when it tries to use ETH, so similar issue but slightly different
Does the IP address remain registered on the PPP interface when it goes down? As before, on binding net stack will try to select best interface based on remote/local addresses, so if the "best match" remains on the PPP interface, it'd be kept being selected.
Or, you can also enforce that a specific "native" (non-offloaded) socket is bound to a specific interface by calling setsockopt()
with SO_BINDTODEVICE
socket option before connecting.
I did just test that patch with ethernet and plugging that in stops PPP from working
That sounds quite fishy, I don't see any obvious reason why would that be happening...
setsockopt
Now this is interesting, thank you so much for pointing out this socket option, setting this does seem to make the setup work even without the patch you described. Still testing but early signs are promising...
Can confirm setting the default interface and using the socket option SO_BINDTODEVICE
selecting between my 3 network interfaces appears to work reliably, thank you so much for your help Robert and Jukka
Can confirm setting the default interface and using the socket option
SO_BINDTODEVICE
selecting between my 3 network interfaces appears to work reliably, thank you so much for your help Robert and Jukka
You are welcome, I am glad the issue is sorted. If everything is working ok, please close the issue and the PR.