Change callbacks for wireless interfaces are wrong
I have been running through most of my testing with wired ethernet interfaces. Today I started testing with wireless interfaces. I am seeing some strange behavior that looks like a bug. I am using network manager on Linux Mint 17.1 to toggle the wireless interface on and off. Here is what I see from the object dumps in the change callbacks. It is starting from a state where wlan0 is initially disabled:
wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state down mode default carrier down
wlan0 ether <broadcast,multicast,up>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state dormant mode dormant carrier up
wlan0 ether <broadcast,multicast,up,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up,running,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state up mode dormant carrier up
Then after about 3-4 seconds I end up with this:
wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
When I investigate the state of the interface using the 'ip' command I get this:
$ ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 80:fa:5b:06:25:50 brd ff:ff:ff:ff:ff:ff
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
link/ether f8:16:54:2c:f3:6c brd ff:ff:ff:ff:ff:ff
So clearly something in linbl is not working correctly. Not sure what at this point but the cache is very suspect. I have a version I wrote using libnl sockets without the cache manager. I'll resurrect it and see if it is reproducible there as well but I am willing to bet that it is not. I tried running under valgrind to see if it could catch anything but the run produced the same result and no memory errors were reported. For what it is worth, this is 100% reproducible for me.
I think this needs some more details what you are actually doing.
Can you attach code to reproduce it? Thank you
This can be replicated using a bare bones example with cache manager and simply dumping out the cache object in the change callback for links. I will post a simple example shortly.
Ok here is the minimal program to reproduce the issue:
#include <netlink/netlink.h>
#include <netlink/cache.h>
#include <netlink/route/addr.h>
#include <netlink/route/link.h>
#include <signal.h>
//#include <netlink-private/cache-api.h>
static int quit = 0;
static struct nl_dump_params dp = {
.dp_type = NL_DUMP_DETAILS,
};
static void change_cb(struct nl_cache *cache, struct nl_object *obj,
int action, void *data)
{
if (action == NL_ACT_NEW)
printf("NEW ");
else if (action == NL_ACT_DEL)
printf("DEL ");
else if (action == NL_ACT_CHANGE)
printf("CHANGE ");
nl_object_dump(obj, &dp);
}
static void sigint(int arg)
{
quit = 1;
}
int main(int argc, char *argv[])
{
struct nl_cache_mngr *mngr;
struct nl_cache *cache;
int err, i;
struct rtnl_link* link = rtnl_link_alloc();
struct rtnl_addr* addr = rtnl_addr_alloc();
dp.dp_fd = stdout;
signal(SIGINT, sigint);
err = nl_cache_mngr_alloc(NULL, NETLINK_ROUTE, NL_AUTO_PROVIDE, &mngr);
if (err < 0) {
fprintf(stderr, "Unable to allocate cache manager: %s", nl_geterror(err));
exit(-1);
}
for (i = 1; i < argc; i++) {
err = nl_cache_mngr_add(mngr, argv[i], &change_cb, NULL, &cache);
if (err < 0) {
fprintf(stderr, "Unable to add cache %s: %s", argv[i], nl_geterror(err));
exit(-1);
}
}
while (!quit) {
int err = nl_cache_mngr_poll(mngr, 1000);
if (err < 0 && err != -NLE_INTR) {
fprintf(stderr, "Polling failed: %s", nl_geterror(err));
exit(-1);
}
//nl_cache_mngr_info(mngr, &dp);
}
rtnl_link_put(link);
rtnl_addr_put(addr);
nl_cache_mngr_free(mngr);
return 0;
}
Steps to reproduce:
- Go to the system tray and open Network Manager.
- Turn the wireless toggle to off and wait 10-20 seconds for that to propogate.
- Run the test executable as './main route/link'.
- Turn the wireless toggle back on and watch the output. You will see the corresponding output below. It will pause and look like everything is good but if you wait 20-40 seconds you will receive a notification showing the wlan0 interface is dead. You can verify that this is incorrect by running the 'ip link list' command. The interface is in fact there and quite healthy.
CHANGE wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state down mode default carrier down
CHANGE wlan0 ether <broadcast,multicast,up>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
CHANGE wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state dormant mode dormant carrier up
CHANGE wlan0 ether <broadcast,multicast,up,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
CHANGE wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up,running,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state up mode dormant carrier up
CHANGE wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
CHANGE wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state down mode dormant carrier down
CHANGE wlan0 ether <broadcast,multicast,up,running>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
CHANGE wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state dormant mode dormant carrier up
CHANGE wlan0 ether <broadcast,multicast,up,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
CHANGE wlan0 ether f8:16:54:2c:f3:6c <broadcast,multicast,up,running,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc mq index 3
txq 4 rxq 1 brd ff:ff:ff:ff:ff:ff state up mode dormant carrier up
******** After 20 - 40 seconds you will get the change below ********
CHANGE wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
Another observation here is the fact that the interface flaps to this apparently default or uninitialized state a few times even while the interface is coming up. I have a suspicion that this is related and is not supposed to be happening.
If I add a bit of code to dump the cache each time a change callback is processed I can verify that in the end the cache itself contains the same values as the object passed to the callback. Pretty sure that they are one and the same but it is another worthy data point:
CHANGE wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
==== cache dump start ====
lo loopback <loopback,up,running,lowerup> group 0
mtu 65536 txqlen 0 weight 0 qdisc noqueue index 1
txq 1 rxq 1 brd 00:00:00:00:00:00 mode default carrier up
eth0 ether 80:fa:5b:06:25:50 <broadcast,multicast,up,running,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc pfifo_fast index 2
txq 1 rxq 1 brd ff:ff:ff:ff:ff:ff state up mode default carrier up
wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
==== cache dump end ====
Ok here is the trailing state transition where wlan0 goes to hell with debug turned on. It almost looks like the netlink message that came back from the kernel is garbage. However, why would the kernel return this to us and yet the 'ip link list' command is happily displaying the correct information?
DBG<3> cache_mngr.c:393 int nl_cache_mngr_poll(struct nl_cache_mngr *, int): Cache manager 0x15a0520, poll() returned 1
DBG<2> cache_mngr.c:430 int nl_cache_mngr_data_ready(struct nl_cache_mngr *): Cache manager 0x15a0520, reading new data from fd 3
DBG<3> nl.c:823 int recvmsgs(struct nl_sock *, struct nl_cb *): Attempting to read from 0x15a0640
DBG<3> nl.c:832 int recvmsgs(struct nl_sock *, struct nl_cb *): recvmsgs(0x15a0640): Read 64 bytes
DBG<3> nl.c:836 int recvmsgs(struct nl_sock *, struct nl_cb *): recvmsgs(0x15a0640): Processing valid message...
DBG<2> msg.c:282 struct nl_msg *__nlmsg_alloc(size_t): msg 0x15a69e0: Allocated new message, maxlen=64
DBG<2> cache_mngr.c:79 int event_input(struct nl_msg *, void *): Cache manager 0x15a0520, handling new message 0x15a69e0 as event
-------------------------- BEGIN NETLINK MESSAGE ---------------------------
[NETLINK HEADER] 16 octets
.nlmsg_len = 64
.type = 16 <route/link::new>
.flags = 0 <>
.seq = 0
.port = 0
[PAYLOAD] 16 octets
00 00 01 00 03 00 00 00 43 10 01 00 00 00 00 00 ........C.......
[ATTR 03] 6 octets
77 6c 61 6e 30 00 wlan0.
[PADDING] 2 octets
00 00 ..
[ATTR 11] 16 octets
10 00 19 8b 00 00 00 00 00 00 00 00 00 00 00 00 ................
--------------------------- END NETLINK MESSAGE ---------------------------
DBG<2> cache_mngr.c:101 int event_input(struct nl_msg *, void *): Associated message 0x15a69e0 to cache 0x15a0960
DBG<4> object.c:72 struct nl_object *nl_object_alloc(struct nl_object_ops *): Allocated new object 0x15a7e70
DBG<2> cache_mngr.c:50 int include_cb(struct nl_object *, struct nl_parser_param *): Including object 0x15a7e70 into cache 0x15a0960
DBG<3> cache.c:396 void nl_cache_get(struct nl_cache *): Incremented cache 0x15a0960 <route/link> reference count to 3
wlan0 ether <broadcast,multicast,up,running,lowerup>
DBG<3> cache.c:416 void nl_cache_free(struct nl_cache *): Decremented cache 0x15a0960 <route/link> reference count, 2 remaining
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
DBG<4> object.c:208 void nl_object_get(struct nl_object *): New reference to object 0x15a7480, total 3
DBG<4> object.c:222 void nl_object_put(struct nl_object *): Returned object reference 0x15a7480, 2 remaining
DBG<4> object.c:222 void nl_object_put(struct nl_object *): Returned object reference 0x15a7480, 1 remaining
DBG<2> cache.c:572 void nl_cache_remove(struct nl_object *): Deleted object 0x15a7480 from cache 0x15a0960 <route/link>.
DBG<3> cache.c:529 int nl_cache_move(struct nl_cache *, struct nl_object *): Moving object 0x15a7e70 from cache (nil) to cache 0x15a0960
DBG<4> object.c:208 void nl_object_get(struct nl_object *): New reference to object 0x15a7e70, total 2
DBG<4> object.c:208 void nl_object_get(struct nl_object *): New reference to object 0x15a7e70, total 3
DBG<3> cache.c:452 int __cache_add(struct nl_cache *, struct nl_object *): Added object 0x15a7e70 to cache 0x15a0960 <route/link>, nitems 3
CHANGE wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
DBG<4> object.c:222 void nl_object_put(struct nl_object *): Returned object reference 0x15a7480, 0 remaining
DBG<4> object.c:188 void nl_object_free(struct nl_object *): Freed object 0x15a7480
DBG<4> object.c:222 void nl_object_put(struct nl_object *): Returned object reference 0x15a7e70, 2 remaining
DBG<4> msg.c:565 void nlmsg_free(struct nl_msg *): Returned message reference 0x15a69e0, 0 remaining
DBG<2> msg.c:572 void nlmsg_free(struct nl_msg *): msg 0x15a69e0: Freed
DBG<2> cache_mngr.c:440 int nl_cache_mngr_data_ready(struct nl_cache_mngr *): Cache manager 0x15a0520, recvmsgs read 1 messages
DBG<3> nl.c:823 int recvmsgs(struct nl_sock *, struct nl_cb *): Attempting to read from 0x15a0640
DBG<4> nl.c:709 int nl_recv(struct nl_sock *, struct sockaddr_nl *, unsigned char **, struct ucred **): nl_sendmsg(0x15a0640): nl_recv() failed with 11 (Resource temporarily unavailable)
==== cache dump start ====
DBG<2> cache.c:1173 void nl_cache_dump_filter(struct nl_cache *, struct nl_dump_params *, struct nl_object *): Dumping cache 0x15a0960 <route/link> with filter (nil)
DBG<4> cache.c:1192 void nl_cache_dump_filter(struct nl_cache *, struct nl_dump_params *, struct nl_object *): Dumping object 0x15a6f90...
lo loopback <loopback,up,running,lowerup> group 0
mtu 65536 txqlen 0 weight 0 qdisc noqueue index 1
txq 1 rxq 1 brd 00:00:00:00:00:00 mode default carrier up
DBG<4> cache.c:1192 void nl_cache_dump_filter(struct nl_cache *, struct nl_dump_params *, struct nl_object *): Dumping object 0x15a6b20...
eth0 ether 80:fa:5b:06:25:50 <broadcast,multicast,up,running,lowerup> group 0
mtu 1500 txqlen 1000 weight 0 qdisc pfifo_fast index 2
txq 1 rxq 1 brd ff:ff:ff:ff:ff:ff state up mode default carrier up
DBG<4> cache.c:1192 void nl_cache_dump_filter(struct nl_cache *, struct nl_dump_params *, struct nl_object *): Dumping object 0x15a7e70...
wlan0 ether <broadcast,multicast,up,running,lowerup>
mtu 0 txqlen 0 weight 0 index 3
mode default carrier down
==== cache dump end ====
DBG<3> cache_mngr.c:391 int nl_cache_mngr_poll(struct nl_cache_mngr *, int): Cache manager 0x15a0520, poll() fd 3
I also see these messages printed regularly. Not sure if this is indicative of a problem of if this is a normal condition when dealing with the netlink socket.
DBG<2> cache_mngr.c:440 int nl_cache_mngr_data_ready(struct nl_cache_mngr *): Cache manager 0x15a0520, recvmsgs read 1 messages
DBG<3> nl.c:823 int recvmsgs(struct nl_sock *, struct nl_cb *): Attempting to read from 0x15a0640
DBG<4> nl.c:709 int nl_recv(struct nl_sock *, struct sockaddr_nl *, unsigned char **, struct ucred **): nl_sendmsg(0x15a0640): nl_recv() failed with 11 (Resource temporarily unavailable)
cache-manager, whenever receiving a new object, replaces the object inside the cache with the new version: https://github.com/thom311/libnl/blob/c242e25c452e855d05609e57484296ecaf4763b0/lib/cache.c#L784
It seems that the object that kernel sends, is mostly empty. So, this works as implemented.
I think the solution is to implement nl_object_update()/oo_update() for "struct rtnl_link" and let it merge data from the new object into the current one.