liburing
Unexpected iou-wrk threads on tap device
To repro, see the code: https://gist.github.com/majek/152a719d13157749d1af279ce02ba954
One console:
$ bash r.sh
+ sudo ip tuntap del mode tap tap0
+ sudo ip tuntap add mode tap user marek group marek name tap0
+ sudo ip l set tap0 up
+ sudo ip a add 10.0.0.1/24 dev tap0
+ sudo ip neigh add 10.0.0.2 dev tap0 lladdr 00:11:22:33:44:55
+ gcc -g -Wall -Wextra tapwrkrepro.c -I ./liburing/src/include/ ./liburing/src/liburing.a
+ ./a.out
repro
Another console:
$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(14882)
Alarm clock
a.out(14882)─┬─{iou-wrk-14882}(14891)
├─{iou-wrk-14882}(14892)
├─{iou-wrk-14882}(14893)
├─{iou-wrk-14882}(14894)
...
Sometimes I need to run the python script twice to get the result.
Analysis
We open a fresh tap0, set it up, and add a static neighbor entry for 10.0.0.2 so ARP resolution is not needed. On the io_uring side we just recv2() the data from the tap. The tap is set up with GSO support.
The python script just sends large GSO UDP packets towards 10.0.0.2 (the tap).
I would expect no iou-wrk threads to be started.
Kernel: 6.4.0-rc5+, fresh liburing.
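For reference, the submission pattern boils down to roughly the following (an illustrative sketch, not the actual gist code; tap_fd, BUF_SZ and BUFFERS_NUM stand in for whatever the gist uses):

#include <liburing.h>
#include <error.h>
#include <stdint.h>

#define BUF_SZ      65536
#define BUFFERS_NUM 61	/* illustrative */

/* Sketch only: queue many reads against the same tap fd up front, then
 * wait for completions one at a time and re-arm each buffer. tap_fd is
 * assumed to be an already-configured tap0 file descriptor. */
static void run_repro_loop(int tap_fd)
{
	static char bufs[BUFFERS_NUM][BUF_SZ];
	struct io_uring ring;
	int r;

	r = io_uring_queue_init(BUFFERS_NUM * 2, &ring, 0);
	if (r != 0)
		error(1, -r, "io_uring_queue_init");

	for (int i = 0; i < BUFFERS_NUM; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_read(sqe, tap_fd, bufs[i], BUF_SZ, 0);
		io_uring_sqe_set_data64(sqe, i);
	}
	io_uring_submit(&ring);

	for (;;) {
		struct io_uring_cqe *cqe;
		struct io_uring_sqe *sqe;
		uint64_t idx;

		r = io_uring_wait_cqe(&ring, &cqe);
		if (r != 0)
			error(1, -r, "io_uring_wait_cqe");
		idx = io_uring_cqe_get_data64(cqe);
		io_uring_cqe_seen(&ring, cqe);

		/* Re-arm the same buffer with a fresh read. */
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, tap_fd, bufs[idx], BUF_SZ, 0);
		io_uring_sqe_set_data64(sqe, idx);
		io_uring_submit(&ring);
	}
}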
I'll take a look! Note that from just glancing at your code, you should not be using errno anywhere. The liburing functions return -Exxxx directly. Eg:
struct io_uring ring;
r = io_uring_queue_init(buffers_num * 2, &ring, 0);
if (r != 0) {
	error(-1, errno, "io_uring_queue_init");
}
Here 'r' would be -EINVAL, for example, and errno would be whatever garbage someone last set. Not really related to the issue at hand, of course, just bringing it up.
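For illustration, the same check using the return value directly could look like this (error() being the glibc <error.h> helper the snippet already uses):

r = io_uring_queue_init(buffers_num * 2, &ring, 0);
if (r != 0) {
	/* r is the negative errno value, e.g. -EINVAL; hand -r to error(). */
	error(1, -r, "io_uring_queue_init");
}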
The program just seems to sit there, and looking at the traces, I see 4 completions:
a.out-760 [006] ..... 41.421959: io_uring_complete: ring 000000000a503dc9, req 00000000f78de966, user_data 0x0, result 90, cflags 0x0 extra1 0 extra2 0
a.out-760 [006] ..... 41.814024: io_uring_complete: ring 000000000a503dc9, req 000000005f48ce1e, user_data 0x1, result 90, cflags 0x0 extra1 0 extra2 0
a.out-760 [006] ..... 60.309103: io_uring_complete: ring 000000000a503dc9, req 00000000a776d355, user_data 0x2, result 70, cflags 0x0 extra1 0 extra2 0
a.out-760 [006] ..... 97.172334: io_uring_complete: ring 000000000a503dc9, req 00000000ad5c2465, user_data 0x3, result 70, cflags 0x0 extra1 0 extra2 0
Edit: and at this point, no async queueing or io-wq creation has happened. Edit2: ok I should read the full thing first...
I remembered to run the python side too. In one run I saw one io-wq worker (though that looked like it was off a cancel?), and for the rest I didn't see any io-wq queueing or activity. That's around 8650 completions, ala:
c8a269, req 000000001c161cfa, user_data 0x2e, result 65042, cflags 0x0 extra1 0 extra2 0
a.out-896 [004] ..... 400.931590: io_uring_complete: ring 000000003bc8a269, req 000000001c161cfa, user_data 0x2f, result 65042, cflags 0x0 extra1 0 extra2 0
a.out-896 [004] ..... 400.931597: io_uring_complete: ring 000000003bc8a269, req 000000001c161cfa, user_data 0x30, result 65042, cflags 0x0 extra1 0 extra2 0
The one spot where you might see it happening is if we get spuriously poll-woken many times without any readiness. At some point io_uring will give up, punt to a worker, and do blocking IO at that point. With a lot of concurrent readers, perhaps this is what is happening? I ran it a few more times and did not see any io-wq activity; full trace:
root@m1max-kvm:/sys/kernel/debug/tracing# cat trace|grep -c result
118535
root@m1max-kvm:/sys/kernel/debug/tracing# cat trace|grep -c submit
118596
root@m1max-kvm:/sys/kernel/debug/tracing# cat trace|grep -c async
0
Let me try and see if it's the spurious wakeup side. Here are the runs I just did:
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(869)
Alarm clock
a.out(869)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(883)
Alarm clock
a.out(883)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(896)
Alarm clock
a.out(896)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(932)
Alarm clock
a.out(932)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(949)
Alarm clock
a.out(949)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(949)
Alarm clock
a.out(949)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(949)
Alarm clock
a.out(949)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(949)
Alarm clock
a.out(949)
axboe@m1max-kvm:~/tap$ pstree -pt `pidof a.out`; python3 sendlong.py ; pstree -pt `pidof a.out`
a.out(949)
Alarm clock
a.out(949)─┬─{iou-wrk-949}(999)
├─{iou-wrk-949}(1000)
├─{iou-wrk-949}(1001)
└─{iou-wrk-949}(1002)
Yep, it is the max retry limit; this is why it ends up going to io-wq. Let me poke a bit deeper... It seems to me like we try to re-arm before one has been processed, so we use up the retries way quicker than we should.
If you have time, can you try and apply this patch to your current kernel and see if that changes it for you?
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 6b9179e8228e..ac0d01c3be0c 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -526,7 +526,7 @@ static void __io_queue_proc(struct io_poll *poll, struct io_poll_table *pt,
 	poll->head = head;
 	poll->wait.private = (void *) wqe_private;
 
-	if (poll->events & EPOLLEXCLUSIVE)
+	if (1 || poll->events & EPOLLEXCLUSIVE)
 		add_wait_queue_exclusive(head, &poll->wait);
 	else
 		add_wait_queue(head, &poll->wait);
@axboe the suggested patch fixes the issue.
Thanks for testing. The problem is that you have 61 requests pending for the same tap file. Initially they all end up polling the device, as there's no data there. Now some data comes in, and everybody waiting on POLLIN gets woken and retries. One of them will succeed, the others will go back to waiting on poll. If any single request gets woken more than X times without being able to get data, it'll eventually fall back to a sync read, as it thinks poll is faulty.
This would not happen if you had separate tap devices, or had fewer of them waiting. While we could improve this situation, it also seems a bit contrived to me. For any normal setup, I would expect to just have a single request pending for that specific file/tap device.
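Roughly, the retry-and-fallback behavior described above can be sketched as follows (a conceptual model only, not the actual io_uring/poll.c code; the names and the limit value are illustrative):

/* Conceptual sketch of the poll retry/fallback behavior, not kernel code. */
enum poll_action {
	POLL_RETRY_READ,	/* data may be ready: retry the read non-blocking */
	POLL_REARM,		/* no data: go back onto the wait queue */
	POLL_PUNT_TO_IOWQ,	/* too many fruitless wakeups: blocking read in io-wq */
};

#define POLL_WAKE_LIMIT 30	/* illustrative value only */

struct pending_read {
	unsigned int fruitless_wakeups;
};

static enum poll_action on_poll_wakeup(struct pending_read *req, int data_available)
{
	if (data_available) {
		req->fruitless_wakeups = 0;
		return POLL_RETRY_READ;
	}
	/* Every wakeup that yields no data burns one retry; with many requests
	 * pending on the same tap, each incoming packet wakes all of them but
	 * feeds only one, so the limit is reached quickly. */
	if (++req->fruitless_wakeups > POLL_WAKE_LIMIT)
		return POLL_PUNT_TO_IOWQ;
	return POLL_REARM;
}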
Assuming a single read request waiting on the tap voids the "io_uring gives you batching" argument. It's entirely possible for more than one packet to be delivered onto the tap in one go with NAPI or GSO, so from an application developer's point of view it absolutely makes sense to have more than one buffer outstanding awaiting data from the tap. That way you can, at least in theory, get data off the tap faster.
Would it be possible to order the SQE reads with IOSQE_IO_LINK?
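For reference, chaining the reads might look roughly like this (a sketch only; tap_fd, NBUFS and BUF_SZ are placeholders, and whether linking actually helps here is the open question):

#include <liburing.h>
#include <errno.h>

#define BUF_SZ 65536
#define NBUFS  4

/* Sketch: chain read SQEs with IOSQE_IO_LINK so they are executed in
 * order, one after the other. tap_fd and bufs are placeholders. */
static int submit_linked_reads(struct io_uring *ring, int tap_fd,
			       char bufs[NBUFS][BUF_SZ])
{
	for (unsigned i = 0; i < NBUFS; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -ENOMEM;
		io_uring_prep_read(sqe, tap_fd, bufs[i], BUF_SZ, 0);
		io_uring_sqe_set_data64(sqe, i);
		/* Link each read to the next; the last one ends the chain. */
		if (i + 1 < NBUFS)
			io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
	}
	return io_uring_submit(ring);
}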
I don't disagree with you, but I think you'll find it's going to be more efficient if you limit the number of requests in flight to e.g. 2 or 4. For your test case at least, once the initial requests are issued, you do submit_and_wait() for each one after that. No batching is happening there, outside of doing an issue + wait in a single syscall (which is indeed useful!).
This is not just because you would avoid io-wq by doing less; you'd also get fewer poll wakeup iterations, which I suspect cost some cycles.
Ideally, if you have N requests pending on the same poll queue, we'd do them one by one on wakeup and stop when there's no more data to read. But that's a bigger project, though definitely an interesting one, and I might take a look at that at some point.
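To illustrate the userspace side of the advice above (keep only a few reads in flight and reap every available completion per wakeup), here is a rough sketch; handle_packet() and queue_tap_read() are hypothetical helpers, not functions from the gist or from liburing:

#include <liburing.h>
#include <stdint.h>

/* Hypothetical helpers, assumed to be defined elsewhere. */
void handle_packet(int res, uint64_t buf_idx);
void queue_tap_read(struct io_uring *ring, int tap_fd, uint64_t buf_idx);

/* Sketch: one submit_and_wait() syscall per batch, then drain every CQE
 * that is already available without further syscalls. */
static void reap_batch(struct io_uring *ring, int tap_fd)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;

	/* Submit any re-armed reads and wait for at least one completion. */
	io_uring_submit_and_wait(ring, 1);

	io_uring_for_each_cqe(ring, head, cqe) {
		uint64_t idx = io_uring_cqe_get_data64(cqe);

		handle_packet(cqe->res, idx);
		queue_tap_read(ring, tap_fd, idx);	/* re-arm this buffer */
		seen++;
	}
	io_uring_cq_advance(ring, seen);
}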