
io-wait used 100% cpu after 6.1.39 and 6.5.1

Open beldzhang opened this issue 2 years ago • 23 comments

After this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.1.39&id=f32dfc802e8733028088edf54499d5669cb0ef69 a running io_uring object causes one CPU to show 100% io-wait usage; in my environment 8 rings/threads make 8 CPUs show 100% io-wait. Apart from that, nothing else looks wrong.

Reverting this commit from 6.1.39 makes everything OK again. 6.1.51/6.5.1 behave the same, but the commit cannot be reverted cleanly there, so that is not tested yet.

All of the following tests got the same performance result: 6.1.38, 6.1.39, 6.1.39 (reverted), 6.1.51.

To reproduce: use this test program: https://github.com/axboe/liburing/files/9571382/issue-643-v2.zip, build it and just run the server; it will show 100% io-wait on one CPU.
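For reference, a minimal sketch of the pattern that triggers the reporting (this is not the attached issue-643-v2 server, just an illustration): a thread blocking in io_uring_wait_cqe(). On kernels that contain the commit above, the time the thread sleeps here is accounted as io-wait.

```c
/* Minimal sketch, not the attached test program: one ring whose thread
 * just blocks waiting for completions. On affected kernels the time spent
 * sleeping here is accounted as io-wait, so top/mpstat show ~100% iowait
 * on that CPU. */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    int ret;

    ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    /* Nothing is ever submitted, so this wait never completes; the thread
     * sleeps in the kernel until interrupted by a signal. */
    ret = io_uring_wait_cqe(&ring, &cqe);
    fprintf(stderr, "wait_cqe returned %d\n", ret);

    io_uring_queue_exit(&ring);
    return 0;
}
```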

beldzhang avatar Sep 04 '23 19:09 beldzhang

Isn't IO wait time what you want to see? That is the time the app is blocked waiting on IO (at least that was the case for file IO) while the CPU has nothing else to do, so that time is accounted as IO wait.

redbaron avatar Sep 05 '23 10:09 redbaron

Note that io-wait doesn't burn CPU cycles; the task is sleeping, so it's not a problem apart from reporting. There was a change doing that, and I think it actually makes more sense to report a task waiting for io_uring completions as io-wait.

isilence avatar Sep 05 '23 14:09 isilence

Yes this is expected. iowait literally just means "waiting on IO", which is what the task is doing. It does NOT mean that it's busy 100% of the time, in fact if you have 100% iowait it means 1 task is sleeping waiting on IO 100% of the time.
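For anyone who wants to watch the raw counter rather than top's percentage: the reported iowait is just the fifth field of the aggregate `cpu` line in /proc/stat, in USER_HZ ticks. A minimal sketch, assuming the standard /proc/stat layout:

```c
/* Sketch: read the aggregate iowait counter (USER_HZ ticks) from
 * /proc/stat; this is the raw number that top/mpstat turn into a percentage. */
#include <stdio.h>

int main(void)
{
    unsigned long long user, nice, sys, idle, iowait;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return 1;
    /* First line: "cpu  user nice system idle iowait irq softirq ..." */
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait) == 5)
        printf("iowait ticks so far: %llu\n", iowait);
    fclose(f);
    return 0;
}
```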

axboe avatar Sep 05 '23 15:09 axboe

> Yes this is expected. iowait literally just means "waiting on IO", which is what the task is doing. It does NOT mean that it's busy 100% of the time, in fact if you have 100% iowait it means 1 task is sleeping waiting on IO 100% of the time.

Noticed that when this happens, system load is only 0.0x.

We are using io_uring in a storage service, so we are very sensitive to the storage/network load, and previously io-wait was a good indicator for that. Before kernel 5.0 iostat's %util value could be used as well, but after 5.0 that number easily goes up to 100% even under a small load, and many articles say it is no longer reliable. Is there any other way to check the disk load?

I also calculate a performance score after each test based on total CPU usage; just ignoring the io-wait part doesn't look like a good solution...

beldzhang avatar Sep 05 '23 20:09 beldzhang

Looks like there are no more comments, so closing.

beldzhang avatar Sep 13 '23 01:09 beldzhang

@axboe, one of Dragonfly's users also reported this as a behavioural change: Dragonfly, which does not use disk IO, bumps the IOWAIT metric up to 100%. If they run it using the epoll API, it does not affect IOWAIT. I am just double-checking whether this change is indeed intended.

romange avatar Nov 16 '23 07:11 romange

I see the same issue with kernel 6.5.11 (on Proxmox).

rickytato avatar Nov 25 '23 16:11 rickytato

emmm... reopen?

beldzhang avatar Dec 05 '23 07:12 beldzhang

IOwait has been traditionally thought of as "waiting for disk io", which will always complete. Since io_uring can be used to wait on the network, which has unbounded waiting time, it changes the metric considerably. For example, many monitoring systems have alerts for iowait being high, correctly or not assuming it to be a proxy for disk contention.

RX14 avatar Feb 22 '24 16:02 RX14

Here's what I think we should do:

  1. Default to not using iowait, as it is indeed somewhat confusing for networked or mixed network/storage workloads. I do think iowait is an awful metric that makes very little sense for async workloads, even just pure storage based ones. Lots of consumers will assume it's busy time, or has a direct correlation with disk usage, which is just wrong.
  2. Add an IORING_ENTER_IOWAIT flag that can be used in conjunction with IORING_ENTER_GETEVENTS. If set, iowait will be used. Storage can use this, if they so wish.
  3. Add an IORING_FEAT_IOWAIT flag, which tells the app/liburing that this feature is available.
  4. Add liburing helpers, ala io_uring_set_iowait() and io_uring_clear_iowait(), which can be used to toggle this flag, if IORING_FEAT_IOWAIT is set. Storage based workloads can set this.

And that should be it. That gives the app control over whether iowait should be used or not.
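A hedged sketch of how an app might use that, assuming the plan above: note that IORING_ENTER_IOWAIT, IORING_FEAT_IOWAIT, io_uring_set_iowait() and io_uring_clear_iowait() are only names proposed in this comment and, per the later comments in this thread, were never merged, so this does not compile against any released kernel or liburing.

```c
/* Hypothetical usage of the API proposed above. IORING_FEAT_IOWAIT,
 * io_uring_set_iowait() and io_uring_clear_iowait() were only proposed
 * in this issue and never shipped. */
#include <liburing.h>
#include <stdbool.h>

static void configure_iowait(struct io_uring *ring, bool storage_workload)
{
    /* The feature bit would tell us the kernel honours IORING_ENTER_IOWAIT. */
    if (!(ring->features & IORING_FEAT_IOWAIT))
        return;                      /* old kernel: nothing to toggle */

    if (storage_workload)
        io_uring_set_iowait(ring);   /* opt back into iowait accounting */
    else
        io_uring_clear_iowait(ring); /* default: waits not counted as iowait */
}
```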

axboe avatar Feb 22 '24 17:02 axboe

Will test when available.

beldzhang avatar Feb 22 '24 19:02 beldzhang

@axboe, I just mentioned it on the mailing list: even though I don't understand why people are taken aback by high iowait from io_uring waiting, I think we should just revert that change; there have been too many reports from different people about this one. We should be able to do the optimisation that was the reason for the change without reporting iowait.

isilence avatar Feb 24 '24 15:02 isilence

We can't just revert it, as it solved a real problem. I have my doubts that we can separate the cpufreq side from iowait in a way that would make the scheduler side happy. If we can, I'd be all for it, and would love to see a patch.

axboe avatar Feb 24 '24 18:02 axboe

A great discussion about this topic on lore.kernel.org.

Just so we understand: once we call io_uring_register_iowait, it will flag networking I/O as iowait, but io_uring will run in a more efficient manner?

Another interesting comment I read is about multiple rings. Currently https://github.com/romange/helio has a ring-per-thread architecture. @axboe, are you saying that sometimes it makes sense to have two rings? For what use cases does that make sense?

romange avatar Mar 15 '24 07:03 romange

> A great discussion about this topic on lore.kernel.org.
>
> Just so we understand: once we call io_uring_register_iowait, it will flag networking I/O as iowait, but io_uring will run in a more efficient manner?

The long story: there has been a patch upstream for a while which does two unrelated things: it enables a CPU governor optimisation useful for QD1 (and not only QD1), and it also changes the io-wait stat behaviour as per this thread. They're coupled together for implementation reasons, because it's much easier that way. So the optimisation is already in the kernel and always enabled; let's say it's a free lunch. Now, that io_uring_register_iowait() patch would disable the optimisation by default and turn it back on only if you call the function.

I have to say it's quite a horrendous approach: side effects from something that is seemingly just an optimisation, mixing responsibilities and the levels at which the feature is enabled and the iowait stat is observed, and so on. I think the register_iowait patch should never see the light of day, at least as long as it mixes these things together.

isilence avatar Mar 15 '24 14:03 isilence

> Another interesting comment I read is about multiple rings. Currently https://github.com/romange/helio has a ring-per-thread architecture. @axboe, are you saying that sometimes it makes sense to have two rings? For what use cases does that make sense?

IMHO it doesn't make sense, apart maybe from some weird IOPOLL + normal ring cases. However, sometimes it happens (unfortunately), for instance when a library/framework you use has some io_uring support inside, and then the app creates another ring for its own purposes.
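For illustration, a minimal sketch of that IOPOLL + normal ring case, assuming the polled ring is only used for O_DIRECT storage I/O (error paths trimmed):

```c
/* Sketch of the "two rings in one thread" case: a polled ring for storage
 * (completions reaped by actively polling the device, requires O_DIRECT
 * files) plus a normal interrupt-driven ring for network/other requests. */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct io_uring poll_ring, net_ring;
    int ret;

    /* IOPOLL ring for storage reads/writes. */
    ret = io_uring_queue_init(64, &poll_ring, IORING_SETUP_IOPOLL);
    if (ret < 0) {
        fprintf(stderr, "iopoll ring: %s\n", strerror(-ret));
        return 1;
    }

    /* Plain ring for sockets, timeouts, etc. */
    ret = io_uring_queue_init(64, &net_ring, 0);
    if (ret < 0) {
        fprintf(stderr, "normal ring: %s\n", strerror(-ret));
        io_uring_queue_exit(&poll_ring);
        return 1;
    }

    /* ... submit storage I/O on poll_ring, network ops on net_ring ... */

    io_uring_queue_exit(&net_ring);
    io_uring_queue_exit(&poll_ring);
    return 0;
}
```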

isilence avatar Mar 15 '24 14:03 isilence

There will be no register_iowait; the current pending fixes are here:

https://git.kernel.dk/cgit/linux/log/?h=iowait.2

and will be posted for review soon, so they can get into the 6.10 kernel.

axboe avatar Mar 15 '24 15:03 axboe

> https://git.kernel.dk/cgit/linux/log/?h=iowait.2

@axboe briefly tested, io-wait is gone, will keep following up. I'm following the mailing list too.

@isilence end users are sensitive to latency and high server response times, and for a sysadmin, io-wait and load directly show what is going on. Generally the storage part is the slowest in the whole system; users/admins don't care whether the waiting on an IO read/write is sync or async, they just want to know how loaded the entire server is. I already removed the iostat %util display because it's meaningless nowadays, but 100% io-wait on a CPU from io_uring terrifies a lot of users/admins.

beldzhang avatar Mar 16 '24 15:03 beldzhang

Ready for testing? Which branch: for-6.10/io_uring, for-6.10/block, or for-next? Thanks.

beldzhang avatar Apr 19 '24 03:04 beldzhang

emmm.... any updates?

beldzhang avatar Jun 13 '24 15:06 beldzhang

still happening for us too

solarvm avatar Jul 22 '24 11:07 solarvm

Nothing has been merged yet, as it's a low-priority reporting issue. However, there is interest in it for some other reasons, and it's in the backlog; it will hopefully get picked up soon.

isilence avatar Jul 22 '24 12:07 isilence