New taskq stats
Motivation and Context
I keep trying to improve performance by looking for bottlenecks with external tools: I find something, fix it, and that thing goes away, but performance doesn't change, or changes in ways I don't expect. So I spend more and more time wondering what is even happening inside some of these black boxes. Eventually I decided I'd had enough, and made a start on bringing those mysteries out into the light.
So here's round 1: taskq stats. They're at the heart of so many things, and I want to know what they're up to.
Description
This gives every taskq three sets of stats (rough sketch below):
- gauges (up/down counts) of current state: thread count and activity, task count and activity
- counters of events since creation: threads created/destroyed, tasks dispatched/executed/delayed, thread wakes & sleeps
- min/max and rolling averages of time tasks have spent in each "stage" of the pipeline (enqueue/wait/dequeue/execute)
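For a sense of how these hang together, here is a rough, purely illustrative sketch of the per-taskq layout. The field names are hypothetical and plain uint64_t stands in for the wmsum_t counters mentioned in the design notes further down; treat it as a picture of the three groups, not the actual structure.

```c
/* Illustration only: hypothetical names, uint64_t in place of wmsum_t. */
#include <stdint.h>

typedef struct taskq_stats_sketch {
	/* gauges: current state, moved up and down as things change */
	uint64_t ts_threads_active;
	uint64_t ts_threads_idle;
	uint64_t ts_tasks_pending;

	/* counters: events since the taskq was created, only ever added to */
	uint64_t ts_tasks_dispatched;
	uint64_t ts_tasks_executed;
	uint64_t ts_thread_wakeups;

	/* per-stage times: min/max since creation plus rolling averages */
	uint64_t ts_wait_time_min;
	uint64_t ts_wait_time_max;
	uint64_t ts_wait_time_avg_10;	/* ...and _100, _1000, _10000 */
} taskq_stats_sketch_t;
```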
Each taskq gets its own kstat node in /proc/spl/kstat/taskq:
# ls /proc/spl/kstat/taskq
arc_prune.0 summary z_null_int.0 z_upgrade.0
dbu_evict.0 z_cl_int.0 z_null_iss.0 z_vdev_file.0
dp_sync_taskq.0 z_cl_iss.0 z_prefetch.0 z_wr_int.0
dp_zil_clean_taskq.0 z_flush_int.0 z_rd_int.0 z_wr_iss.0
spl_delay_taskq.0 z_flush_iss.0 z_rd_iss.0 z_zrele.0
spl_dynamic_taskq.0 z_fr_int.0 z_trim_int.0 z_zvol.0
spl_kmem_cache.0 z_fr_iss.0 z_trim_iss.0 zvol_tq-0.0
spl_system_taskq.0 z_metaslab.0 z_unlinked_drain.0
Within, we find all sorts of useful things:
$ cat /proc/spl/kstat/taskq/dp_sync_taskq.0
197 1 0x01 53 14416 1290432846656093 1291816178703629
name type data
threads_max 4 8
entry_pool_min 4 8
entry_pool_max 4 2147483647
threads_active 4 0
threads_idle 4 8
threads_total 4 8
tasks_pending 4 0
tasks_priority 4 0
tasks_total 4 0
tasks_delayed 4 0
entries_free 4 8
threads_created 4 8
threads_destroyed 4 0
tasks_dispatched 4 175270
tasks_dispatched_delayed 4 0
tasks_executed_normal 4 173948
tasks_executed_priority 4 1322
tasks_executed 4 175270
tasks_delayed_requeued 4 0
tasks_cancelled 4 0
thread_wakeups 4 126095
thread_wakeups_nowork 4 78498
thread_sleeps 4 126103
enqueue_time_min 4 0
enqueue_time_max 4 38
enqueue_time_avg_10 4 5
enqueue_time_avg_100 4 5
enqueue_time_avg_1000 4 4
enqueue_time_avg_10000 4 0
wait_time_min 4 0
wait_time_max 4 42347
wait_time_avg_10 4 2
wait_time_avg_100 4 10
wait_time_avg_1000 4 1487
wait_time_avg_10000 4 3102
dequeue_time_min 4 0
dequeue_time_max 4 45
dequeue_time_avg_10 4 0
dequeue_time_avg_100 4 0
dequeue_time_avg_1000 4 1
dequeue_time_avg_10000 4 0
execute_time_min 4 0
execute_time_max 4 44618
execute_time_avg_10 4 0
execute_time_avg_100 4 2
execute_time_avg_1000 4 207
execute_time_avg_10000 4 441
task_time_min 4 2
task_time_max 4 44694
task_time_avg_10 4 8
task_time_avg_100 4 18
task_time_avg_1000 4 1698
task_time_avg_10000 4 3544
The meaning of most of these should be self-explanatory. The first set (down to entries_free) is the current state; the second set (down to thread_sleeps) holds the event counters since the taskq was created.
The _time_ group is the most interesting, and probably the most controversial. Each entry (taskq_ent_t) now gets a timestamp at each "step" of its journey through the taskq:
- when the dispatch function is called
- when it is actually added to a queue
- when a thread starts taking an entry off the queue
- when the task in the entry is called
- when the task completes
The delta between each adjacent pair of steps is then computed and named: enqueue, wait, dequeue, execute.
For each of these, the min/max delta since the taskq was created is recorded, and the delta is also mixed into moving averages over the last 10, 100, 1000 and 10000 tasks. This makes it much easier to see the signs of, e.g., queue lock contention (higher average enqueue/dequeue times) or too few worker threads (higher average wait times).
A "total" time from dispatch to completion is also recorded in the same way.
Finally, a summary list is included, showing all taskqs in the system, with a few key values for each. The idea is to give a "quick glance" state of the world at any given moment, and invite further digging:
# cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 8886444295 786666941903
| threads (current) | task on queue (current) | task time (µs) (last 100 tasks)
taskq name | tot [act idl] max | pend [ norm high] dly | avg [enq wait deq exec]
-------------------- | ----------------- | ----------------------- | --------------------------------
z_unlinked_drain.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 0 [ 0 0 0 0]
z_zrele.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 0 [ 0 0 0 0]
dp_zil_clean_taskq.0 | 4 [ 0 4] 4 | 0 [ 0 0] 0 | 10 [ 4 4 0 1]
dp_sync_taskq.0 | 1 [ 0 1] 1 | 0 [ 0 0] 0 | 154 [ 2 44 0 108]
z_upgrade.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 87 [ 0 0 0 86]
z_prefetch.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 0 [ 0 0 0 0]
z_metaslab.0 | 2 [ 0 2] 2 | 0 [ 0 0] 0 | 30 [ 2 27 0 0]
...
This change also removes the old /proc/spl/taskq and taskq-all files.
Design and implementation notes
I'm aware of how busy taskqs get, and I've tried my best not to introduce additional bottlenecks. The gauge and counter stats are all regular wmsum_t types, and are only read if the kstats are actually read. The rolling averages are recorded once at the end of each task, and don't take the taskq lock to do so; concurrent updates can trample each other, but the averages aren't intended to be precise. I've thrown some of my performance test workloads at it (the ones I'm trying to improve) and these changes don't appear to move the needle, so I feel at least okay about it. It'd be good to get more testing. Importantly, reading the stats does not take the taskq lock either, since the stats are held separately from the taskq.
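For the counters, the intended pattern is roughly the sketch below: bump a wmsum_t on the hot path, and only fold the sums together in the kstat update callback when somebody actually reads the node. The function and field names here are hypothetical, and it leans on the SPL's wmsum/kstat interfaces rather than being a standalone program; it shows the pattern, not the PR's code.

```c
/* Sketch of the counter pattern; names are hypothetical. */
#include <sys/wmsum.h>
#include <sys/kstat.h>

typedef struct taskq_sums {
	wmsum_t	tqs_tasks_dispatched;
	wmsum_t	tqs_tasks_executed;
} taskq_sums_t;

/* hot path: wmsum_add() is designed to be cheap for frequent writers */
static void
taskq_stat_dispatched(taskq_sums_t *tqs)
{
	wmsum_add(&tqs->tqs_tasks_dispatched, 1);
}

/* cold path: only runs when the kstat node is read */
static int
taskq_kstats_update(kstat_t *ksp, int rw)
{
	taskq_sums_t *tqs = ksp->ks_private;
	kstat_named_t *kn = ksp->ks_data;

	if (rw == KSTAT_WRITE)
		return (EACCES);

	kn[0].value.ui64 = wmsum_value(&tqs->tqs_tasks_dispatched);
	kn[1].value.ui64 = wmsum_value(&tqs->tqs_tasks_executed);
	return (0);
}
```

The rolling averages are the one piece that isn't a wmsum: they're updated in place at the end of each task without taking the taskq lock, which is where the trampling mentioned above comes from.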
The time stats have been useful already (a couple of interesting quirks discovered which I'll chase down later). They still don't quite feel right to me. I'm not sure if min/max is ever going to be useful. The averages undoubtedly are useful, but "last N tasks" is only useful during sustained operation, which many taskqs don't see. I'm actually not very good at math (moving average is my only trick :sweat_smile:), so I'm hoping someone will say "oh, this is a perfect use for Numberwang!"
I'm not entirely sold on the summary list. So far, the "threads" column has been unnecessary and the "tasks on queue" column has been great (though I can take or leave "delay"). The averages column should be the most interesting, but there's no single granularity that works for all taskq types: high-throughput queues like z_wr and z_rd really need 10000, rarely-used taskqs might be better served by 10, and there's everything in between. I did think a bit about choosing the most likely range based on total tasks, but that's tricky without some notion of time, and taskqs don't have a heartbeat; I would prefer to keep it that way.
Ideas for other stats and presentation are welcome, as I would like this to be actually good. Yes, I am inviting bikeshedding :innocent:
Removing the old stats might not be wanted. I personally don't see the point in keeping them: they show some different things, but they're mostly about taskq configuration (flags etc.), which isn't especially useful to the reader and can't be seen anyway. When taskqs are busy, dumping their tasks is fairly pointless, and it doesn't seem to work well either. The old implementation is also pretty invasive, messing around inside the taskq itself. But if someone wants to keep them, it won't particularly hurt anything; it'll just be a mess.
Finally, I would like to do something similar for FreeBSD, but since we use a lot more of the native taskqueues there, it might take some effort to actually get anything useful. I wouldn't hold this PR up just for that, but if you know more about this I would like to hear about it.
How Has This Been Tested?
A lot by hand, including performance tests. A full ZTS run has been completed, but since it doesn't use anything from /proc/spl/taskq I wasn't expecting it to show much, and indeed it did not.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [x] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [ ] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [ ] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
A counter for fpu_enter/fpu_exit (or just more general critical section enter/exit) calls in the taskq might also be informative, as would if your taskq triggers kmem cache growth, and I think you still have the current task in the relevant pieces of code to do it. (The only fun part might be referencing that code from inside the spl code, like why you can't easily use dbgmsg from in there.)
@rincebrain
A counter for fpu_enter/fpu_exit (or just more general critical section enter/exit) calls in the taskq might also be informative, as would if your taskq triggers kmem cache growth, and I think you still have the current task in the relevant pieces of code to do it.
Hmm. I hadn't really thought much about counting what the task itself does; this was all about internal taskq behaviour.
But yes, in general I think it might be interesting to know when a task did something like block, sleep, be pre-empted, or otherwise make the thread unavailable to other tasks. I probably won't try it in this PR though.
(unless I'm completely misunderstanding you of course. If so, tell me, and say "well obviously you just tap your belt buckle thrice", and then I'll do the thing)
Conceptually, it didn't strike me as that different a thing to track, since you're monitoring stats about how much time the tasks spent enqueued/dequeuing/waiting/delayed, and I sometimes spend a bunch of time on that question, so "how many times / how long did I spend inside or waiting on fpu sections" and the like naturally interested me.
But no, you got the idea.
Last push removes the timing stats.
# cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 640838657 117862991354
| threads | tasks on queue
taskq name | tot [act idl] max | pend [ norm high] dly
-------------------- | ----------------- | -----------------------
z_vdev_file.0 | 1 [ 0 1] 16 | 0 [ 0 0] 0
dbu_evict.0 | 1 [ 0 1] 1 | 0 [ 0 0] 0
arc_prune.0 | 1 [ 0 1] 1 | 0 [ 0 0] 0
zvol_tq-0.0 | 1 [ 0 1] 32 | 0 [ 0 0] 0
spl_kmem_cache.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0
spl_dynamic_taskq.0 | 1 [ 0 1] 1 | 0 [ 0 0] 0
spl_delay_taskq.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0
spl_system_taskq.0 | 1 [ 0 1] 64 | 0 [ 0 0] 0
# cat /proc/spl/kstat/taskq/spl_kmem_cache.0
4 1 0x01 23 6256 640915734 240317768164
name type data
threads_max 4 4
entry_pool_min 4 32
entry_pool_max 4 2147483647
threads_active 4 0
threads_idle 4 1
threads_total 4 1
tasks_pending 4 0
tasks_priority 4 0
tasks_total 4 0
tasks_delayed 4 0
entries_free 4 32
threads_created 4 1
threads_destroyed 4 0
tasks_dispatched 4 969
tasks_dispatched_delayed 4 0
tasks_executed_normal 4 969
tasks_executed_priority 4 0
tasks_executed 4 969
tasks_delayed_requeued 4 0
tasks_cancelled 4 0
thread_wakeups 4 969
thread_wakeups_nowork 4 0
thread_sleeps 4 970
I'll keep working on measuring throughput in a future PR; this one I think is good to go.