htop too slow on large server
When I run htop on a server with around 200 cores and 2 TB of memory (an H100x8 server), it is extremely slow: it shows a black screen for about 60 s before anything appears. I tried version 3.0.3 as well as the latest master built from source, and I tried disabling the patch from issue #1484, but it didn't help.
The server has 383K threads, which is a lot ;)
Is there anything I can do to debug the slowness?
Can you press Shift+H (Toggle userland thread display) or Shift+K (Toggle kernel thread display) and check if this helps?
Disabling the "library size" column mentioned in #1484 may also help (basically, anything that avoids reading the maps file), as should disabling the check for outdated binaries.
Finally, with that many processes, I'm not quite sure how well the tree sorting performs; you can toggle it by pressing T if necessary.
If these steps don't help, it would be nice to take a look at a flame graph captured with perf or some other tool like callgrind.
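A minimal sketch for capturing such a profile (assuming perf is available and htop is already running; the duration is arbitrary):
$ perf record -g -p "$(pidof htop)" -- sleep 30
$ perf report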
Shift+H helped the most; the refresh interval dropped to about 6 seconds. It is still very laggy and not pleasant to use, especially because of the long wait at startup. I could not find the "library size" column; I did disable the outdated-binaries feature.
I was not using tree sort by default, and enabling it did not change much. htop now uses around 50+% CPU of one core, but it is still quite laggy. I captured a perf profile, but it looks like it has no debug symbols.
Anything else I can try? And how can I start htop with the Shift+H mode already enabled?
Without debug symbols the perf data is hard to work with. Could you try compiling htop from source?
$ ./autogen.sh
$ ./configure CFLAGS="-Og -g"
$ make
You can then start ./htop from the build directory. More details on the build can be found in the README.
Based on your reported success with Shift+H, this hints at the sheer number of threads (and processes) taking a long time to read. Once userland thread display is disabled, just quit htop with q and the setting is saved to the configuration; the long startup delay should usually improve as well.
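To start htop with threads already hidden, the same toggles can also be set directly in the configuration file; a sketch, assuming the default location ~/.config/htop/htoprc (key names may differ between versions):
# in ~/.config/htop/htoprc
hide_userland_threads=1
hide_kernel_threads=1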
That initial blank screen comes from htop needing to refresh the process list twice.
Our htop is also very slow. In terms of Ubuntu releases, I started noticing it on Ubuntu 24.04 (which ships htop 3.3.0) versus 22.04 (htop 3.0.5). It seems directly related to the number of processes. We unfortunately have servers with thousands of processes, which makes the list extremely slow even when the server is doing nothing; htop itself is then always at the top of the list, at close to 100% CPU.
Hiding threads is merely a workaround when the list is long because of threads rather than processes.
The slowness does not happen as non-root; it's fast then.
perf top as root:
20,28% [kernel] [k] mangle_path
12,72% [kernel] [k] seq_put_hex_ll
5,57% [kernel] [k] prepend_path
4,37% [kernel] [k] m_next
3,51% [kernel] [k] strchr
3,32% [kernel] [k] show_map_vma
3,23% [kernel] [k] show_vma_header_prefix
3,21% [kernel] [k] prepend
2,21% [kernel] [k] seq_putc
1,68% [kernel] [k] copy_from_kernel_nofault
1,61% [kernel] [k] seq_read_iter
1,43% [kernel] [k] d_path
1,42% [kernel] [k] srso_alias_safe_ret
1,21% [kernel] [k] prepend_copy
1,09% [kernel] [k] seq_path
1,09% [kernel] [k] seq_pad
0,98% [kernel] [k] __fdget_pos
0,94% [kernel] [k] memset_orig
0,90% [kernel] [k] num_to_str
0,86% [kernel] [k] mas_find
0,82% [kernel] [k] mas_next_slot
0,77% [kernel] [k] seq_put_decimal
These are all kernel functions, so I imagine you can guess where this comes from when running as root (despite the missing debug symbols)?
perf top as non-root:
4,23% htop [.] Process_compare
2,52% [kernel] [k] pid_revalidate
2,38% [kernel] [k] entry_SYSCALL_64
2,21% [kernel] [k] seq_put_decimal_ull_width
1,93% [kernel] [k] memset_orig
1,79% [kernel] [k] num_to_str
1,75% [kernel] [k] srso_alias_safe_ret
1,65% [kernel] [k] vsnprintf
1,64% [kernel] [k] __fdget_pos
1,57% htop [.] Process_compareByKey_Base
1,57% [kernel] [k] kmem_cache_alloc
1,40% [kernel] [k] format_decode
1,40% [kernel] [k] dput
1,38% [kernel] [k] seq_puts
1,37% [kernel] [k] __d_lookup_rcu
1,28% [kernel] [k] seq_put_decimal_ull
1,24% [kernel] [k] _raw_spin_lock
1,19% [kernel] [k] file_close_fd
1,16% [kernel] [k] task_dump_owner
1,15% [kernel] [k] number
1,15% [kernel] [k] apparmor_inode_getattr
1,09% [kernel] [k] render_sigset_t
1,03% htop [.] 0x00000000000436f7
1,01% [kernel] [k] syscall_exit_to_user_mode
0,97% [kernel] [k] __fput
0,93% [kernel] [k] put_dec_trunc8
0,92% [kernel] [k] do_syscall_64
0,91% [kernel] [k] task_state
0,89% [kernel] [k] hook_file_open
0,87% [kernel] [k] __cond_resched
0,87% [kernel] [k] seq_read_iter
0,76% [kernel] [k] x64_sys_call
BTW: btop implements lazy sorting, which may also be helpful for large process lists: if it detects that it has spent more than x ms on sorting, it aborts the sort. Large process lists have been making htop slow for a long time.
Thank you for these perf traces and the hint about the root/non-root performance difference. On their own they only give a very broad sense of where the slowness may arise. The function names point at some of the additional files we parse as root, but they also suggest the cost is on the kernel side, in generating those files in procfs, rather than in htop itself, as htop doesn't use the libc implementations for string-to-integer parsing for most files. In addition, the names in the root-user dump point at converting numbers to strings, i.e. the opposite direction of what htop mostly needs to do.
Can you take a look at generating flame graphs of htop scanning for processes intermixed with the kernel stacks? TIA.
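For the flame graph itself, one common approach uses the FlameGraph scripts from https://github.com/brendangregg/FlameGraph; a sketch, assuming those scripts sit in the current directory:
$ perf record -F 99 -g -p "$(pidof htop)" -- sleep 60
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > htop.svg
Since perf records the full call stacks, the kernel frames of htop's syscalls appear in the same stacks as its userspace frames.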
Can you reproduce it with:
#!/bin/bash
pids=()
sleep_me()
{
    sleep 300
}
# spawn a few thousand background sleepers to inflate the process list
for s in $(seq 0 5000); do
    sleep_me & pids+=($!)
done
# wait for all of them before exiting
for p in "${pids[@]}"; do
    echo "Waiting for pid $p"
    wait "$p"
done
If so, that would be easier to debug.
With your forkbomb script running:
Samples: 150K of event 'cycles', 4000 Hz, Event count (approx.): 26606134128 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
15,90% htop [.] Hashtable_isConsistent
7,40% htop [.] Vector_indexOf
5,77% [kernel] [k] mangle_path
4,64% htop [.] Row_idEqualCompare
3,68% [kernel] [k] seq_put_hex_ll
2,77% [kernel] [k] show_map_vma
1,99% [kernel] [k] show_map
1,90% [kernel] [k] pid_revalidate
1,74% [kernel] [k] show_vma_header_prefix
1,71% htop [.] LinuxProcessTable_recurseProcTree.isra.0
1,64% [kernel] [k] srso_alias_return_thunk
1,59% [kernel] [k] srso_alias_safe_ret
1,40% [kernel] [k] prepend_path
1,16% [kernel] [k] seq_read_iter
1,14% [kernel] [k] filldir64
0,94% [kernel] [k] do_task_stat
0,88% htop [.] Hashtable_get
...
@fasterit Hashtable_isConsistent only exists under #ifndef NDEBUG - worth trying a non-debug build here?
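For reference, a non-debug build can be produced by defining NDEBUG, which compiles out those consistency checks; a sketch along the lines of the earlier build instructions:
$ ./configure CFLAGS="-O2 -g -DNDEBUG"
$ make clean && make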
I wonder where the mangle_path stuff gets called from in the kernel. A wild guess would be something iterating directories, which would also fit with filldir64.
Another observation is the presence of show_map and show_map_vma, which may hint at the memory map reading code.
One more reason for a full flamegraph of this.
Thank you @natoscott. This is what a non-debug build gets:
Samples: 101K of event 'cycles', 4000 Hz, Event count (approx.): 18467791692 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
8,79% [kernel] [k] mangle_path
5,42% [kernel] [k] seq_put_hex_ll
4,15% [kernel] [k] show_map_vma
3,17% [kernel] [k] pid_revalidate
3,13% [kernel] [k] show_map
2,99% [kernel] [k] show_vma_header_prefix
2,77% htop [.] LinuxProcessTable_recurseProcTree.isra.0
2,32% [kernel] [k] srso_alias_return_thunk
2,30% [kernel] [k] prepend_path
2,29% [kernel] [k] srso_alias_safe_ret
1,87% [kernel] [k] seq_read_iter
1,71% [kernel] [k] filldir64
1,49% [kernel] [k] do_task_stat
1,42% [kernel] [k] seq_put_decimal_ull_width
1,36% [kernel] [k] num_to_str
1,32% [kernel] [k] copy_from_kernel_nofault
1,25% [kernel] [k] entry_SYSCALL_64
1,23% [kernel] [k] strchr
0,93% [kernel] [k] path_openat
0,93% [kernel] [k] try_to_unlazy
0,90% [kernel] [k] kmem_cache_alloc_noprof
0,90% [kernel] [k] prepend_copy
0,87% [kernel] [k] __legitimize_path
0,83% [kernel] [k] tomoyo_check_open_permission
0,81% [kernel] [k] syscall_exit_to_user_mode
0,81% [kernel] [k] __memcg_slab_post_alloc_hook
0,80% [kernel] [k] __memset
0,79% [kernel] [k] do_syscall_64
0,74% [kernel] [k] __fput
0,73% [kernel] [k] __memcg_slab_free_hook
0,72% [kernel] [k] d_path
0,71% [kernel] [k] m_next
0,67% [kernel] [k] entry_SYSCALL_64_after_hwframe
0,65% libc.so.6 [.] __vfscanf_internal
0,61% [kernel] [k] __check_object_size
0,60% [kernel] [k] m_stop
0,59% [kernel] [k] _copy_to_user
0,58% [kernel] [k] seq_path
0,58% [kernel] [k] put_dec_trunc8
0,56% [kernel] [k] do_dentry_open
0,55% [kernel] [k] seq_putc
0,54% [kernel] [k] dput
0,52% [kernel] [k] obj_cgroup_charge
0,50% [kernel] [k] kmem_cache_free
0,47% [kernel] [k] put_dec
0,43% [kernel] [k] do_sys_openat2
0,43% [kernel] [k] mas_find
0,42% [kernel] [k] _copy_to_iter
0,42% [kernel] [k] vfs_read
0,39% [kernel] [k] process_measurement
0,37% [kernel] [k] m_start
0,37% [kernel] [k] mas_next_slot
0,37% [kernel] [k] alloc_fd
0,36% [kernel] [k] seq_pad
0,36% [kernel] [k] number
0,35% [kernel] [k] mod_objcg_state
0,35% [kernel] [k] get_pid_task
0,34% [kernel] [k] security_file_open
0,32% [kernel] [k] security_file_alloc
0,32% [kernel] [k] seq_read
0,32% [kernel] [k] obj_cgroup_uncharge_pages
0,32% [kernel] [k] kfree
0,31% [kernel] [k] path_init
0,31% [kernel] [k] __rcu_read_unlock
0,30% htop [.] Hashtable_get
0,30% [kernel] [k] security_file_permission
0,30% [kernel] [k] task_dump_owner
0,29% libc.so.6 [.] _IO_fgets
0,29% [kernel] [k] inode_permission
0,29% [kernel] [k] link_path_walk.part.0.constprop.0
0,29% [kernel] [k] generic_permission
0,28% [kernel] [k] strncpy_from_user
0,28% [kernel] [k] ksys_read
0,27% [kernel] [k] entry_SYSRETQ_unsafe_stack
0,27% [kernel] [k] security_file_release
0,26% [kernel] [k] do_filp_open
0,26% [kernel] [k] proc_pid_permission
0,26% [kernel] [k] alloc_empty_file
0,25% libc.so.6 [.] _IO_getline_info
0,25% [kernel] [k] __x64_sys_close
0,25% [kernel] [k] lookup_fast
0,25% [kernel] [k] mod_memcg_state
0,25% [kernel] [k] filp_flush
0,24% [kernel] [k] aa_file_perm
0,24% [kernel] [k] mntput_no_expire
0,23% libc.so.6 [.] read
0,23% [kernel] [k] __kmalloc_node_noprof
0,23% [kernel] [k] syscall_return_via_sysret
0,22% [kernel] [k] file_close_fd
0,22% [kernel] [k] __ptrace_may_access
...
@BenBE here you go:
(uploaded as .svg.gz so you can download, unpack and use the interactive features)
The culprit seems to be somewhere in the procfs code handling show_map_vma. This corresponds to any read of /proc/<pid>/maps (or the per-task equivalent /proc/<pid>/task/<tid>/maps).
The only place these files are referenced is within LinuxProcessTable_readMaps, which is called if any of the following holds:
- Highlighting of deleted files/executables is active (the default, IIRC)
- The PROCESS_FLAG_LINUX_LRS_FIX flag is set internally, which is the case if the M_LRS column is part of the displayed columns
Disabling the highlighting of deleted executables and removing the M_LRS column from the display should thus have a huge impact when running as root (who can read most of these files and therefore pays the full cost).
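To confirm that the maps reads are indeed the dominant cost, one could watch htop's procfs accesses while it runs as root; a rough sketch using strace (the PID lookup is only illustrative):
$ strace -f -tt -e trace=openat -p "$(pidof htop)" 2>&1 | grep '/maps'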
On the userland side cf. #768 …