Segmentation fault with cpu hotplug
Take a VM (libvirt + QEMU), e.g. Fedora 42 on x86_64, and install htop. In this case it's:
[root@test0 ~]# htop --version
htop 3.4.1
Now hot-plug some CPUs. htop segfaults:
FATAL PROGRAM ERROR DETECTED
============================
Please check at https://htop.dev/issues whether this issue has already been reported.
If no similar issue has been reported before, please create a new issue with the following information:
- Your htop version: '3.4.1'
- Your OS and kernel version (uname -a)
- Your distribution and release (lsb_release -a)
- Likely steps to reproduce (How did it happen?)
- Backtrace of the issue (see below)
Error information:
------------------
A signal 11 (Segmentation fault) was received.
Setting information:
--------------------
htop_version=3.4.1;config_reader_min_version=3;fields=0 48 17 18 38 39 40 2 46 47 49 1;hide_kernel_threads=1;hide_userland_threads=0;hide_running_in_container=0;shadow_o
ther_users=0;show_thread_names=0;show_program_path=1;highlight_base_name=0;highlight_deleted_exe=1;shadow_distribution_path_prefix=0;highlight_megabytes=1;highlight_thre
ads=1;highlight_changes=0;highlight_changes_delay_secs=5;find_comm_in_cmdline=1;strip_exe_from_cmdline=1;show_merged_command=0;header_margin=1;screen_tabs=1;detailed_cpu_time=0;cpu_count_from_one=0;show_cpu_usage=1;show_cpu_frequency=0;show_cpu_temperature=0;degree_fahrenheit=0;show_cached_memory=1;update_process_names=0;account_guest_in_cpu_meter=0;color_scheme=0;enable_mouse=1;delay=15;hide_function_bar=0;topology_affinity=0;header_layout=two_50_50;column_meters_0=LeftCPUs Memory Swap;column_meter_modes_0=1 1 1;column_meters_1=RightCPUs Tasks LoadAverage Uptime;column_meter_modes_1=1 2 2 2;tree_view=0;sort_key=46;tree_sort_key=0;sort_direction=-1;tree_sort_direction=1;tree_view_always_by_pid=0;all_branches_collapsed=0;screen:Main=PID USER PRIORITY NICE M_VIRT M_RESIDENT M_SHARE STATE PERCENT_CPU PERCENT_MEM TIME Command;.sort_key=PERCENT_CPU;.tree_sort_key=PID;.tree_view_always_by_pid=0;.tree_view=0;.sort_direction=-1;.tree_sort_direction=1;.all_branches_collapsed=0;screen:I/O=PID USER IO_PRIORITY IO_RATE IO_READ_RATE IO_WRITE_RATE PERCENT_SWAP_DELAY PERCENT_IO_DELAY Command;.sort_key=IO_RATE;.tree_sort_key=PID;.tree_view_always_by_pid=0;.tree_view=0;.sort_direction=-1;.tree_sort_direction=1;.all_branches_collapsed=0;
Backtrace information:
----------------------
htop(CRT_handleSIGSEGV+0x131) [0x55a114c6a531]
/lib64/libc.so.6(+0x1a070) [0x7f1f6e3cf070]
htop(+0x2607) [0x55a114c5e607]
htop(Header_updateData+0x71) [0x55a114c6bbd1]
htop(ScreenManager_run+0x674) [0x55a114c827e4]
htop(CommandLine_run+0x8b8) [0x55a114c68fb8]
/lib64/libc.so.6(+0x3575) [0x7f1f6e3b8575]
/lib64/libc.so.6(__libc_start_main+0x88) [0x7f1f6e3b8628]
htop(_start+0x25) [0x55a114c5ddc5]
To make the above information more practical to work with, please also provide a disassembly of your htop binary. This can usually be done by running the following command:
objdump -d -S -w `which htop` > ~/htop.objdump
Please include the generated file in your report.
Running this program with debug symbols or inside a debugger may provide further insights.
Thank you for helping to improve htop!
Segmentation fault (core dumped)
It's 100% reproducible.
Note that hot-unplugging causes no issue; you see those CPUs go "offline". That label isn't technically accurate, though: offline vs. online is a different distinction from present vs. missing.
HTH Thanks!
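To make the online vs. present distinction above concrete: Linux exposes both sets as CPU lists in /sys/devices/system/cpu/online and /sys/devices/system/cpu/present, in a format such as 0-3,6. The following is a minimal sketch that counts the CPUs in such a list string. The function name count_cpus_in_list is made up for illustration and is not part of htop.

```c
#include <stdlib.h>

/* Hypothetical helper (not htop code): count the CPUs in a kernel-style
 * CPU list such as "0-3,6", the format used by
 * /sys/devices/system/cpu/online and /sys/devices/system/cpu/present.
 * Returns -1 on malformed input, 0 for an empty list. */
int count_cpus_in_list(const char* list) {
   int total = 0;
   const char* p = list;

   while (*p && *p != '\n') {
      char* end;

      /* Parse the first CPU id of a range (or a single id). */
      long first = strtol(p, &end, 10);
      if (end == p || first < 0)
         return -1;
      long last = first;
      p = end;

      /* An optional "-N" turns the single id into an inclusive range. */
      if (*p == '-') {
         p++;
         last = strtol(p, &end, 10);
         if (end == p || last < first)
            return -1;
         p = end;
      }

      total += (int)(last - first + 1);

      /* Entries are separated by commas; anything else is an error. */
      if (*p == ',')
         p++;
      else if (*p && *p != '\n')
         return -1;
   }

   return total;
}
```

Feeding it the contents of both files before and after a hotplug event should show the present count changing, while taking a CPU offline only shrinks the online list.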
Would you mind including the objdump as noted in the crash message? TIA.
Apologies, I didn't get to that. I (possibly incorrectly) assumed this would be an easy reproducer and you'd have a better situation on your own machine. If that's not the case, lmk, and I'll try to get a dump shortly.
While this should be fairly simple to reproduce, it's sometimes better to trace the issue back directly from the binary, as issues with seemingly identical symptoms may have different causes. Having the objdump for the backtrace also makes it easier to group similar reports. Another point in favour of objdumps is that they let us see the exact code path that was taken even without reproducing every detail of the setup (e.g. removing a live CPU may behave slightly differently from one virtualization environment to another); and don't worry, we have already tracked down bugs from x64, arm, and mips assembly alone. Also, given that most builds aren't debug builds, the backtrace alone lacks some essential information that you can reconstruct from the objdump (that's why you sometimes see an offset/alignment¹ posted in bug reports).
So yes, please post the objdump for the exact binary that the backtrace/crash was triggered with. TIA.
¹Basically the module load offset for the code segment, which you need to subtract from the backtrace addresses in order to map them to the objdump. From there, mapping back to functions is mostly about knowing the rough code structure and how the optimization passes of modern compilers usually lay out the code.
[root@test0 ~]# objdump -d -S -w `which htop` > ~/htop.objdump
objdump: Warning: source file /usr/include/bits/unistd.h is more recent than object file
objdump: Warning: source file /usr/include/bits/stdio2.h is more recent than object file
objdump: Warning: source file /usr/include/bits/string_fortified.h is more recent than object file
objdump: Warning: source file /usr/include/stdlib.h is more recent than object file
objdump: Warning: source file /usr/include/bits/fcntl2.h is more recent than object file
objdump: Warning: source file /usr/include/wchar.h is more recent than object file
objdump: Warning: source file /usr/include/bits/stdlib.h is more recent than object file
objdump: Warning: source file /usr/include/sys/sysmacros.h is more recent than object file
GITHUB:
File type .objdump not supported. See the documentation for supported file types.
gzipped...
Offset: 0x55a114c5c000
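As a worked example of how that offset is used, here is a small sketch that pulls the absolute address out of a backtrace line and subtracts the load offset to get the address to look up in the objdump. Both function names are hypothetical helpers for illustration, not part of htop, and the offset only applies to frames inside the htop binary itself (the libc frames are loaded at a different base).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract the absolute address printed between
 * '[' and ']' at the end of a backtrace line. Returns 0 if the line
 * contains no '['. */
uint64_t backtrace_line_address(const char* line) {
   const char* open = strrchr(line, '[');
   if (!open)
      return 0;
   /* Base 16 accepts the "0x" prefix; parsing stops at ']'. */
   return (uint64_t)strtoull(open + 1, NULL, 16);
}

/* Hypothetical helper: translate a runtime address into the
 * module-relative address shown in the objdump of a
 * position-independent binary. */
uint64_t to_objdump_address(uint64_t runtime_addr, uint64_t load_offset) {
   return runtime_addr - load_offset;
}
```

With the offset 0x55a114c5c000, the frame htop(Header_updateData+0x71) [0x55a114c6bbd1] maps to 0xfbd1 in the objdump, and the unnamed frame at 0x55a114c5e607 maps to 0x2607, which matches the +0x2607 the backtrace already prints for it.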
There is an assert(existing == currExisting); in LinuxMachine_updateCPUcount that would normally trigger in debug builds.
A quick and dirty hack could be done like this:
diff --git a/CPUMeter.c b/CPUMeter.c
index 69da88db..b32f281b 100644
--- a/CPUMeter.c
+++ b/CPUMeter.c
@@ -226,11 +226,18 @@ static void AllCPUsMeter_getRange(const Meter* this, int* start, int* count) {
    }
 }
+static void AllCPUsMeter_done(Meter* this);
+static void CPUMeterCommonInit(Meter* this);
+
 static void AllCPUsMeter_updateValues(Meter* this) {
    CPUMeterData* data = this->meterData;
    Meter** meters = data->meters;
    int start, count;
    AllCPUsMeter_getRange(this, &start, &count);
+   if (data->cpus != (size_t)count) {
+      AllCPUsMeter_done(this);
+      CPUMeterCommonInit(this);
+   }
    for (int i = 0; i < count; i++)
       Meter_updateValues(meters[i]);
 }
@@ -276,9 +283,7 @@ static void CPUMeterCommonUpdateMode(Meter* this, MeterModeId mode, int ncol) {
 static void AllCPUsMeter_done(Meter* this) {
    CPUMeterData* data = this->meterData;
    Meter** meters = data->meters;
-   int start, count;
-   AllCPUsMeter_getRange(this, &start, &count);
-   for (int i = 0; i < count; i++)
+   for (size_t i = 0; i < data->cpus; i++)
       Meter_delete((Object*)meters[i]);
    free(data->meters);
    free(data);
diff --git a/linux/LinuxMachine.c b/linux/LinuxMachine.c
index 188358ef..ff768aaa 100644
--- a/linux/LinuxMachine.c
+++ b/linux/LinuxMachine.c
@@ -123,7 +123,7 @@ static void LinuxMachine_updateCPUcount(LinuxMachine* this) {
 #endif
    super->activeCPUs = active;
-   assert(existing == currExisting);
    super->existingCPUs = currExisting;
 }
Not sure about possible side effects. That patch is absolutely not tested and I'm not yet sure if it hits all the spots required, because there are potentially some more places that aren't fully aware of CPU hotplugging.
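For anyone following along, the general idea behind the hack (remember how many per-CPU entries were allocated, and rebuild when the live count differs) can be sketched in isolation. Everything below is a hypothetical stand-in: SubMeter, MeterSet and the MeterSet_* functions are invented for illustration and do not mirror htop's actual structures.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one per-CPU meter. */
typedef struct {
   size_t cpuId;
   double value;
} SubMeter;

/* Hypothetical container that records how many sub-meters it allocated,
 * so teardown never depends on the live (possibly changed) CPU count. */
typedef struct {
   SubMeter* meters;
   size_t allocated;
} MeterSet;

void MeterSet_init(MeterSet* set, size_t cpuCount) {
   set->meters = cpuCount ? calloc(cpuCount, sizeof(SubMeter)) : NULL;
   set->allocated = set->meters ? cpuCount : 0;
   for (size_t i = 0; i < set->allocated; i++)
      set->meters[i].cpuId = i;
}

void MeterSet_done(MeterSet* set) {
   /* Free based on what was allocated, not on the current CPU count. */
   free(set->meters);
   set->meters = NULL;
   set->allocated = 0;
}

/* Refresh all sub-meters; returns how many were updated. */
size_t MeterSet_update(MeterSet* set, size_t currentCpuCount) {
   /* CPU count changed (hotplug): tear down and rebuild. */
   if (set->allocated != currentCpuCount) {
      MeterSet_done(set);
      MeterSet_init(set, currentCpuCount);
   }

   /* Fetch the array only after a possible rebuild; a pointer cached
    * before the rebuild would refer to freed memory. */
   SubMeter* meters = set->meters;
   for (size_t i = 0; i < set->allocated; i++)
      meters[i].value = 0.0; /* placeholder for the real per-CPU update */

   return set->allocated;
}
```

One detail this sketch is careful about, and which may be worth double-checking in any patch of this shape, is that the array pointer is read after the rebuild and not before it.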