elksemu (sometimes) hangs running an ia16-gcc program

Open johnsonjh opened this issue 2 months ago • 5 comments

Greetings.

I was trying to port a program (errnum.c) to ELKS. Most of the program exists to ensure that localized messages are always used in a cross-platform-aware way, which is unnecessary on ELKS since there is no locale support, nor does strsignal exist, so in the end the program does very little. The original program also used a 32K static buffer, but on ELKS we can safely assume that no error message will ever exceed 128 bytes.

Obviously, it would be better to just rewrite this program for ELKS, but the minimal ELKS version is errnum_elks.c.

Running this program on Linux with elksemu results in random hangs at different points for me:

$ ia16-elf-gcc -melks errnum_elks.c -o errnum
$ while :; do seq 1 512 | xargs -I{} ~/src/build-ia16/build-elks/elksemu/elksemu errnum {}; done

It always hangs eventually, at a different point each time.
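
To pin down which invocations hang without watching the loop by hand, each run could be given a time limit. What follows is only a minimal sketch, assuming Python 3 is available on the Linux host; the name find_hangs, the 5-second limit, and the command-line layout are illustrative and not part of elksemu or ELKS.

```python
import subprocess
import sys


def find_hangs(cmd_prefix, args, timeout=5.0):
    """Run cmd_prefix + [arg] for each arg; return the args that timed out."""
    hung = []
    for arg in args:
        try:
            subprocess.run(
                cmd_prefix + [str(arg)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before re-raising,
            # so a hung or stopped process is not left behind.
            hung.append(arg)
    return hung


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 find_hangs.py /path/to/elksemu errnum
    print(find_hangs(sys.argv[1:], range(1, 513)))
```

A timeout cannot tell a genuine hang from a process that was merely stopped, so this only records where a run stalled, not why.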

I also have a port of another program (I can provide it if need be), which I run like this:

while :; do elksemu ./mcmb -X mul 5 10 15; done

and it will run a few iterations and then elksemu will get a SIGSTOP:

[1]  + 2744420 suspended (signal)  elksemu ./mcmb -X mul 5 10 15

And then I can fg to continue it, and it will run millions of iterations without any pauses or hangs. But if I start the while loop over again, it will always get a SIGSTOP almost immediately, and require an fg to continue on.

I've not looked much further into the cause yet. I did try running elksemu under strace on my system, and it gets an immediate SIGSTOP every single time:

$ strace elksemu ./mcmb -X mul 5 10 15
execve("/home/jhj/src/build-ia16/build-elks/elksemu/elksemu", ["/home/jhj/src/build-ia16/build-e"..., "./mcmb", "-X", "mul", "5", "10", "15"], 0x7ffcb7f14e40 /* 125 vars */) = 0
brk(NULL)                               = 0x2db61000                                                                                                         
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x153bf19ac000                                                                    
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)                                                                              
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3                                                                                                 
fstat(3, {st_mode=S_IFREG|0644, st_size=334839, ...}) = 0                                                                                                    
mmap(NULL, 334839, PROT_READ, MAP_PRIVATE, 3, 0) = 0x153bf195a000                                                                                            
close(3)                                = 0                                                                                                                  
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3                                                                                                 
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\00007\0\0\0\0\0\0"..., 832) = 832                                                                      
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784                                                                
fstat(3, {st_mode=S_IFREG|0755, st_size=2447520, ...}) = 0                                                                                                   
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784                                                                
mmap(NULL, 2038832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x153bf1768000                                                                   
mmap(0x153bf18d7000, 479232, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16f000) = 0x153bf18d7000                                                   
mmap(0x153bf194c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e3000) = 0x153bf194c000                                         
mmap(0x153bf1952000, 31792, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x153bf1952000                                               
close(3)                                = 0                                   
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x153bf1765000                                                                   
arch_prctl(ARCH_SET_FS, 0x153bf1765740) = 0                                   
set_tid_address(0x153bf1765a10)         = 2830309                                                                                                            
set_robust_list(0x153bf1765a20, 24)     = 0                                                                                                                  
rseq(0x153bf1765680, 0x20, 0, 0x53053053) = 0                                                                                                                
mprotect(0x153bf194c000, 16384, PROT_READ) = 0                                                                                                               
mprotect(0x408000, 4096, PROT_READ)     = 0                                   
mprotect(0x153bf19e8000, 8192, PROT_READ) = 0                                                                                                                
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=RLIM64_INFINITY, rlim_max=RLIM64_INFINITY}) = 0                                                                   
munmap(0x153bf195a000, 334839)          = 0                                                                                                                  
access("./mcmb", X_OK)                  = 0                                                                                                                  
openat(AT_FDCWD, "./mcmb", O_RDONLY)    = 3                                                                                                                  
fstat(3, {st_mode=S_IFREG|0755, st_size=51776, ...}) = 0                                                                                                     
getuid()                                = 1000                                
getgid()                                = 1000                                                                                                               
setregid(1000, 1000)                    = 0                                                                                                                  
setreuid(1000, 1000)                    = 0                                                                                                                  
modify_ldt(0, 0x7ffcffabfeb0, 65536)    = 0                                                                                                                  
modify_ldt(1, {entry_number=0, base_addr=0x001000, limit=0x000fff, seg_32bit=0, contents=2, read_exec_only=0, limit_in_pages=0, seg_not_present=0, useable=1, lm=0}, 16) = 0                                                                                                                                              
modify_ldt(1, {entry_number=1, base_addr=0x001000, limit=0x000fff, seg_32bit=0, contents=2, read_exec_only=0, limit_in_pages=0, seg_not_present=0, useable=1, lm=0}, 16) = 0                                                                                                                                              
modify_ldt(1, {entry_number=2, base_addr=0x001000, limit=0x000fff, seg_32bit=0, contents=0, read_exec_only=0, limit_in_pages=0, seg_not_present=0, useable=1, lm=0}, 16) = 0                                                                                                                                              
mmap(NULL, 200704, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x41eed000                                                  
mprotect(0x41f1d000, 4096, PROT_NONE)   = 0                                                                                                                  
read(3, "\1\0030\4 \0\1\0p\300\0\0\260\t\0\0\260\f\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32) = 32                                                                    
read(3, "Y\211\343QU\211\345\211\310@\321\340\1\330PSQ\276\364\10\375\255\374\221\343\4\377\321\353\366\350r"..., 49264) = 49264                             
read(3, "", 0)                          = 0                                                                                                                  
read(3, "\0\0\0\0\0\0%s \0002120.6.03-dps\0libcmb 3"..., 2480) = 2480                                                                                        
modify_ldt(1, {entry_number=0, base_addr=0x41eed000, limit=0x00c06f, seg_32bit=0, contents=2, read_exec_only=0, limit_in_pages=0, seg_not_present=0, useable=0, lm=0}, 16) = 0                                                                                                                                            
modify_ldt(1, {entry_number=1, base_addr=0x41efd000, limit=0xffffffff, seg_32bit=0, contents=2, read_exec_only=0, limit_in_pages=0, seg_not_present=1, useable=0, lm=0}, 16) = 0                                                                                                                                          
modify_ldt(1, {entry_number=2, base_addr=0x41f0d000, limit=0x005895, seg_32bit=0, contents=0, read_exec_only=0, limit_in_pages=0, seg_not_present=0, useable=0, lm=0}, 16) = 0                                                                                                                                            
close(3)                                = 0                                                                                                                  
clone(child_stack=0x41f10650, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_IO) = 2830310                                                                        
ptrace(PTRACE_ATTACH, 2830310)          = 0                                                                                                                  
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_STOPPED, si_pid=2830310, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=0} ---                              
wait4(2830310, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], __WALL, NULL) = 2830310                                                                          
ptrace(PTRACE_SETREGS, 2830310, {r15=0, r14=0, r13=0, r12=0, rbp=0, rbx=0, r11=0, r10=0, r9=0, r8=0, rax=0, rcx=0, rdx=0, rsi=0, rdi=0, orig_rax=0xf, rip=0, cs=0x7, eflags=0, rsp=0x3660, ss=0x17, fs_base=0, gs_base=0, ds=0x17, es=0x17, fs=0, gs=0}) = 0                                                              
ptrace(PTRACE_GETREGS, 2830310, {r15=0, r14=0, r13=0, r12=0, rbp=0, rbx=0, r11=0, r10=0, r9=0, r8=0, rax=0, rcx=0, rdx=0, rsi=0, rdi=0, orig_rax=0xf, rip=0, cs=0x7, eflags=0x202, rsp=0x3660, ss=0x17, fs_base=0, gs_base=0, ds=0x17, es=0x17, fs=0, gs=0}) = 0                                                          
ptrace(PTRACE_SYSEMU, 2830310, NULL, 0) = 0                                                                                                                  
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=2830310, si_uid=1000, si_status=SIGSTOP, si_utime=0, si_stime=0} ---                              
wait4(2830310, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGSTOP}], __WALL, NULL) = 2830310                                                                          
ptrace(PTRACE_GETREGS, 2830310, {r15=0, r14=0, r13=0, r12=0, rbp=0, rbx=0, r11=0, r10=0, r9=0, r8=0, rax=0, rcx=0, rdx=0, rsi=0, rdi=0, orig_rax=0xf, rip=0, cs=0x7, eflags=0x202, rsp=0x3660, ss=0x17, fs_base=0, gs_base=0, ds=0x17, es=0x17, fs=0, gs=0}) = 0                                                          
gettid()                                = 2830309                                                                                                            
getpid()                                = 2830309                                                                                                            
tgkill(2830309, 2830309, SIGSTOP)       = 0                                                                                                                  
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=2830309, si_uid=1000} ---                                                                            
--- stopped by SIGSTOP ---                                                                                                                                   

And of course, giving it a SIGCONT makes it go.

So this seems to be something going on with elksemu and how the ptrace code works. It's a minor inconvenience interactively, but it's breaking my ability to do any automated testing.

Before I dig further in, any suggestions?
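
As a stopgap for automated testing, a wrapper could do what "fg" does by hand: notice that the child has stopped and send it SIGCONT. This is only a minimal sketch, assuming Python 3 on the Linux host and that the stopped process is the direct child of the wrapper; the name run_resuming_stops and the max_resumes cap are illustrative. It masks the symptom and does nothing about the underlying cause.

```python
import os
import signal
import sys


def run_resuming_stops(argv, max_resumes=16):
    """Run argv and send SIGCONT whenever the child is reported stopped.

    Returns (exit_code, resumes); exit_code is the negated signal
    number if the child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # reached only if exec failed
    resumes = 0
    while True:
        # WUNTRACED makes waitpid() also report stopped children.
        _, status = os.waitpid(pid, os.WUNTRACED)
        if os.WIFSTOPPED(status):
            if resumes >= max_resumes:
                os.kill(pid, signal.SIGKILL)  # give up; reaped next loop
            else:
                resumes += 1
                os.kill(pid, signal.SIGCONT)
            continue
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status), resumes
        return -os.WTERMSIG(status), resumes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 resume.py elksemu ./mcmb -X mul 5 10 15
    code, resumes = run_resuming_stops(sys.argv[1:])
    print(f"exit={code} resumes={resumes}", file=sys.stderr)
```

A non-zero resumes count gives a test script a way to log how often the stop occurred while still letting the run finish.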

Edit: It seems like this would have to be a Linux bug, no?

johnsonjh avatar Oct 21 '25 23:10 johnsonjh

Here is a short screen capture, because you might not believe me otherwise. :)

As you can see, sometimes I can run elksemu a few hundred iterations and sometimes it'll run 10,000 but it will ALWAYS quickly hang with a SIGSTOP. And then, after continuing with "fg", in that same shell execution elksemu will then run forever without problems, but once you stop it with ^C and re-execute the shell pipeline, it'll have the same problem again.

https://github.com/user-attachments/assets/ab77808d-8498-41b0-9bc8-617cfe49764e

johnsonjh avatar Oct 21 '25 23:10 johnsonjh

Hi @johnsonjh,

Unfortunately I can't actually run elksemu at all since I develop on macOS. The last big update to elksemu was from @BinaryMelodies in #2411, where ptrace support was added. IIRC ptrace support was not enabled in that PR, but I might have enabled it afterwards in elksemu/Makefile with USE_PTRACE=1. Perhaps you can try commenting that line out to see if it makes a difference, and otherwise look closely at that PR, or roll back elksemu to the version before it, just to see what happens.

As we learn more about what happens without USE_PTRACE, I can help you dig deeper into the elksemu source.

Thank you!

ghaerr avatar Oct 22 '25 00:10 ghaerr

This is just a wild guess, but does the issue persist if you add the predefined macro USE_X86EMU=1?

The current version of elksemu uses the native 16-bit protected mode of the CPU, and IIRC it works by launching a separate 16-bit process while the main process actively checks whether the 16-bit process attempts to execute a system call (this was not done by me; it is a workaround for the missing vm86 mode in 64-bit mode that prior versions of elksemu relied on). This may lead to hard-to-reproduce bugs due to its parallelism.

If you enable the USE_X86EMU flag, the entire code will run in a single process via the software emulation library https://github.com/wfeldt/libx86emu. This may or may not improve the stability of your run; it worked well enough for some of my needs.

BinaryMelodies avatar Oct 22 '25 07:10 BinaryMelodies

Alternatively, you can develop using https://github.com/ghaerr/8086-toolchain. There are makefiles for native compilation and for producing ELKS binaries on Linux. You can even use this toolchain inside ELKS, since the toolchain itself is compiled to run on ELKS!

Check:

The ELKS source code top directory contains a script 'buildc86.sh' which builds the toolchain binaries, and 'copyc86.sh' which copies the toolchain binaries, headers and library to ELKS /usr, as well as an archive 'devc86.tar'.

toncho11 avatar Oct 22 '25 13:10 toncho11

You can try the toolchain under ELKS using the HDD image I provided at https://github.com/ghaerr/elks/discussions/2240

toncho11 avatar Oct 22 '25 13:10 toncho11

I found this comment on Reddit: https://www.reddit.com/r/osdev/comments/1c81b8x/good_documentation_on_v86_mode/

"VT-x lets you run VM86 inside a VMX image’s logical processor, which is the only way to run rmode code natively from “within” long mode. If you have EPT and a newerish CPU, you can boot an unrestricted guest directly into VM86. AMD-VT can run an SVM in paged real mode, or do normal guest VM86 mode. "

I think there is still hope of running ELKS applications natively on Linux x86_64 64-bit mode, right?

rafael2k avatar Dec 08 '25 11:12 rafael2k

Technically, elksemu is already running 16-bit ELKS applications natively on a 64-bit Linux (unless compiled with the USE_X86EMU flag), but only in protected mode.

I have no idea how complex it would be to launch a VT-x instance and boot it up in real mode, but if I had to guess, it would probably require running a full ELKS kernel in it, instead of just translating system calls to the native Linux kernel.

BinaryMelodies avatar Dec 09 '25 10:12 BinaryMelodies

Indeed. It might be possible to use the KVM API, but that is beyond my knowledge anyway. The way elksemu works is pretty neat; I use it sometimes and love it.

rafael2k avatar Dec 10 '25 22:12 rafael2k