Rosetta x86 `mmap()` syscall behaviour abnormal, causes SIGSEGV

Open enduity opened this issue 2 weeks ago • 3 comments

Description

When running x86_64 binaries via Rosetta 2 (VZ) on Linux, the mmap syscall exhibits non-atomic destructive behavior when MAP_FIXED is used.

If an mmap call with MAP_FIXED is attempted over existing valid memory, but the allocation fails (e.g., due to MAP_HUGETLB constraints or alignment issues), the underlying memory at the target address is unmapped/cleared before the error is returned.

This leaves the process with a corrupted memory map (holes in the heap or stack), leading to SIGSEGV or data corruption. Standard Linux kernel behavior (and QEMU emulation) guarantees atomicity: if the syscall fails, the existing memory map should remain untouched.

This specifically crashes PHP Opcache (CLI and FPM) on standard x86 images, as PHP aggressively attempts to allocate Huge Pages using MAP_FIXED fallback logic.

Version

❯ colima version && limactl --version
colima version 0.9.1
git commit: 0cbf719f5409ce04b9f0607b681c005d2ff7d94a

runtime: docker
arch: aarch64
client: v29.0.1
server: v28.4.0
limactl version 2.0.1

Operating System

  • [ ] macOS Intel <= 13 (Ventura)
  • [ ] macOS Intel >= 14 (Sonoma)
  • [ ] Apple Silicon <= 13 (Ventura)
  • [x] Apple Silicon >= 14 (Sonoma)
  • [ ] Linux

Output of colima status

❯ colima status
INFO[0000] colima is running using macOS Virtualization.Framework
INFO[0000] arch: aarch64
INFO[0000] runtime: docker
INFO[0000] mountType: virtiofs
INFO[0000] docker socket: unix:///Users/endrik.einberg/.colima/default/docker.sock
INFO[0000] containerd socket: unix:///Users/endrik.einberg/.colima/default/containerd.sock

Reproduction Steps

  1. Start Colima with VZ and Rosetta enabled: colima start --arch aarch64 --vm-type vz --vz-rosetta

  2. Run an Alpine x86_64 PHP container enabling Opcache (CLI mode). This triggers the memory corruption immediately:

    docker run --rm --platform linux/amd64 php:8.3-cli-alpine ash -c \
      'php -n -dzend_extension=opcache -dopcache.enable_cli=1 -v'
    

    Result: Segmentation fault (core dumped)

  3. (Optional) Run the Debian-based x86_64 PHP image with Opcache enabled (CLI mode). This fails with a different error message:

    docker run --rm --platform linux/amd64 php:8.3-cli sh -c \
      'php -n -dzend_extension=opcache -dopcache.enable_cli=1 -v'
    

    Result:

    rosetta error: futex(FUTEX_LOCK_PI_PRIVATE) failure: 35
    Trace/breakpoint trap (core dumped)
    
  4. (Optional) Run this C Proof of Concept which maps a "Survivor" page, writes data to it, and then attempts to overwrite it with a MAP_FIXED | MAP_HUGETLB allocation that is guaranteed to fail.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <errno.h>
    #include <signal.h>
    #include <setjmp.h>
    #include <string.h>
    #include <stdint.h>
    
    #define HUGE_PAGE_SIZE (2 * 1024 * 1024)
    
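    /* Jump target used to recover from a SIGSEGV raised by the probe read. */
    static sigjmp_buf jump_env;
    static void segv_handler(int sig) { (void)sig; siglongjmp(jump_env, 1); }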
    
    int main() {
        size_t temp_alloc_size = 2 * HUGE_PAGE_SIZE;
        void *temp_base = mmap(NULL, temp_alloc_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (temp_base == MAP_FAILED) {
            perror("Failed to allocate temp base for alignment");
            return 1;
        }
        printf("[0] Temporary base allocated at %p for alignment\n", temp_base);
    
        /* Round up to the next 2 MiB boundary inside the temporary mapping. */
        void *aligned_addr = (void *)(((uintptr_t)temp_base + HUGE_PAGE_SIZE - 1) & ~((uintptr_t)HUGE_PAGE_SIZE - 1));
        printf("[0] Calculated aligned address: %p\n", aligned_addr);
    
        void *survivor = mmap(aligned_addr, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
        if (survivor == MAP_FAILED) {
            perror("Failed to map survivor at aligned address");
            munmap(temp_base, temp_alloc_size);
            return 1;
        }
    
        *(volatile unsigned char*)survivor = 0x42;
        printf("[1] Survivor mapped at %p (Value: 0x42). This address is HUGE_PAGE_SIZE aligned.\n", survivor);
    
        printf("[2] Attempting mmap(FIXED | HUGETLB) on top of survivor (%p)...\n", survivor);
        void *p = mmap(survivor, HUGE_PAGE_SIZE, PROT_READ|PROT_WRITE,
                       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB, -1, 0);
    
        if (p == MAP_FAILED) {
            printf("    -> mmap FAILED. Errno: %d (%s)\n", errno, strerror(errno));
    
            /* Install a SIGSEGV handler so that a fault on the probe read
               below is reported instead of killing the process. */
            struct sigaction sa;
            memset(&sa, 0, sizeof(struct sigaction));
            sa.sa_handler = segv_handler;
            sa.sa_flags = SA_NODEFER;
            sigaction(SIGSEGV, &sa, NULL);
    
            if (sigsetjmp(jump_env, 1) == 0) {
                unsigned char val = *(volatile unsigned char*)survivor;
                signal(SIGSEGV, SIG_DFL);
                if (val == 0x42) {
                    printf("    -> PASS: Memory preserved (Value: 0x%X).\n", val);
                } else {
                    printf("    -> FAIL: Memory preserved but corrupted (Value: 0x%X)!\n", val);
                }
            } else {
                signal(SIGSEGV, SIG_DFL);
                printf("    -> CRITICAL FAIL: Memory was unmapped! (SIGSEGV)\n");
            }
        } else {
            printf("    -> mmap succeeded at %p. (Unexpected for this test case on Rosetta)\n", p);
            munmap(p, HUGE_PAGE_SIZE);
        }
    
        munmap(survivor, 4096);
        munmap(temp_base, temp_alloc_size);
    
        return 0;
    }
    

    Result: CRITICAL FAIL: Memory was unmapped! (SIGSEGV)

Expected behaviour

If mmap(..., MAP_FIXED) fails (returns MAP_FAILED), the existing memory mapping at that address should remain untouched. The operation should be atomic.

Under Colima (VZ with Rosetta), the failing mmap call currently returns ENOMEM but the existing mapping is unmapped anyway.
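
The PoC above detects the lost mapping by catching SIGSEGV. As an alternative, here is a minimal sketch of a probe that avoids signal handling by asking the kernel via mincore(2), which fails with ENOMEM when the range contains unmapped memory. The helper name `page_is_mapped` is hypothetical and not part of the original report, and this probe has not been verified under Rosetta.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the page containing addr is mapped,
   0 if it is not mapped, and -1 on any other mincore() error. */
int page_is_mapped(const void *addr) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        return -1;
    }

    /* mincore() requires a page-aligned start address. */
    void *base = (void *)((uintptr_t)addr & ~((uintptr_t)page - 1));
    unsigned char vec; /* one residency byte per page; we query one page */

    if (mincore(base, (size_t)page, &vec) == 0) {
        return 1;
    }
    /* ENOMEM means the range contains unmapped memory. */
    return errno == ENOMEM ? 0 : -1;
}
```

Calling `page_is_mapped(survivor)` before and after the failing MAP_FIXED | MAP_HUGETLB call would be expected to return 1 both times on a system where the failure is atomic.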

Additional context

Docker Desktop comparison

We compared this behavior against Docker Desktop, which also uses Apple's Virtualization.framework and Rosetta, but does not crash.

Docker Desktop avoids this bug by using a workaround. The exact details are unknown, but mmap calls with the MAP_HUGETLB flag are somehow intercepted and EPERM is returned before the syscall hits the actual Rosetta translation layer.
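
How Docker Desktop does this is not known. Purely as an illustration of the general idea (reject mmap calls carrying MAP_HUGETLB with EPERM before they are executed), here is a sketch using a seccomp BPF filter on native Linux. The function name `deny_hugetlb_mmap` is hypothetical, the filter only handles native x86_64 and aarch64 (both little-endian) syscalls, and whether a filter like this is effective for a process running under Rosetta is an open assumption.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

/* Hypothetical helper: after it returns 0, any native mmap() call whose
   flags include MAP_HUGETLB fails with EPERM without being executed. */
int deny_hugetlb_mmap(void) {
    struct sock_filter filter[] = {
        /* [0] Load the syscall architecture. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        /* [1] Foreign architecture: skip to ALLOW at [7]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 0, 5),
        /* [2] Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [3] Not mmap: skip to ALLOW at [7]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mmap, 0, 3),
        /* [4] Load the low 32 bits of the flags argument (args[3]). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[3])),
        /* [5] MAP_HUGETLB set: fall through to [6]; otherwise ALLOW at [7]. */
        BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MAP_HUGETLB, 0, 1),
        /* [6] Fail the call with EPERM. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* [7] Let everything else through. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required for installing a filter without CAP_SYS_ADMIN. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        return -1;
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        return -1;
    }
    return 0;
}
```

Because the call is rejected before the kernel acts on it, the existing mapping at the target address cannot be disturbed, which matches the EPERM result observed on Docker Desktop.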

Comparison of `strace` output

Colima:

mmap(0x7f..., ... MAP_FIXED|MAP_HUGETLB ...) = -1 ENOMEM (Out of memory)
# Side effect: The memory at 0x7f... is destroyed/unmapped.

Docker Desktop:

mmap(0x7f..., ... MAP_FIXED|MAP_HUGETLB ...) = -1 EPERM (Operation not permitted)
# Side effect: None. The memory remains intact.

Related issues in other projects

It seems like this was explicitly fixed in Docker Desktop and OrbStack at least once:

  • https://github.com/docker/for-mac/issues/7147
  • https://github.com/orbstack/orbstack/issues/380

PHP code reference

The PHP internals that actually make this mmap() call are here:

  • https://github.com/php/php-src/blob/78a24ffc032804755e31bb308c0e754cbc049051/ext/opcache/shared_alloc_mmap.c#L250-L252

    munmap(p, requested_size);
    p = (void*)(ZEND_MM_ALIGNED_SIZE_EX((ptrdiff_t)p, huge_page_size));
    p = mmap(p, requested_size, flags, MAP_SHARED|MAP_ANONYMOUS|MAP_32BIT|MAP_HUGETLB|MAP_FIXED, -1, 0);

This code is also buggy because the recomputed memory location is not checked for overlaps with existing mappings. However, this further suggests that Colima's behaviour is unexpected: if a failed MAP_FIXED call destroyed existing mappings on a standard Linux kernel, this bug would have been noticed and fixed long ago, as the code has existed for a long time.
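
For comparison, here is a sketch of an allocation pattern that never points MAP_FIXED at memory the caller does not own: it reserves an oversized PROT_NONE region first and attempts the huge-page mapping only inside that reservation. This is an illustration, not PHP's code or a proposed patch; `alloc_prefer_huge` is a hypothetical name, it uses MAP_PRIVATE rather than PHP's MAP_SHARED | MAP_32BIT, and it assumes no other thread is mapping memory concurrently.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE ((size_t)2 * 1024 * 1024)

/* Hypothetical helper: returns a read/write anonymous mapping of `size`
   bytes, backed by huge pages when available, or MAP_FAILED on error.
   `size` must be a non-zero multiple of HUGE_PAGE_SIZE. */
void *alloc_prefer_huge(size_t size) {
    if (size == 0 || size % HUGE_PAGE_SIZE != 0) {
        errno = EINVAL;
        return MAP_FAILED;
    }

    /* Reserve address space large enough to contain an aligned block. */
    size_t span = size + HUGE_PAGE_SIZE;
    unsigned char *reserve = mmap(NULL, span, PROT_NONE,
                                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (reserve == MAP_FAILED) {
        return MAP_FAILED;
    }

    uintptr_t aligned = ((uintptr_t)reserve + HUGE_PAGE_SIZE - 1)
                        & ~((uintptr_t)HUGE_PAGE_SIZE - 1);
    unsigned char *target = (unsigned char *)aligned;

    /* MAP_FIXED only ever overwrites our own reservation here. */
    void *p = mmap(target, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* Drop the reservation (harmless if part of it is already gone)
           and fall back to ordinary pages at a kernel-chosen address. */
        munmap(reserve, span);
        return mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    /* Trim the unused head and tail of the reservation. */
    if (target > reserve) {
        munmap(reserve, (size_t)(target - reserve));
    }
    unsigned char *end = target + size;
    unsigned char *reserve_end = reserve + span;
    if (reserve_end > end) {
        munmap(end, (size_t)(reserve_end - end));
    }
    return p;
}
```

With this layout, even a non-atomic failure such as the one described in this issue can only damage the throwaway reservation, not unrelated heap or stack mappings.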

enduity · Nov 24 '25 08:11