
Kernel page-in linking

Open rui314 opened this issue 2 years ago • 22 comments

In this year's WWDC video about linking (*1), Apple announced a feature that was new to me. It sounded like a neat idea, and I now wonder if we can do the same thing for Linux and other operating systems.

(*1) https://developer.apple.com/videos/play/wwdc2022/110362/ at 25:04

The problem they were trying to solve is the inefficiency of position-independent executables (PIE). When a PIE is executed, the loader maps it to an arbitrary place in the virtual address space and applies base relocations to fix up absolute addresses in the executable image.

Applying base relocations is easy; the loader calculates the difference between the executable's desired load address and its actual load address, and fixes up all word-size values containing addresses by adjusting them by that difference. But that operation is expensive for two reasons. First, there are usually many locations to fix. Second, the loader touches many pages, so the kernel has to load a lot of pages from the file into memory, and it makes lots of pages dirty.
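To make the fix-up concrete, here is a minimal sketch in C of what an eager loader does. The function name `apply_base_relocs` and the flat array of slot offsets are assumptions made for illustration; real formats (REL/RELA, RELR, .reloc) encode the slot list differently. The sketch adds `actual_base - desired_base` to every slot, which is the adjustment described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not taken from any real loader.
   `image` is the start of the mapped image and `offsets` lists the byte
   offsets of the word-size slots that hold absolute addresses. */
static void apply_base_relocs(uint8_t *image,
                              uint64_t desired_base, uint64_t actual_base,
                              const size_t *offsets, size_t count)
{
    /* Unsigned arithmetic wraps, so this also works when the image is
       loaded below its desired address. */
    uint64_t delta = actual_base - desired_base;

    for (size_t i = 0; i < count; i++) {
        uint64_t value;
        /* memcpy avoids alignment assumptions about the slot. */
        memcpy(&value, image + offsets[i], sizeof value);
        value += delta;
        memcpy(image + offsets[i], &value, sizeof value);
    }
}
```

Every slot written this way dirties the page it lives on, even if the program never touches that page afterwards, which is the cost discussed above.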

The idea Apple has implemented is to let the kernel, not the loader, apply base relocations. When a process is started, the loader doesn't apply base relocations. Instead, the kernel applies them as it pages in file contents. I think this is a clever idea because it solves both problems I mentioned above. Now, the loader doesn't have to fix many places, so the process startup gets faster. And the system as a whole now has fewer dirty memory pages, because pages that are never touched are not base-relocated at all. When memory pressure becomes high, the kernel can even discard base-relocated pages (instead of paging them out) because it knows how to apply base relocations again the next time they are needed.
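A rough sketch of the per-page variant, in C. This is not Apple's implementation; it only illustrates why the work can be deferred to page-in time. The names `relocate_one_page` and `RELOC_PAGE_SIZE`, the sorted offset array, and the assumption that every slot is 8-byte aligned (so no slot straddles a page boundary) are all made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RELOC_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* Apply only the relocations that fall inside page `page_index`.
   `page` points at that page's freshly read contents.
   `offsets` are byte offsets from the image start, sorted ascending.
   Returns the number of slots that were fixed up. */
static size_t relocate_one_page(uint8_t *page, size_t page_index,
                                const uint64_t *offsets, size_t count,
                                uint64_t delta)
{
    uint64_t start = (uint64_t)page_index * RELOC_PAGE_SIZE;
    uint64_t end = start + RELOC_PAGE_SIZE;

    /* Binary search for the first offset that is >= start. */
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offsets[mid] < start)
            lo = mid + 1;
        else
            hi = mid;
    }

    size_t applied = 0;
    for (size_t i = lo; i < count && offsets[i] < end; i++) {
        uint64_t value;
        memcpy(&value, page + (offsets[i] - start), sizeof value);
        value += delta;
        memcpy(page + (offsets[i] - start), &value, sizeof value);
        applied++;
    }
    return applied;
}
```

Because the result depends only on the file contents, the relocation list, and the delta, a page relocated this way can be dropped under memory pressure and rebuilt on the next fault.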

I don't know if there's an ongoing effort to implement the same thing for Linux. If not, we might want to start a project.

rui314 avatar Jul 13 '22 08:07 rui314

I think as long as the kernel provides the following features, we can implement the rest in the user space:

  1. An interface to register a memory region in which we are interested when page-in occurs
  2. An in-kernel mechanism to call a user-space observer (something like signal handler) when a page-in occurs

With that, we can mmap(2) an executable into memory, register a page-in observer for that memory region, and let the observer apply base relocations.

rui314 avatar Jul 13 '22 08:07 rui314

I think that's a very cool idea! I'm not sure if/how useful this might be, but I've had to play a little with user-space exposed Mach calls (which have kernel-space equivalents too) to perform on-the-fly rebases myself when doing a hot-code swapping experiment in the presence of ASLR on macOS with Zig. Here's the link to my write-up: Hot-code reloading on macOS/arm64 with Zig. Perhaps you will find it useful when trying to work out how it works in Mach with the intention of applying the ideas to the Linux kernel.

kubkon avatar Jul 13 '22 08:07 kubkon

I wonder if @MaskRay is interested.

rui314 avatar Jul 13 '22 12:07 rui314

This looks very interesting. For small binaries this should be kinda irrelevant (I suppose it's part of the reason why prelink faded away), but I can think of initialization time being more significant in large binaries like Chromium (hundreds of MBs) or Unreal Engine (1GB). An interesting benchmark would be to compare the relocating + fault time to the actual library+application initialization cost.

On Linux, one should be able to use either a SIGSEGV handler (slow) or userfaultfd (faster), although both are pretty invasive and can interfere with browser sandboxes.
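For reference, the SIGSEGV variant can be sketched in C roughly as follows. This is a toy under several assumptions: anonymous memory stands in for a file mapping, every 8-byte word in a page is treated as a relocated slot, and all names (`lazy_region_create`, `on_fault`, the `g_*` globals) are hypothetical. Calling mprotect(2) from a signal handler works on Linux in practice, but POSIX does not list it as async-signal-safe.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *g_region;             /* start of the watched mapping */
static size_t g_region_size;
static size_t g_page_size;
static uint64_t g_delta;              /* actual base minus desired base */
static volatile sig_atomic_t g_faults;

/* First touch of a protected page lands here: unprotect the page,
   relocate it, and return so the faulting instruction is retried. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uint8_t *addr = (uint8_t *)info->si_addr;
    if (addr < g_region || addr >= g_region + g_region_size)
        _exit(1); /* a genuine crash outside the watched region */

    uint8_t *page =
        (uint8_t *)((uintptr_t)addr & ~(uintptr_t)(g_page_size - 1));
    mprotect(page, g_page_size, PROT_READ | PROT_WRITE);

    /* Toy assumption: every word in the page holds an address. */
    for (size_t off = 0; off < g_page_size; off += sizeof(uint64_t))
        *(uint64_t *)(page + off) += g_delta;
    g_faults++;
}

/* Create `pages` pages filled with `link_value`, then make them
   inaccessible so that relocation happens lazily on first access. */
static uint8_t *lazy_region_create(size_t pages, uint64_t link_value,
                                   uint64_t delta)
{
    g_page_size = (size_t)sysconf(_SC_PAGESIZE);
    g_region_size = pages * g_page_size;
    g_delta = delta;

    void *mem = mmap(NULL, g_region_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    g_region = mem;

    for (size_t off = 0; off < g_region_size; off += sizeof(uint64_t))
        memcpy(g_region + off, &link_value, sizeof link_value);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return NULL;

    if (mprotect(g_region, g_region_size, PROT_NONE) != 0)
        return NULL;
    return g_region;
}
```

Each first touch costs a signal delivery plus an mprotect call, which is where the "slow" comes from, and the process-wide SIGSEGV handler is what makes the approach invasive.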

I wonder if relocations can be resolved fully incrementally? Can't think of any obstacles right now, but just curious if something like e.g. the sequential jump instruction allocation dependencies in link-time relocation also exists for run-time relocation.

Also, if one is going to implement a dynamic linker, then it needs to replace the glibc ld.so component, which has some ABI dependency on glibc internals. It should not stop someone from making a prototype, but to make it widely adoptable we will eventually need to tackle that problem too.

ishitatsuyuki avatar Jul 13 '22 14:07 ishitatsuyuki

Re: dynamic linker

I think we should seriously consider writing our own dynamic linker and statically linking it into main executables, so that we don't need to depend on whatever dynamic linker the execution environment provides. Dynamic linkers improve extremely slowly and are almost fossilized. There are many things we can do in theory but can't do in practice because we can't replace the dynamic linker. So I think there's an opportunity to disrupt it. The separation of "loaders" and "user-land programs" is not absolutely necessary; we can change that.

As an example, consider QUIC and TCP. QUIC is implemented in user-land as part of applications and evolves quickly, while TCP is in the kernel and fossilized (a networking term meaning that we can't change the protocol anymore).

rui314 avatar Jul 13 '22 14:07 rui314

The problem is that writing your own dynamic linker means replacing the libc. dlopen in glibc simply doesn't work without glibc's ld.so, and I won't be surprised if other functionality that seems unrelated to dlopen also depends on some ld.so behavior (thinking of NSS).

I'm not against the idea of having a new dynamic linker, just that perhaps we should talk to the glibc people so we can for example get a linker-libc ABI that isn't undocumented and proprietary (which is how it is today).

ishitatsuyuki avatar Jul 13 '22 14:07 ishitatsuyuki

Maybe we can just use musl?

rui314 avatar Jul 13 '22 15:07 rui314

Given glibc's track record of overengineering stuff (NSS is one example), I agree that musl would be a much better starting point. musl currently does not support dlopen in statically linked binaries, but given a codebase cleaner than GNU's, I think it will be feasible to make it work with our in-house (static) dynamic linker.

ishitatsuyuki avatar Jul 13 '22 15:07 ishitatsuyuki

Many Linux systems use non-glibc libcs (e.g. Android), so it should be viable. It feels a bit crazy to write our own dynamic linker, but writing a GNU-compatible static linker also felt crazy when I heard of the lld project, so I'm not too worried about it. It's doable if we are serious, and maybe there are lots of opportunities along the way. I don't yet know if it's a good idea, but I like that kind of crazy idea. Interesting ideas that people haven't explored yet are hints of opportunities.

rui314 avatar Jul 13 '22 15:07 rui314

Windows does, or did, exactly what OP is proposing. https://devblogs.microsoft.com/oldnewthing/20041217-00/?p=36953 https://devblogs.microsoft.com/oldnewthing/20160413-00/?p=93301

If it's good enough for both Windows and Mac, it's certainly worth investigating on Linux. Worst case, it ends up not working, but we'll learn why it doesn't work.

Alcaro avatar Jul 13 '22 15:07 Alcaro

I wonder if eBPF can be applied to this problem. If fixing up base relocations is really that easy (taking your word on it), the kernel might be able to provide a hook to apply fixups. Based on my bpf / kernel hacking experience, it doesn't sound too difficult to implement. The API would be a different story.

danobi avatar Jul 13 '22 18:07 danobi

@Alcaro

Thanks for the info! I didn't know that. The format of base relocation information in PE/COFF is indeed designed to make page-in relocation easy.
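To show why the PE/COFF layout suits page-in relocation, here is a small C sketch that applies one `.reloc` block. Each block starts with a 32-bit page RVA and a 32-bit block size, followed by 16-bit entries whose top 4 bits are the type and low 12 bits are the offset within that 4 KiB page. The function name `apply_pe_reloc_block` and the enum names are made up; the numeric values mirror `IMAGE_REL_BASED_ABSOLUTE`, `IMAGE_REL_BASED_HIGHLOW`, and `IMAGE_REL_BASED_DIR64`. The sketch assumes a little-endian host and skips slots that would straddle the page boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values mirror IMAGE_REL_BASED_* from the PE format; names are local. */
enum { REL_ABSOLUTE = 0, REL_HIGHLOW = 3, REL_DIR64 = 10 };

/* Apply one .reloc block to the 4 KiB page it describes.
   Stores the block's page RVA in *page_rva and returns the block size,
   so a caller can step to the next block. */
static uint32_t apply_pe_reloc_block(uint8_t *page, const uint8_t *block,
                                     uint64_t delta, uint32_t *page_rva)
{
    uint32_t size;
    memcpy(page_rva, block, 4);
    memcpy(&size, block + 4, 4);

    for (uint32_t pos = 8; pos + 2 <= size; pos += 2) {
        uint16_t entry;
        memcpy(&entry, block + pos, 2);
        unsigned type = entry >> 12;     /* top 4 bits */
        unsigned off = entry & 0x0fffu;  /* offset within the page */

        if (type == REL_DIR64 && off + 8 <= 4096) {
            uint64_t v;
            memcpy(&v, page + off, 8);
            v += delta;
            memcpy(page + off, &v, 8);
        } else if (type == REL_HIGHLOW && off + 4 <= 4096) {
            uint32_t v;
            memcpy(&v, page + off, 4);
            v += (uint32_t)delta;
            memcpy(page + off, &v, 4);
        }
        /* REL_ABSOLUTE is padding. Other types, and slots that cross
           the page boundary, are ignored in this sketch. */
    }
    return size;
}
```

Since every block names the page it belongs to, a pager only needs the block for the page being faulted in and never has to scan the whole relocation table.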

@danobi Yeah, that's what I was thinking too. I don't know if we need that complex mechanism, but this is the situation in which "we just want to run this small piece of code in the kernel when this event occurs", so it looks like eBPF could be an option.

rui314 avatar Jul 14 '22 04:07 rui314

The kernel page-in technique is indeed interesting.

There are several ways to encode a relative relocation. In terms of space efficiency, I think the order, from least to most efficient, is:

  • ELF REL/RELA (most 64-bit architectures unfortunately use RELA for dynamic relocations)
  • PE/COFF .reloc, 16-bit for one entry
  • Mach-O __LINKEDIT,__rebase, using bytecode
  • ELF RELR

RELR is the most efficient one. I have notified some distributions to adopt RELR when they upgrade to glibc 2.36 (scheduled for 2022-08-01): https://maskray.me/blog/2021-10-31-relative-relocations-and-relr "When will Linux distributions adopt RELR?"

If the page-in technique makes a significant difference on Linux, we can let rtld communicate the (DT_RELR, DT_RELRSZ) information to the kernel. It seems that both userspace and the kernel may need to parse RELR.
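For anyone unfamiliar with RELR, a decoder is short. The C sketch below follows the 64-bit encoding: an even entry is the address of one slot, and an odd entry is a bitmap in which each set bit above bit 0 marks one of the next 63 word-size slots. The function names `apply_relr` and `add_delta`, and the use of a plain byte buffer with offsets instead of a real mapped image, are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add `delta` to the 64-bit slot at `slot` (hypothetical helper). */
static void add_delta(uint8_t *slot, uint64_t delta)
{
    uint64_t v;
    memcpy(&v, slot, sizeof v);
    v += delta;
    memcpy(slot, &v, sizeof v);
}

/* Decode `n` RELR entries and relocate `image` by `delta`.
   Entry values are treated as byte offsets from the start of `image`.
   Returns the number of slots that were relocated. */
static size_t apply_relr(uint8_t *image, const uint64_t *relr, size_t n,
                         uint64_t delta)
{
    size_t applied = 0;
    uint64_t where = 0; /* offset of the next slot a bitmap refers to */

    for (size_t i = 0; i < n; i++) {
        uint64_t entry = relr[i];
        if ((entry & 1) == 0) {
            /* Address entry: relocate this slot, bitmaps follow it. */
            add_delta(image + entry, delta);
            where = entry + 8;
            applied++;
        } else {
            /* Bitmap entry: bit j+1 set means slot `where + 8*j`. */
            uint64_t bits = entry >> 1;
            for (unsigned j = 0; bits != 0; j++, bits >>= 1) {
                if (bits & 1) {
                    add_delta(image + where + 8u * j, delta);
                    applied++;
                }
            }
            where += 8u * 63u; /* a bitmap always covers 63 slots */
        }
    }
    return applied;
}
```

One consequence for page-in use is that a bitmap entry is only meaningful relative to the address entry before it, so a kernel-side decoder would probably want some index from page to RELR position rather than decoding from the start on every fault.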

MaskRay avatar Jul 14 '22 05:07 MaskRay

> The kernel page-in technique is indeed interesting.
>
> There are several ways to encode a relative relocation. In terms of space efficiency, I think the sorted order is:
>
> * ELF REL/RELA (most 64-bit architectures unfortunately use RELA for dynamic relocations)
> * PE/COFF `.reloc`, 16-bit for one entry
> * Mach-O `__LINKEDIT,__rebase`, using bytecode

Hmm, did you actually mean the rebase opcodes REBASE_OPCODE_* encoded within a blob pointed to by the LC_DYLD_INFO_ONLY load command? AFAIK Apple is now abandoning this approach in favour of fixup chains with its latest ld64 on arm64. I haven't analysed what it actually looks like after parsing, but I believe the claim was that it's meant to be even more space-efficient than storing REBASE_OPCODE_* opcodes (as it's relative-offset-based, much like GOT entries, which are now offsets and not pointers). For example, we can observe this load command in output produced by ld64 on arm64 Macs:

Load command 5
      cmd LC_DYLD_CHAINED_FIXUPS
  cmdsize 16
  dataoff 84918272
 datasize 173960

kubkon avatar Jul 14 '22 09:07 kubkon

Just found some more documentation around the topic of relocation by the Chromium team.

https://chromium.googlesource.com/chromium/src/+/master/docs/native_relocations.md#Linux-Android-Relocations-ELF-Format

Quoting the numbers, the dirty page size coming from relocations is around 6MB and startup impact is about 20ms.

ishitatsuyuki avatar Jul 15 '22 03:07 ishitatsuyuki

Thanks for the link. That's very interesting. I believe Android is one of the best platforms to implement the page-in linking for various reasons:

  1. It's Linux-based but Google controls it, so it can (reasonably) freely improve the loader and the kernel,
  2. it's often memory-constrained, and
  3. I'm pretty sure that Apple will land page-in linking in the next version of iOS because they have already done it for macOS, and Google has an incentive to do the same in order to compete with that.

I'm no longer working for Google and I've never worked for the Android team, though. But if they want to implement it, I'm happy to join a discussion.

rui314 avatar Jul 15 '22 04:07 rui314

The above points also apply to ChromeOS as it's controlled by a single company and often runs on memory-constrained machines.

rui314 avatar Jul 15 '22 09:07 rui314

I believe you're looking for userfaultfd (https://man7.org/linux/man-pages/man2/userfaultfd.2.html). It's already upstream and used for VM and container migrations.

pkhuong avatar Jul 19 '22 03:07 pkhuong

userfaultfd wouldn't work because application developers have to be aware that there's a thread that handles page faults. That's not compatible with many applications. For example, if you call fork(2), the background thread will disappear.

rui314 avatar Jul 19 '22 04:07 rui314

I was doing some random browser stack digging which led me to revisit this.

Quoting the Chromium docs again, they have a thing called Zygote, which is basically a process template that gets fork()ed when a new process is needed (Chromium launches tons of processes for site isolation). Firefox has also implemented this under the name "fork server" (arguably a better name).

An approach like Windows' is certainly powerful, since it automatically enables sharing for all processes in the system. On the other hand, it obviously has a bar to adoption (for Linux), and Zygote wins on this end for fitting better into the current POSIX model.

(Off-topic: Given the extensive executable code sharing used in Chromium, #466 is probably something interesting to try.)

ishitatsuyuki avatar Dec 06 '22 16:12 ishitatsuyuki

> Maybe we can just use musl?

Bit late to the party on this issue, but there is also mlibc, which has a dynamic linker written in (vaguely) modern C++ that is slightly more featureful than musl's. Could be a better starting point.

64 avatar Dec 31 '22 19:12 64

> Quoting the Chromium docs again, they have a thing called Zygote, which is basically a process template that gets fork()ed when a new process is needed (Chromium launches tons of processes for site isolation). Firefox has also implemented this under the name "fork server" (arguably a better name).

Zygote is also used for sandboxing apps on Android.

leap0x7b avatar Apr 01 '23 07:04 leap0x7b