x86_64
WIP: FS/GS-based TLS access abstraction
I was trying to use FS/GS-based Thread Local Storage (TLS) and I noticed we don't have an easy way to view the FS or GS register as a structure (a common pattern in OS-dev).
This patch would allow for the following code:
/// Kernel per-CPU layout
struct PerCpu {
    stack: VirtAddr,
    id: u32,
}

// Declare "gs looks like PerCpu"
static PER_CPU: Wrapper<GS, PerCpu> = Wrapper::new();

// Kernel needs to call this for each thread
PER_CPU.init(Some(alloc_tls()));

// Fields can now be read/written
unsafe fn set_stack(rsp: VirtAddr) {
    tls_write!(PER_CPU, stack, rsp);
}

fn get_id() -> u32 {
    unsafe {
        tls_read!(PER_CPU, id)
    }
}
I'm not sure exactly what we want from the API, but I wanted to get feedback on the design sooner rather than later. Issues to discuss:
- Which ops should be safe/unsafe?
- Should reading/writing any field work? Or should we restrict to just 1/2/4/8-byte wide fields?
- Should the PerCpu struct have to implement some sort of trait?
- Should we have a way to just read/write the entire structure (say tls_read!(PER_CPU))?
- Naming (Tls vs Wrapper, tls_read! vs per_cpu_read!)
- Which submodule should this live in? (instructions::tls? instructions::segmentation?)
This implementation seems to compile to fairly efficient code. See this Rust implementation vs a C implementation (using __seg_gs). The C implementation is still slightly better, which isn't surprising given that __seg_gs has LLVM support (on the other hand, I was able to get __seg_gs to crash clang, so it's quite unstable).
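For readers unfamiliar with the mechanism, here is a minimal sketch of the kind of inline assembly such a macro can expand to. This is illustrative only, not the exact code in this patch: the read_gs_u64 helper and the const-generic offset are made up.

// Illustrative only: read an 8-byte field at a compile-time offset from the
// GS base. The real macro would derive the offset from the field name.
use core::arch::asm;

/// # Safety
/// GS must point to a valid structure containing a u64 at byte `OFFSET`.
unsafe fn read_gs_u64<const OFFSET: i32>() -> u64 {
    let val: u64;
    asm!(
        "mov {val}, qword ptr gs:[{off}]",
        off = const OFFSET,
        val = out(reg) val,
        options(nostack, preserves_flags, readonly),
    );
    val
}

// e.g. read_gs_u64::<8>() compiles to roughly `mov rax, qword ptr gs:[8]`.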
TODO: Docs, tests, finish design
Signed-off-by: Joe Richey [email protected]
Thanks a lot for proposing this! The design proposed here looks good to me overall. The assembly comparison is really cool, it's great to see that the use of inline assembly still leads to optimized code.
As I understand it, there is also native LLVM support for thread local data, which Rust exposes through the unstable #[thread_local] attribute. The problem with this seems to be that only fs is supported, which is typically used for user-space TLS (kernel-level TLS normally uses gs). Is this correct?
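(For context, that unstable attribute looks like the following on nightly; the static name is just an example:)

#![feature(thread_local)] // nightly-only

// Each thread gets its own copy; on x86_64 user space this is typically
// lowered by LLVM to fs-relative accesses.
#[thread_local]
static COUNTER: u32 = 0;

fn read_counter() -> u32 {
    COUNTER
}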
One thing that I'm curious about is how the performance of your proposal compares to storing just a pointer at gs offset 0, which then points to the actual TLS struct. This would require a second memory lookup, but might allow the compiler to optimize it better since the second access is just a normal memory access. I tried this approach here: https://godbolt.org/z/W4a4YxMPK . As expected, there are now two memory accesses every time, which is probably worse for smaller types (it might be worth it to get some actual performance data on this). However, for larger types, it leads to much better code. For example, consider the one_page function, which accesses a single u64 of the page field:
example::one_page:
        xor eax, eax
        mov rax, qword ptr gs:[rax]
        mov rax, qword ptr [rax + 2784]
        ret
Without a pointer, we have to load the full page array first before accessing the field, which the compiler cannot optimize away: https://godbolt.org/z/hMrWdsE1Y
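Roughly, the shape of this pointer-based approach looks like the following (placeholder names and sizes; not necessarily the exact code behind the godbolt links):

// Sketch of the "pointer at gs:[0]" approach (placeholder field names/sizes):
use core::arch::asm;

#[repr(C)]
struct PerCpu {
    id: u64,
    page: [u64; 512], // a large field, e.g. a whole page of data
}

/// Load the `*mut PerCpu` that the kernel stored at gs:[0].
unsafe fn per_cpu() -> *mut PerCpu {
    let ptr: *mut PerCpu;
    asm!(
        "mov {ptr}, qword ptr gs:[0]",
        ptr = out(reg) ptr,
        options(nostack, preserves_flags, readonly),
    );
    ptr
}

/// One load from gs:[0], then an ordinary indexed load that the compiler
/// can optimize like any other memory access.
unsafe fn one_page() -> u64 {
    (*per_cpu()).page[42]
}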
- Should reading/writing any field work? Or should we restrict to just 1/2/4/8-byte wide fields?
I think for values > 8 bytes it makes more sense to go through a pointer for the mentioned performance reasons. Or maybe it is possible to adjust the macro to also support loading a specific array element or nested struct field directly?
We should also think about values that contain padding bytes. For example, does the compiler require them to have a specific value (e.g. 0)?
- Should we have a way to just read/write the entire structure (say tls_read!(PER_CPU))?
Given that this is probably a very costly operation (the optimizer won't remove any of the inline_asm), I don't think that it's a good idea to support this through the same macro. If needed, we could add a separate macro for this that documents the potential performance problems (and also suggests storing a pointer as an alternative).
- Which ops should be safe/unsafe?
The Wrapper::init method definitely needs to be unsafe because the given address must be valid (mapped, not aliased, etc.). Also, the TLS structure type must be unique within a thread (mixing wrapper types can lead to silent transmutes). The general read/write methods need to be unsafe too because they allow arbitrary pointers/offsets as argument.
We might be able to make the tls_read!/tls_write! macros safe, but this would require some additional checks. For one, we need to assert that the FS/GS base is initialized before access. A simple is_null check would only work if Wrapper::new initializes the FS/GS base with 0. Alternatively, we could add an AtomicBool flag to the Wrapper type itself.
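To illustrate, here is a minimal sketch of such a check, assuming an uninitialized thread has a zero base and reusing the PER_CPU wrapper from the PR description (the checked_get_id name is made up):

use x86_64::registers::segmentation::{Segment64, GS};

fn checked_get_id() -> u32 {
    // Only sound if an uninitialized thread is guaranteed to have a zero base.
    assert!(
        GS::read_base().as_u64() != 0,
        "TLS accessed before init() on this thread"
    );
    unsafe { tls_read!(PER_CPU, id) }
}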
- Naming (Tls vs Wrapper, tls_read! vs per_cpu_read!)
I think I prefer the term TLS over "per cpu" since the fs/gs value is typically set per thread (and updated accordingly by the context switching code).
- Which submodule should this live in? (instructions::tls? instructions::segmentation?)
I'm fine with creating a separate tls submodule. We can link it from the wrgsbase etc. methods.
Amazing!! It would be nice to be able to do this without needing to use fancy macros, because that could lead to lots of inlined code. Right? Personally I think something like this would be efficient:
static LOCAL: CpuLocal<Something> = CpuLocal::new(CpuLocalBase::FSBase, || Something::new_uninit());

fn test() {
    LOCAL.get_mut().something = 10;
    //    ^^^^^^^^^
    //    Returns &mut Something
    *LOCAL.get_mut() = NEW_THING;
}

impl<T> CpuLocal<T> {
    pub fn new(base: CpuLocalBase, lazy: fn() -> T) -> CpuLocal<T>;
    pub fn get_mut(&self) -> &mut T;
    pub fn get(&self) -> &T;
    pub fn get_offset(&self) -> VirtAddr;
}
Amazing!! It would be nice to be able to do this without needing to use fancy macros, because that could lead to lots of inlined code. Right?
I agree, the macros should only be needed if you need maximal performance. The basic idea would be to have the TLS structure contain a pointer to itself (as @phil-opp mentioned above, stored at some fixed offset). This is what Linux does and what Fuchsia does on x86_64 (other architectures don't need such a pointer).
The basic idea would then be to have the following API (say in a tls module):
#[repr(C)]
pub struct Value<T> {
    ptr: *const T,
    val: T,
}

impl<T> Value<T> {
    const fn new(val: T) -> Self;
}

pub struct Wrapper<S, T>(PhantomData<(S, *mut Value<T>)>);

impl<S: Segment64, T> Wrapper<S, T> {
    pub const fn new() -> Self {
        Self(PhantomData)
    }

    /// Needs to be called on each thread to initialize the segment base
    pub unsafe fn init(&self, new: *mut Value<T>) -> *mut Value<T> {
        let old = S::read_base().as_mut_ptr();
        S::write_base(VirtAddr::from_ptr(new));
        old
    }

    /// Only safe to call if `init()` has been called on this thread
    pub unsafe fn with<R>(&'static self, f: impl FnOnce(&mut T) -> R) -> R {
        let base = S::read_u64_at_offset_zero() as *mut Value<T>;
        let r: &mut T = &mut (*base).val;
        f(r)
    }
}
@Andy-Python-Programmer would that sort of API work? I think the with method is needed for similar reasons to why LocalKey::with is needed. You still don't want multiple mutable aliases to the same value.
EDIT: I don't think there's a way to have some sort of lazy initialization function in the constructor. We wouldn't know when the value had been initialized, so we wouldn't know when to call the constructor. We can't use an AtomicBool flag in the Wrapper type as that would end up being a global flag rather than a per-thread flag.
Ah I see. The current read method takes a field. What if you want to compare two CPU local Arcs?
($wrapper:path, $field:tt) => {{

if Arc::ptr_eq(&new_process, &previous_process) {
    return false;
}
This is quite common, as we want the current process to be local to the CPU.
What if you want to compare two CPU local Arcs?
I guess you would need to copy them into the appropriate context. We could also add an as_ptr() method to Wrapper that returned a *const T. It would then be up to the user to make sure dereferencing that pointer was safe (and that init() had been called before as_ptr).
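To sketch what that could look like, extending the Wrapper/Value sketch above (the as_ptr method and the usage below are illustrative, not part of the patch):

impl<S: Segment64, T> Wrapper<S, T> {
    /// Returns a raw pointer to this thread's value.
    ///
    /// # Safety
    /// `init()` must already have been called on this thread, and the caller
    /// is responsible for any dereference (including not creating aliasing
    /// mutable references).
    pub unsafe fn as_ptr(&'static self) -> *const T {
        let base = S::read_u64_at_offset_zero() as *const Value<T>;
        core::ptr::addr_of!((*base).val)
    }
}

// e.g. with a `current_process: Arc<Process>` field in the TLS struct
// (hypothetical), the comparison could become:
//
//     let local = unsafe { &*LOCAL.as_ptr() };
//     if Arc::ptr_eq(&local.current_process, &previous_process) {
//         return false;
//     }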
All good!
I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a followup PR. The code generated is fairly good for both regardless, so there's only a marginal benefit to the macro approach.
For example, consider this setup for GS-based thread-local:
#[repr(C)]
struct ThreadLocal {
    a: u64,
    b: u32,
}

const LOCAL: Wrapper<GS, ThreadLocal> = Wrapper::new();
Using the with method:
pub fn slow_get_a() -> u64 {
    unsafe { LOCAL.with(|t| t.a) }
}

pub fn slow_set_b(v: u32) {
    unsafe { LOCAL.with(|t| t.b = v) }
}
gives code that uses two movs:
example::slow_get_a:
        mov rax, qword ptr gs:[0]
        mov rax, qword ptr [rax]
        ret
example::slow_set_b:
        mov rax, qword ptr gs:[0]
        mov dword ptr [rax + 8], edi
        ret
Using the macros:
pub fn fast_get_a() -> u64 {
    unsafe { tls_read!(LOCAL.a) }
}

pub fn fast_set_b(v: u32) {
    unsafe { tls_write!(LOCAL.b, v) }
}
gives code equal to the C equivalent:
example::fast_get_a:
        mov rax, qword ptr gs:[8]
        ret
example::fast_set_b:
        mov dword ptr gs:[16], edi
        ret
The macro actually produces more optimised assembly. I see.
I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a followup PR. The code generated is fairly good for both regardless, so there's only a marginal benefit to the macro approach.
Sounds good to me!