x86_64
WIP: FS/GS-based TLS access abstraction
I was trying to use FS/GS-based Thread Local Storage (TLS) and I noticed we don't have an easy way to view the FS or GS register as a structure (a common pattern in OS-dev).
This patch would allow for the following code:
/// Kernel per-CPU layout
struct PerCpu {
    stack: VirtAddr,
    id: u32,
}

// Declare "gs looks like PerCpu"
static PER_CPU: Wrapper<GS, PerCpu> = Wrapper::new();

// Kernel needs to call this for each thread
PER_CPU.init(Some(alloc_tls()));

// Fields can now be read/written
unsafe fn set_stack(rsp: VirtAddr) {
    tls_write!(PER_CPU, stack, rsp);
}

fn get_id() -> u32 {
    unsafe {
        tls_read!(PER_CPU, id)
    }
}
I'm not sure exactly what we want from the API, but I wanted to get feedback on the design sooner rather than later. Issues to discuss:
- Which ops should be safe/unsafe?
- Should reading/writing any field work? Or should we restrict to just 1/2/4/8-byte wide fields?
- Should the PerCpu struct have to implement some sort of trait?
- Should we have a way to just read/write the entire structure (say tls_read!(PER_CPU))?
- Naming (Tls vs Wrapper, tls_read! vs per_cpu_read!)
- Which submodule should this live in? (instructions::tls? instructions::segmentation?)
This implementation seems to compile to fairly efficient code. See this Rust implementation vs a C implementation (using __seg_gs). The C implementation is still slightly better, which isn't surprising given that __seg_gs has LLVM support (on the other hand, I was able to get __seg_gs to crash clang, so it's quite unstable).
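For readers unfamiliar with the mechanism, here is a minimal sketch of the kind of inline assembly such a macro can expand to. This is illustrative only, not the exact code in this patch: the read_gs_u64 helper and the const-generic offset are made up.

// Illustrative only: read an 8-byte field at a compile-time offset from the
// GS base. The real macro would derive the offset from the field name.
use core::arch::asm;

/// # Safety
/// GS must point to a valid structure containing a u64 at byte `OFFSET`.
unsafe fn read_gs_u64<const OFFSET: i32>() -> u64 {
    let val: u64;
    asm!(
        "mov {val}, qword ptr gs:[{off}]",
        off = const OFFSET,
        val = out(reg) val,
        options(nostack, preserves_flags, readonly),
    );
    val
}

// e.g. read_gs_u64::<8>() compiles to roughly `mov rax, qword ptr gs:[8]`.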
TODO: Docs, tests, finish design
Signed-off-by: Joe Richey [email protected]
Thanks a lot for proposing this! The design proposed here looks good to me overall. The assembly comparison is really cool, it's great to see that the use of inline assembly still leads to optimized code.
As I understand it, there is also native LLVM support for thread local data, which Rust exposes through the unstable #[thread_local] attribute. The problem with this seems to be that only fs is supported, which is typically used for user-space TLS (kernel-level TLS normally uses gs). Is this correct?
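(For context, that unstable attribute looks like the following on nightly; the static name is just an example:)

#![feature(thread_local)] // nightly-only

// Each thread gets its own copy; on x86_64 user space this is typically
// lowered by LLVM to fs-relative accesses.
#[thread_local]
static COUNTER: u32 = 0;

fn read_counter() -> u32 {
    COUNTER
}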
One thing that I'm curious about is how the performance of your proposal compares to storing just a pointer at gs offset 0, which then points to the actual TLS struct. This would require a second memory lookup, but might allow the compiler to optimize it better since the second access is just a normal memory access. I tried this approach here: https://godbolt.org/z/W4a4YxMPK . As expected, there are now two memory accesses every time, which is probably worse for smaller types (it might be worth it to get some actual performance data on this). However, for larger types, it leads to much better code. For example, consider the one_page function, which accesses a single u64 of the page field:
example::one_page:
        xor eax, eax
        mov rax, qword ptr gs:[rax]
        mov rax, qword ptr [rax + 2784]
        ret
Without a pointer, we have to load the full page array first before accessing the field, which the compiler cannot optimize away: https://godbolt.org/z/hMrWdsE1Y
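Roughly, the shape of this pointer-based approach looks like the following (placeholder names and sizes; not necessarily the exact code behind the godbolt links):

// Sketch of the "pointer at gs:[0]" approach (placeholder field names/sizes):
use core::arch::asm;

#[repr(C)]
struct PerCpu {
    id: u64,
    page: [u64; 512], // a large field, e.g. a whole page of data
}

/// Load the `*mut PerCpu` that the kernel stored at gs:[0].
unsafe fn per_cpu() -> *mut PerCpu {
    let ptr: *mut PerCpu;
    asm!(
        "mov {ptr}, qword ptr gs:[0]",
        ptr = out(reg) ptr,
        options(nostack, preserves_flags, readonly),
    );
    ptr
}

/// One load from gs:[0], then an ordinary indexed load that the compiler
/// can optimize like any other memory access.
unsafe fn one_page() -> u64 {
    (*per_cpu()).page[42]
}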
- Should reading/writing any field work? Or should we restrict to just 1/2/4/8-byte wide fields?
I think for values > 8 bytes it makes more sense to go through a pointer for the mentioned performance reasons. Or maybe it is possible to adjust the macro to also support loading a specific array element or nested struct field directly?
We should also think about values that contain padding bytes. For example, does the compiler require them to have a specific value (e.g. 0)?
- Should we have a way to just read/write the entire structure (say tls_read!(PER_CPU))?
Given that this is probably a very costly operation (the optimizer won't remove any of the inline_asm), I don't think that it's a good idea to support this through the same macro. If needed, we could add a separate macro for this that documents the potential performance problems (and also suggests storing a pointer as an alternative).
- Which ops should be safe/unsafe?
The Wrapper::init method definitely needs to be unsafe because the given address must be valid (mapped, not aliased, etc.). Also, the TLS structure type must be unique within a thread (mixing wrapper types can lead to silent transmutes). The general read/write methods need to be unsafe too because they allow arbitrary pointers/offsets as argument.
We might be able to make the tls_read!/tls_write! macros safe, but this would require some additional checks. For one, we need to assert that the FS/GS base is initialized before access. A simple is_null check would only work if Wrapper::new initializes the FS/GS base with 0. Alternatively, we could add an AtomicBool flag to the Wrapper type itself.
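To illustrate, here is a minimal sketch of such a check, assuming an uninitialized thread has a zero base and reusing the PER_CPU wrapper from the PR description (the checked_get_id name is made up):

use x86_64::registers::segmentation::{Segment64, GS};

fn checked_get_id() -> u32 {
    // Only sound if an uninitialized thread is guaranteed to have a zero base.
    assert!(
        GS::read_base().as_u64() != 0,
        "TLS accessed before init() on this thread"
    );
    unsafe { tls_read!(PER_CPU, id) }
}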
- Naming (Tls vs Wrapper, tls_read! vs per_cpu_read!)
I think I prefer the term TLS over "per cpu" since the fs/gs value is typically set per thread (and updated accordingly by the context switching code).
- Which submodule should this live in? (instructions::tls? instructions::segmentation?)
I'm fine with creating a separate tls submodule. We can link it from the wrgsbase etc. methods.
Amazing!! It would be nice to be able to do this without needing to use fancy macros, because that could lead to lots of inlined code. Right? Personally I think something like this would be efficient:
static LOCAL: CpuLocal<Something> = CpuLocal::new(CpuLocalBase::FSBase, || Something::new_uninit());

fn test() {
    LOCAL.get_mut().something = 10;
    //    ^^^^^^^^^
    //    Returns &mut Something
    *LOCAL.get_mut() = NEW_THING;
}

impl<T> CpuLocal<T> {
    pub fn new(base: CpuLocalBase, lazy: fn() -> T) -> CpuLocal<T>;
    pub fn get_mut(&self) -> &mut T;
    pub fn get(&self) -> &T;
    pub fn get_offset(&self) -> VirtAddr;
}
Amazing!! It would be nice to be able to do this without needing to use fancy macros, because that could lead to lots of inlined code. Right?
I agree, the macros should only be needed if you need maximal performance. The basic idea would be to have the TLS structure contain a pointer to itself (as @phil-opp mentioned above, stored at some fixed offset). This is what Linux does and what Fuchsia does on x86_64 (other architectures don't need such a pointer).
The basic idea would then be to have the following API (say in a tls module):
#[repr(C)]
pub struct Value<T> {
    ptr: *const T,
    val: T,
}

impl<T> Value<T> {
    const fn new(val: T) -> Self;
}

pub struct Wrapper<S, T>(PhantomData<(S, *mut Value<T>)>);

impl<S: Segment64, T> Wrapper<S, T> {
    pub const fn new() -> Self {
        Self(PhantomData)
    }

    /// Needs to be called on each thread to initialize the segment base
    pub unsafe fn init(&self, new: *mut Value<T>) -> *mut Value<T> {
        let old = S::read_base().as_mut_ptr();
        S::write_base(VirtAddr::from_ptr(new));
        old
    }

    /// Only safe to call if `init()` has been called on this thread
    pub unsafe fn with<R>(&'static self, f: impl FnOnce(&mut T) -> R) -> R {
        let base = S::read_u64_at_offset_zero() as *mut Value<T>;
        let r: &mut T = &mut (*base).val;
        f(r)
    }
}
@Andy-Python-Programmer would that sort of API work? I think the with method is needed for similar reasons to why LocalKey::with is needed. You still don't want multiple mutable aliases to the same value.
EDIT: I don't think there's a way to have some sort of lazy initialization function in the constructor. We wouldn't know when the value had been initialized, so we wouldn't know when to call the constructor. We can't use an AtomicBool flag in the Wrapper type as that would end up being a global flag rather than a per-thread flag.
Ah I see. The current read method takes a field. What if you want to compare two CPU local Arcs?
($wrapper:path, $field:tt) => {{

if Arc::ptr_eq(&new_process, &previous_process) {
    return false;
}
This is quite common, as we want the current process to be local to the CPU.
What if you want to compare two CPU local Arcs?
I guess you would need to copy them into the appropriate context. We could also add an as_ptr() method to Wrapper that returned a *const T. It would then be up to the user to make sure dereferencing that pointer was safe (and that init() had been called before as_ptr).
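To sketch what that could look like, extending the Wrapper/Value sketch above (the as_ptr method and the usage below are illustrative, not part of the patch):

impl<S: Segment64, T> Wrapper<S, T> {
    /// Returns a raw pointer to this thread's value.
    ///
    /// # Safety
    /// `init()` must already have been called on this thread, and the caller
    /// is responsible for any dereference (including not creating aliasing
    /// mutable references).
    pub unsafe fn as_ptr(&'static self) -> *const T {
        let base = S::read_u64_at_offset_zero() as *const Value<T>;
        core::ptr::addr_of!((*base).val)
    }
}

// e.g. with a `current_process: Arc<Process>` field in the TLS struct
// (hypothetical), the comparison could become:
//
//     let local = unsafe { &*LOCAL.as_ptr() };
//     if Arc::ptr_eq(&local.current_process, &previous_process) {
//         return false;
//     }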
All good!
I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a followup PR. The code generated is fairly good for both regardless, so there's only a marginal benefit to the macro approach.
For example, consider this setup for GS-based thread-local:
#[repr(C)]
struct ThreadLocal {
    a: u64,
    b: u32,
}

const LOCAL: Wrapper<GS, ThreadLocal> = Wrapper::new();
Using the with method:
pub fn slow_get_a() -> u64 {
    unsafe { LOCAL.with(|t| t.a) }
}

pub fn slow_set_b(v: u32) {
    unsafe { LOCAL.with(|t| t.b = v) }
}
gives code that uses two movs:
example::slow_get_a:
        mov rax, qword ptr gs:[0]
        mov rax, qword ptr [rax]
        ret
example::slow_set_b:
        mov rax, qword ptr gs:[0]
        mov dword ptr [rax + 8], edi
        ret
Using the macros:
pub fn fast_get_a() -> u64 {
    unsafe { tls_read!(LOCAL.a) }
}

pub fn fast_set_b(v: u32) {
    unsafe { tls_write!(LOCAL.b, v) }
}
gives code equal to the C equivalent:
example::fast_get_a:
        mov rax, qword ptr gs:[8]
        ret
example::fast_set_b:
        mov dword ptr gs:[16], edi
        ret
The macro actually produces more optimised assembly. I see.
I think I'm going to initially submit a PR that adds the non-macro changes, and then do the macro stuff in a followup PR. The code generated is fairly good for both regardless, so there's only a marginal benefit to the macro approach.
Sounds good to me!