libctru
libctru copied to clipboard
TLS memory errors with aligned thread-local variables
Born from: https://github.com/Meziu/ctru-rs/issues/60
Description
After some testing on our Rust libraries for this system, we've noticed a weird bug with the use of thread-local storage. With some investigation, we have pinpointed the problem to the use of aligned thread-local variables. As our tests show, the offset of the variable pointers are wrong when one or more aligned TLS variables exist, causing memory leaks, access of uninitialized memory etc. This issue appears in both Rust and C programs.
Furthermore, this issue seems to only affect (at least in our examples) programs with a debug
optimization level.
E.g. :
debug
mode in Rust reproduces the bug.
release
mode in Rust doesn't reproduce the bug.
-Og
in C reproduces the bug.
-O2
in C doesn't reproduce the bug.
Here is a test program to try the issue. Remember to use -Og
to replicate:
#include <3ds.h>
#include <stdio.h>
#include <string.h>
typedef ALIGN(4) struct {
u8 inner[3];
} Align4;
typedef ALIGN(16) struct {
u8 inner[3];
} Align16;
static __thread Align4 BUF_4 = {.inner = {2, 2, 2}};
static __thread Align16 BUF_16 = {.inner = {1, 1, 1}};
int
main(int argc, char** argv)
{
gfxInitDefault();
consoleInit(GFX_TOP, NULL);
BUF_16.inner[0] = 0;
bool reproduced = false;
// Check if at least one byte has been offset improperly
printf("[");
for (int i = 0; i < 3; i++) {
if (BUF_4.inner[i] != 2) {
reproduced = true;
}
printf("%d, ", BUF_4.inner[i]);
}
printf("]\n");
if (reproduced) {
printf("Bug has been reproduced!\n");
}
else {
printf("There has been no issue!");
}
// Main loop
while (aptMainLoop()) {
gspWaitForVBlank();
hidScanInput();
u32 kDown = hidKeysDown();
if (kDown & KEY_START)
break; // break in order to return to hbmenu
// Flush and swap framebuffers
gfxFlushBuffers();
gfxSwapBuffers();
}
gfxExit();
return 0;
}
Notes about our Rust toolchain
Our Rust programs mainly use libctru
(to substitute missing newlib
functions), and the arm-none-eabi-gcc
linker. Most of the linking flags are the same (basically those in a normal Makefile but in the Rust target-specification). Either way, it's hard to believe this one to be an issue in either of the compilers (rustc
and gcc
) as the issue reproduces in both. It's likely an issue with libctru
or the linker.
Generally, 2*sizeof(void*)
is the maximum alignment supported by the TLS section, as a side effect of how it's allocated. I'm not sure there is a way to know at runtime which alignment is required by the TLS section. Also note that superaligned variables will cascade the alignment requirement into their parent section, which might cause memory waste. I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?
Re: non-C/C++ languages. It might sound disappointing, but it is not advised to attempt to bring them to devkitPro homebrew environments, as they constitute an endless source of build maintainability and reproducibility problems; not to mention problems caused by necessary and unavoidable API/ABI changes in our upstream, which encourages users of those 3rd party languages to damage their toolchain installations, and ultimately produces unhappy users that are unable to compile simple programs.
Generally,
2*sizeof(void*)
is the maximum alignment supported by the TLS section, as a side effect of how it's allocated. I'm not sure there is a way to know at runtime which alignment is required by the TLS section. Also note that superaligned variables will cascade the alignment requirement into their parent section, which might cause memory waste.
Thanks for the information! It is very interesting, but this explanation doesn't provide an answer to why there is a difference between low or high optimization levels.
I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?
I'd lie if I said I knew one. I have no idea why, but a math library (deep down in a dependency tree for other things I was testing) used alignment in TLS variables. It is important to note however how much this kind of functionality is used in Rust programs. Because of how easy it is to manage multi-threaded applications in Rust, many libraries include into their systems non-optional threading functionality. We are working to make managing 3DS preemptive threads just as easy, but to have native support for these kinds of things we need to make some compromises. For now, we have implemented (a quite wasteful, yet highly compatible) heap-based TLS system. Having out-of-the-box support for Rust's most popular libraries is one of our objectives.
Re: non-C/C++ languages. It might sound disappointing, but it is not advised to attempt to bring them to devkitPro homebrew environments, as they constitute an endless source of build maintainability and reproducibility problems; not to mention problems caused by necessary and unavoidable API/ABI changes in our upstream, which encourages users of those 3rd party languages to damage their toolchain installations, and ultimately produces unhappy users that are unable to compile simple programs.
We know, I know. It was clear since the start that maintaining our toolchain would be the hardest part (especially long term), which is why I always took some precautions before attempting any important decisions. We aim at abstracting libctru
code behind a standard Newlib implementation, to ensure longevity and need of small changes in Rust's official tree. If everything goes according to plan, we may have to never modify our code in the Rust std
, for example, and even if we had to, by using standard names and functions as any other Newlib implementation (mainly thanks to our multifills rust-linker-fix-3ds and pthread-3ds) it is easier to make contributors understand how the system works.
It's a big task, but we strive to overcome it, thanks to our efforts and those of the many contributors that have already joined us. Also, writing compiled code that doesn't give me nightmares of unspecified ARM panics is a dream I just can't give up on. Hope you can understand where we are coming from.
I'm not sure there is a way to know at runtime which alignment is required by the TLS section. Also note that superaligned variables will cascade the alignment requirement into their parent section, which might cause memory waste. I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?
@fincs would it be possible to set something like this in the linker script?
.tdata ALIGN(4) :
{
__tdata_lma = .;
LONG (ALIGNOF (.tdata));
*(.tdata)
*(.tdata.*)
*(.gnu.linkonce.td.*)
. = ALIGN(4);
__tdata_lma_end = .;
} : data
The first LONG
there could then be used at runtime, something like
Thread t = (Thread)memalign(*(size_t*)(__tdata_lma),allocsize+tlssize);
Obviously, the rest of the section would be offset by the extra data but I think that could be worked around in the implementation?
From a brief test, I can print the value and it appears to correctly reflect the max alignment, so maybe this would be a way to support arbitrary aligns in TLS? I'm not sure how most TLS implementations handle this, but the linker seems to me like the only place that would have the right information to insert for runtime usage...
I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?
I don't know too much about the details of the algorithm, or the performance implications, but it comes up in this change to add threading support to a matrix multiplication library: bluss/matrixmultiply@2ddd0ba
(#52).
I assume that the algorithm primarily targets x86 platforms but is meant to be portable and efficient, and it looks like they are using the TLS as some kind of buffer for for parallel calculations, maybe so the whole thing fits in a cache line or something like that? I don't know too much about the details or performance implications there.
After a fair bit of trial and error, I think I have a way to make the TLS implementation work for over-aligned variables like these.
I believe it's possible to make the changes in a backwards-compatible way as well, if necessary, and that the implementation I have does not incur a significant additional runtime cost compared to the existing implementation (other than the additional memory usage you mentioned, which should only affect executables that explicitly use these kinds of over-aligned TLS variables).
@fincs would you accept PRs here and in https://github.com/devkitPro/devkitarm-crtls to make this possible?
Sure, go for it. If all, I'd try to avoid placing the LONG()
value inside the tdata section if possible.
Sure, go for it. If all, I'd try to avoid placing the
LONG()
value inside the tdata section if possible.
What I have now puts it in the .rodata
instead (although I'm not sure that's the best section), since that doesn't throw off the size of .tdata
. I was originally going to just use a symbol without any LONG
at all (just using ABSOLUTE
and never dereferencing the symbol), but this doesn't seem to be supported by 3dsxtool
since it becomes an invalid relocation.
Do you think storing the value in a different section is okay, or should I try to work around 3dsxtool to make a "valid" relocation (like __tls_align = __tdata_start + ALIGNOF(...)
or something like that)?
should I try to work around 3dsxtool to make a "valid" relocation
I anticipated you would run into this problem. It is indeed not correct to try to expose the desired value as a symbol. Symbols are for addresses to "objects", not raw non-address values; and the linker (and 3dsxtool) will get confused otherwise, thinking there is an object at said address.
What I have now puts it in the .rodata instead (although I'm not sure that's the best section), (...) Do you think storing the value in a different section is okay
Seems fine to me. I had a look at LD documentation and it seems that the section must be defined prior to using the ALIGNOF()
directive -- maybe you found a workaround for that.
Seems fine to me. I had a look at LD documentation and it seems that the section must be defined prior to using the
ALIGNOF()
directive -- maybe you found a workaround for that.
Yeah, I didn't expect it to work when I first tried it, but I think the reason it can work is this:
The linker uses "lazy evaluation" for expressions; it only calculates an expression when absolutely necessary. The linker needs the value of the start address, and the lengths of memory regions, in order to do any linking at all; these values are computed as soon as possible when the linker reads in the command file. However, other values (such as symbol values) are not known or needed until after storage allocation. Such values are evaluated later, when other information (such as the sizes of output sections) is available for use in the symbol assignment expression.
Because the LONG
value (unlike its size + location) is not needed until after all the sections have been computed, it works and the correct value is added to the final binary.
I'll see if I can post a version of what I have tonight or tomorrow for you to review and we can go from there.