libctru TLS memory errors with aligned thread-local variables

Born from: https://github.com/Meziu/ctru-rs/issues/60

Description

After some testing on our Rust libraries for this system, we've noticed a weird bug with the use of thread-local storage. With some investigation, we have pinpointed the problem to the use of aligned thread-local variables. As our tests show, the offset of the variable pointers are wrong when one or more aligned TLS variables exist, causing memory leaks, access of uninitialized memory etc. This issue appears in both Rust and C programs.

Furthermore, this issue seems to only affect (at least in our examples) programs with a debug optimization level.

E.g. : debug mode in Rust reproduces the bug. release mode in Rust doesn't reproduce the bug. -Og in C reproduces the bug. -O2 in C doesn't reproduce the bug.

Here is a test program to try the issue. Remember to use `-Og` to replicate:

#include <3ds.h>
#include <stdio.h>
#include <string.h>

typedef ALIGN(4) struct {
    u8 inner[3];
} Align4;

typedef ALIGN(16) struct {
    u8 inner[3];
} Align16;

static __thread Align4 BUF_4 = {.inner = {2, 2, 2}};
static __thread Align16 BUF_16 = {.inner = {1, 1, 1}};

int
main(int argc, char** argv)
{
    gfxInitDefault();
    consoleInit(GFX_TOP, NULL);

    BUF_16.inner[0] = 0;

    bool reproduced = false;

    // Check if at least one byte has been offset improperly
    printf("[");
    for (int i = 0; i < 3; i++) {
        if (BUF_4.inner[i] != 2) {
            reproduced = true;
        }
        printf("%d, ", BUF_4.inner[i]);
    }
    printf("]\n");

    if (reproduced) {
        printf("Bug has been reproduced!\n");
    }
    else {
        printf("There has been no issue!");
    }

    // Main loop
    while (aptMainLoop()) {
        gspWaitForVBlank();
        hidScanInput();

        u32 kDown = hidKeysDown();
        if (kDown & KEY_START)
            break;  // break in order to return to hbmenu

        // Flush and swap framebuffers
        gfxFlushBuffers();
        gfxSwapBuffers();
    }

    gfxExit();
    return 0;
}

Notes about our Rust toolchain

Our Rust programs mainly use libctru (to substitute missing newlib functions), and the arm-none-eabi-gcc linker. Most of the linking flags are the same (basically those in a normal Makefile but in the Rust target-specification). Either way, it's hard to believe this one to be an issue in either of the compilers (rustc and gcc) as the issue reproduces in both. It's likely an issue with libctru or the linker.

Jun 23 '22 09:06 Meziu

Generally, 2*sizeof(void*) is the maximum alignment supported by the TLS section, as a side effect of how it's allocated. I'm not sure there is a way to know at runtime which alignment is required by the TLS section. Also note that superaligned variables will cascade the alignment requirement into their parent section, which might cause memory waste. I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?

Re: non-C/C++ languages. It might sound disappointing, but it is not advised to attempt to bring them to devkitPro homebrew environments, as they constitute an endless source of build maintainability and reproducibility problems; not to mention problems caused by necessary and unavoidable API/ABI changes in our upstream, which encourages users of those 3rd party languages to damage their toolchain installations, and ultimately produces unhappy users that are unable to compile simple programs.

Jul 23 '22 09:07 fincs

Generally, 2*sizeof(void*) is the maximum alignment supported by the TLS section, as a side effect of how it's allocated. I'm not sure there is a way to know at runtime which alignment is required by the TLS section. Also note that superaligned variables will cascade the alignment requirement into their parent section, which might cause memory waste.

Thanks for the information! It is very interesting, but this explanation doesn't provide an answer to why there is a difference between low or high optimization levels.

I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?

I'd lie if I said I knew one. I have no idea why, but a math library (deep down in a dependency tree for other things I was testing) used alignment in TLS variables. It is important to note however how much this kind of functionality is used in Rust programs. Because of how easy it is to manage multi-threaded applications in Rust, many libraries include into their systems non-optional threading functionality. We are working to make managing 3DS preemptive threads just as easy, but to have native support for these kinds of things we need to make some compromises. For now, we have implemented (a quite wasteful, yet highly compatible) heap-based TLS system. Having out-of-the-box support for Rust's most popular libraries is one of our objectives.

Re: non-C/C++ languages. It might sound disappointing, but it is not advised to attempt to bring them to devkitPro homebrew environments, as they constitute an endless source of build maintainability and reproducibility problems; not to mention problems caused by necessary and unavoidable API/ABI changes in our upstream, which encourages users of those 3rd party languages to damage their toolchain installations, and ultimately produces unhappy users that are unable to compile simple programs.

We know, I know. It was clear since the start that maintaining our toolchain would be the hardest part (especially long term), which is why I always took some precautions before attempting any important decisions. We aim at abstracting libctru code behind a standard Newlib implementation, to ensure longevity and need of small changes in Rust's official tree. If everything goes according to plan, we may have to never modify our code in the Rust std, for example, and even if we had to, by using standard names and functions as any other Newlib implementation (mainly thanks to our multifills rust-linker-fix-3ds and pthread-3ds) it is easier to make contributors understand how the system works.

It's a big task, but we strive to overcome it, thanks to our efforts and those of the many contributors that have already joined us. Also, writing compiled code that doesn't give me nightmares of unspecified ARM panics is a dream I just can't give up on. Hope you can understand where we are coming from.

Jul 23 '22 10:07 Meziu

I'm not sure there is a way to know at runtime which alignment is required by the TLS section. Also note that superaligned variables will cascade the alignment requirement into their parent section, which might cause memory waste. I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?

@fincs would it be possible to set something like this in the linker script?

	.tdata ALIGN(4) :
	{
		__tdata_lma = .;
		LONG (ALIGNOF (.tdata));
		*(.tdata)
		*(.tdata.*)
		*(.gnu.linkonce.td.*)
		. = ALIGN(4);
		__tdata_lma_end = .;
	} : data

The first LONG there could then be used at runtime, something like

	Thread t = (Thread)memalign(*(size_t*)(__tdata_lma),allocsize+tlssize);

Obviously, the rest of the section would be offset by the extra data but I think that could be worked around in the implementation?

From a brief test, I can print the value and it appears to correctly reflect the max alignment, so maybe this would be a way to support arbitrary aligns in TLS? I'm not sure how most TLS implementations handle this, but the linker seems to me like the only place that would have the right information to insert for runtime usage...

I'm also unsure why one would be placing a superaligned variable in TLS, could you enlighten me with a use case?

I don't know too much about the details of the algorithm, or the performance implications, but it comes up in this change to add threading support to a matrix multiplication library: bluss/matrixmultiply@2ddd0ba (#52).

I assume that the algorithm primarily targets x86 platforms but is meant to be portable and efficient, and it looks like they are using the TLS as some kind of buffer for for parallel calculations, maybe so the whole thing fits in a cache line or something like that? I don't know too much about the details or performance implications there.

Aug 09 '22 02:08 ian-h-chamberlain

After a fair bit of trial and error, I think I have a way to make the TLS implementation work for over-aligned variables like these.

I believe it's possible to make the changes in a backwards-compatible way as well, if necessary, and that the implementation I have does not incur a significant additional runtime cost compared to the existing implementation (other than the additional memory usage you mentioned, which should only affect executables that explicitly use these kinds of over-aligned TLS variables).

@fincs would you accept PRs here and in https://github.com/devkitPro/devkitarm-crtls to make this possible?

Aug 17 '22 12:08 ian-h-chamberlain

Sure, go for it. If all, I'd try to avoid placing the LONG() value inside the tdata section if possible.

Aug 17 '22 12:08 fincs

Sure, go for it. If all, I'd try to avoid placing the LONG() value inside the tdata section if possible.

What I have now puts it in the .rodata instead (although I'm not sure that's the best section), since that doesn't throw off the size of .tdata. I was originally going to just use a symbol without any LONG at all (just using ABSOLUTE and never dereferencing the symbol), but this doesn't seem to be supported by 3dsxtool since it becomes an invalid relocation.

Do you think storing the value in a different section is okay, or should I try to work around 3dsxtool to make a "valid" relocation (like __tls_align = __tdata_start + ALIGNOF(...) or something like that)?

Aug 17 '22 14:08 ian-h-chamberlain

should I try to work around 3dsxtool to make a "valid" relocation

I anticipated you would run into this problem. It is indeed not correct to try to expose the desired value as a symbol. Symbols are for addresses to "objects", not raw non-address values; and the linker (and 3dsxtool) will get confused otherwise, thinking there is an object at said address.

What I have now puts it in the .rodata instead (although I'm not sure that's the best section), (...) Do you think storing the value in a different section is okay

Seems fine to me. I had a look at LD documentation and it seems that the section must be defined prior to using the ALIGNOF() directive -- maybe you found a workaround for that.

Aug 17 '22 15:08 fincs

Seems fine to me. I had a look at LD documentation and it seems that the section must be defined prior to using the ALIGNOF() directive -- maybe you found a workaround for that.

Yeah, I didn't expect it to work when I first tried it, but I think the reason it can work is this:

The linker uses "lazy evaluation" for expressions; it only calculates an expression when absolutely necessary. The linker needs the value of the start address, and the lengths of memory regions, in order to do any linking at all; these values are computed as soon as possible when the linker reads in the command file. However, other values (such as symbol values) are not known or needed until after storage allocation. Such values are evaluated later, when other information (such as the sizes of output sections) is available for use in the symbol assignment expression.

Because the LONG value (unlike its size + location) is not needed until after all the sections have been computed, it works and the correct value is added to the final binary.

I'll see if I can post a version of what I have tonight or tomorrow for you to review and we can go from there.

Aug 17 '22 15:08 ian-h-chamberlain

libctru libctru copied to clipboard

TLS memory errors with aligned thread-local variables

Description

Here is a test program to try the issue. Remember to use -Og to replicate:

Notes about our Rust toolchain

libctru
libctru copied to clipboard

Here is a test program to try the issue. Remember to use `-Og` to replicate: