KTX-Software icon indicating copy to clipboard operation
KTX-Software copied to clipboard

Concurrency issues when call ktxTexture2_TranscodeBasis in multithreading.

Open vmwalker opened this issue 1 month ago • 14 comments

Hi, The ktxTexture2_TranscodeBasis function calls the basisu_transcoder_init function, which is not thread-safe. When ktxTexture2_TranscodeBasis is called simultaneously from multiple threads, the basisu_transcoder_init function may run concurrently in multiple threads, corrupting the initialization process of global data structures within basisu and causing the process to crash. The abnormal code snippet is as follows:

// Transcoder global initialization. Requires ~9 milliseconds when compiled
// and executed natively on a Core i7 2.2 GHz. If this is too slow, the
// tables it computes can easily be moved to be compiled in.
static bool transcoderInitialized;
if (!transcoderInitialized) {
    basisu_transcoder_init();
    transcoderInitialized = true;
}

Is there possible to design a thread-safe initialization mechanism to ensure basisu_transcoder_init called only once in multi-threaded environment.

vmwalker avatar Nov 20 '25 13:11 vmwalker

Thank you for your report. I will investigate. Please provide a simple application or a gtest test case to demonstrate the problem. I want to add a test case to our gtest test suite, to the texturetests application I think.

MarkCallow avatar Nov 21 '25 12:11 MarkCallow

The problem has a low probability of occurrence and is difficult to reproduce. I am trying to find ways to increase the probability of reproducing this problem, and I will update you on any progress I make here.

vmwalker avatar Nov 24 '25 11:11 vmwalker

I can reproduce the issue using follow code although the probability is a bit small. I test it with a android phone. The key is make ktxTexture2_TranscodeBasis to be called concurrently at the same time in multiple threads. For this I add a barrier in the task :

void ktxSoftwareTest() {
    std::barrier syncPoint(2);

    auto funcLoad = [&syncPoint] (const std::string& filePath) {
        ktxTexture2 *texture = nullptr;

        KTX_error_code result = ktxTexture_CreateFromNamedFile(
                filePath.c_str(), KTX_TEXTURE_CREATE_LOAD_IMAGE_DATA_BIT, (ktxTexture **) &texture);
        if (result != KTX_SUCCESS) {
            ALOGE("Could not load the requested image file(%d)!", result);
            return;
        }

        if (ktxTexture2_NeedsTranscoding(texture)) {
            auto targetFormat = KTX_TTF_ASTC_4x4_RGBA;

            // The barrier to make ktxTexture2_TranscodeBasis to be called concurrently in multiple threads
            syncPoint.arrive_and_wait();

            ALOGD("transcoding(%s)...", ktxTranscodeFormatString(targetFormat));
            result = ktxTexture2_TranscodeBasis(texture, targetFormat, 0);
            if (result != KTX_SUCCESS) {
                ALOGE("transcode input texture to the target format(%s) failed(%d): %s",
                      ktxTranscodeFormatString(targetFormat), result, ktxErrorString(result));
            } else {
                ALOGD("texture transcoding finished");
            }
        } else {
            ALOGD("no need to do texture transcoding.");
        }

        ktxTexture2_Destroy(texture);
    };

    std::thread t1([&funcLoad] {
        funcLoad("the/path/of/one.ktx2");
    });

    std::thread t2([&funcLoad] {
        funcLoad("the/path/of/another.ktx2");
    });

    t1.join();
    t2.join();
}

vmwalker avatar Nov 24 '25 12:11 vmwalker

Thank you. I have converted the code to a function in our gtest suite. Unfortunately I have not been able to reproduce the problem. Does the size (i.e. time to transcode) make any difference to the probability? Would more threads increase the probability? I want to reproduce it before attempting to add a mutex around the transcoder initialization. That way I can tell if the problem is fixed.

MarkCallow avatar Nov 27 '25 10:11 MarkCallow

As a workaround until a mutex is added, you can call ktxTexture2_TranscodeBasis and wait for it to return before launching your threads. basisu_transcoder_init() only needs to be called once. A global variable is used so make that happen which works fine in unthreaded code and once set, will prevent further calls by the threads.

MarkCallow avatar Nov 27 '25 13:11 MarkCallow

As a workaround until a mutex is added, you can call ktxTexture2_TranscodeBasis and wait for it to return

Shortly after I sent the above, it dawned on me that this very solution was likely preventing me from reproducing the problem. I had added the new test at the end of the test suite and a previous test was calling ktxTexture2_TranscodeBasis. I moved the new test to run first but I still haven't reproduced it after 20 attempts. I set breakpoints at the basisu_transcoder_init(); and transcoderInitialized = true; lines and can reliably see the second thread calling basisu_transcoder_init() before the first thread returns and sets transcoderInitialized. Nevertheless the test is successfully transcoding the input image. @vmwalker, do you have any idea why it is alway succeeding for me?

MarkCallow avatar Nov 28 '25 08:11 MarkCallow

I was able to reproduce the issue once every hundred local tests. My conditions for reproduction were that the ktx file was approximately 3MB. The key to reproducing the problem was that multiple threads simultaneously started calling ktxTexture2_TranscodeBasis. To achieve this, I added syncPoint.arrive_and_wait() to my test code. Theoretically, increasing the number of threads should also slightly increase the probability of reproducing the problem. Additionally, using a less powerful device should also increase the probability of reproducing the problem (increasing the chance of concurrently reallocating memory in basisu_transcoder_init; on my Android phone, each call to ktxTexture2_TranscodeBasis took approximately 40ms). The exception call stack of the problem I reproduced using the debug version is as follows:

11-24 20:10:11.621  9968  9968 F DEBUG   : Time: 2025-11-24 20:10:11
11-24 20:10:11.621  9968  9968 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
11-24 20:10:11.621  9968  9968 F DEBUG   : Build fingerprint: ****
11-24 20:10:11.621  9968  9968 F DEBUG   : Revision: '0'
11-24 20:10:11.621  9968  9968 F DEBUG   : ABI: 'arm64'
11-24 20:10:11.621  9968  9968 F DEBUG   : Timestamp: 2025-11-24 20:10:11.402682266+0800
11-24 20:10:11.621  9968  9968 F DEBUG   : Process uptime: 4s
11-24 20:10:11.621  9968  9968 F DEBUG   : Cmdline: com.****.****
11-24 20:10:11.621  9968  9968 F DEBUG   : pid: 9905, tid: 9965, name: llvm-worker-1  >>> com.****.**** <<<
11-24 20:10:11.621  9968  9968 F DEBUG   : uid: 1000
11-24 20:10:11.621  9968  9968 F DEBUG   : tagged_addr_ctrl: 0000000000000001 (PR_TAGGED_ADDR_ENABLE)
11-24 20:10:11.621  9968  9968 F DEBUG   : pac_enabled_keys: 000000000000000f (PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY, PR_PAC_APDBKEY)
11-24 20:10:11.621  9968  9968 F DEBUG   : signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
11-24 20:10:11.621  9968  9968 F DEBUG   : Abort message: '****/ktx-software/external/basisu/transcoder/basisu_containers_impl.h:42: bool basisu::elemental_vector::increase_capacity(size_t, bool, size_t, object_mover, bool): assertion "m_size <= m_capacity" failed'
11-24 20:10:11.621  9968  9968 F DEBUG   :     x0  0000000000000000  x1  00000000000026ed  x2  0000000000000006  x3  00000070029819c0
11-24 20:10:11.621  9968  9968 F DEBUG   :     x4  6f7164666d68452e  x5  6f7164666d68452e  x6  6f7164666d68452e  x7  7f7f7f7f7f7f7f7f
11-24 20:10:11.621  9968  9968 F DEBUG   :     x8  00000000000000f0  x9  001ece7c670dd341  x10 000000ff00000020  x11 000000003b9ac9ff
11-24 20:10:11.621  9968  9968 F DEBUG   :     x12 000000007fffffff  x13 00000000001d34ca  x14 00000013e3418dd2  x15 0000000034155555
11-24 20:10:11.621  9968  9968 F DEBUG   :     x16 0000007368dbfe68  x17 0000007368da4f40  x18 0000007000e98000  x19 00000000000026b1
11-24 20:10:11.621  9968  9968 F DEBUG   :     x20 00000000000026ed  x21 00000000ffffffff  x22 00000000000026b1  x23 00000000000026cb
11-24 20:10:11.621  9968  9968 F DEBUG   :     x24 00000070029823c0  x25 0000007002982728  x26 0000000000000058  x27 000000700288a000
11-24 20:10:11.621  9968  9968 F DEBUG   :     x28 0000007002886000  x29 0000007002981a40
11-24 20:10:11.621  9968  9968 F DEBUG   :     lr  0000007368d3fcac  sp  00000070029819c0  pc  0000007368d3fcd0  pst 0000000000001000
11-24 20:10:11.621  9968  9968 F DEBUG   : 36 total frames
11-24 20:10:11.621  9968  9968 F DEBUG   : backtrace:
11-24 20:10:11.621  9968  9968 F DEBUG   :       #00 pc 00000000000bacd0  /apex/com.android.runtime/lib64/bionic/libc.so (abort+160) (BuildId: eecc0eb39ce8d4c720af5e5e592df49c)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #01 pc 00000000000bb554  /apex/com.android.runtime/lib64/bionic/libc.so (__assert2+40) (BuildId: eecc0eb39ce8d4c720af5e5e592df49c)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #02 pc 000000000043cb6c  /system/app/****/lib/arm64/lib****.so (basisu::elemental_vector::increase_capacity(unsigned long, bool, unsigned long, void (*)(void*, void*, unsigned long), bool)+96) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #03 pc 0000000000483ce0  /system/app/****/lib/arm64/lib****.so (basisu::vector<unsigned char>::increase_capacity(unsigned long, bool, bool)+60) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #04 pc 0000000000483c24  /system/app/****/lib/arm64/lib****.so (basisu::vector<unsigned char>::try_resize(unsigned long, bool)+184) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #05 pc 0000000000483b44  /system/app/****/lib/arm64/lib****.so (basisu::vector<unsigned char>::resize(unsigned long, bool)+40) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #06 pc 0000000000483aa8  /system/app/****/lib/arm64/lib****.so (astc_helpers::dequant_table::init(bool, unsigned int, bool)+108) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #07 pc 000000000047ea48  /system/app/****/lib/arm64/lib****.so (astc_helpers::dequant_tables::init(bool)+372) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #08 pc 00000000004413c0  /system/app/****/lib/arm64/lib****.so (astc_helpers::init_tables(bool)+32) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #09 pc 0000000000447ec4  /system/app/****/lib/arm64/lib****.so (basist::basisu_transcoder_init()+44) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #10 pc 00000000004399a8  /system/app/****/lib/arm64/lib****.so (ktxTexture2_TranscodeBasis+1656) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #11 pc 0000000000357a78  /system/app/****/lib/arm64/lib****.so (ktxSoftwareTest()::$_2::operator()(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char>> const&) const+380) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #12 pc 0000000000358e84  /system/app/****/lib/arm64/lib****.so (ktxSoftwareTest()::$_1::operator()() const+56) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #13 pc 0000000000358e3c  /system/app/****/lib/arm64/lib****.so (decltype(std::declval<ktxSoftwareTest()::$_1&>()()) std::__ndk1::__invoke[abi:ne190000]<ktxSoftwareTest()::$_1&>(ktxSoftwareTest()::$_1&)+20) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #14 pc 0000000000358df4  /system/app/****/lib/arm64/lib****.so (void std::__ndk1::__invoke_void_return_wrapper<void, true>::__call[abi:ne190000]<ktxSoftwareTest()::$_1&>(ktxSoftwareTest()::$_1&)+20) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #15 pc 0000000000358dd0  /system/app/****/lib/arm64/lib****.so (std::__ndk1::__function::__alloc_func<ktxSoftwareTest()::$_1, std::__ndk1::allocator<ktxSoftwareTest()::$_1>, void ()>::operator()[abi:ne190000]()+24) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #16 pc 00000000003580b8  /system/app/****/lib/arm64/lib****.so (std::__ndk1::__function::__func<ktxSoftwareTest()::$_1, std::__ndk1::allocator<ktxSoftwareTest()::$_1>, void ()>::operator()()+24) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)
11-24 20:10:11.621  9968  9968 F DEBUG   :       #17 pc 000000000023e690  /system/app/****/lib/arm64/lib****.so (std::__ndk1::__function::__value_func<void ()>::operator()[abi:ne190000]() const+56) (BuildId: c4bef52fd665d3d03533743d6b06e76d72c67e56)

vmwalker avatar Dec 02 '25 06:12 vmwalker

@vmwalker, thanks for the additional info. Given the structure of the gtest-based unit tests to which I have added your test and the fact variables within the test library are not accessible because it is a shared library, I can only try multiple runs manually. I will think about how to package it differently with a script to invoke it multiple times. As an alternative I did try creating 100 threads and ran that a few times without provoking the failure.

I think the mutex is best placed in the basis_universal library so I have opened an issue over there.

MarkCallow avatar Dec 02 '25 11:12 MarkCallow

@MarkCallow thank you very much for taking the time to discuss this issue. I'll try adding a mutex when calling ktxTexture2_TranscodeBasis to work around the problem, although this slightly reduces the efficiency of parsing ktx files (because locking is only required the first time the method is called; subsequent calls don't need this mutex, and also this is the most time-consuming part of the entire ktx file parsing process).

vmwalker avatar Dec 03 '25 12:12 vmwalker

I'm still working on this. I have created a branch with the fix but it is not yet merged so I have reopened it.

@vmwalker please test the fix as I am still unable to reproduce the issue. You can find the Android package built from the fix branch at https://github.com/KhronosGroup/KTX-Software/actions/runs/19962184363/artifacts/4776300719.

Most of the changes in the branch have to do with my upping the c++ version used in the library tests to c++20 so I can use std::barrier. The fix is the change in lib/src/basis_transcode.cpp.

MarkCallow avatar Dec 05 '25 12:12 MarkCallow

I made a standalone test that creates five threads and a script to call it from ctest. I removed the putative fix from the library I had the ctest call the test 1000 times. I did not reproduce the failure. I will have to rely on you @vmwalker testing the fix.

MarkCallow avatar Dec 06 '25 11:12 MarkCallow

I made a standalone test that creates five threads and a script to call it from ctest. I removed the putative fix from the library I had the ctest call the test 1000 times. I did not reproduce the failure. I will have to rely on you @vmwalker testing the fix.

Ahh! I had an error in the script so it wasn't looping the number of times I thought. Now its fixed, I have reproduced the problem once so far. A first run of 100 attempts passed. A second run of 1000 failed somewhere around the 120th attempt. But all further tries with the 1000 attempt test have failed. I then upped the attempts to 10000 and finally got another failure. It took 419s which means it was about the 3000th attempt.

I welcome any suggestions you might have on how to improve the probability of failure.

MarkCallow avatar Dec 06 '25 11:12 MarkCallow

Okay, I'll test it in my environment and report the results later.

vmwalker avatar Dec 08 '25 06:12 vmwalker

@MarkCallow I tested the fix you provided, and the problem did not reappear in more than 1,000 tests. Before the fix, the problem would reappear in about 300 tests in my test environment. I think there are two areas for improvement:

  1. To ensure the test code can run correctly in environments older than C++20, I wrote a replacement method for std::barrier, implemented as follows:
class Barrier {
private:
    std::mutex mtx;
    std::condition_variable cv;
    int threshold;
    int count;
    int generation; 

public:
    explicit Barrier(int count) : threshold(count), count(count), generation(0) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mtx);
        int gen = generation;

        if (--count == 0) {
            generation++;
            count = threshold;
            cv.notify_all();
        } else {
            cv.wait(lock, [this, gen] { return gen != generation; });
        }
    }
};
  1. In the code below, a thread is still need hold lock transcoderInitMutex even after the basisu transcoder is initialized, and then release lock quickly after the transcoderInitialized condition is not met. This increases the possibility of multi-threaded contention for the transcoderInitMutex mutex. In fact, we only need to hold transcoderInitMutex when the basisu transcoder is not initialized.
static bool transcoderInitialized;
static std::mutex transcoderInitMutex;
transcoderInitMutex.lock();
if (!transcoderInitialized) {
    // Another thread did not initialize the transcoder while we were waiting.
    basisu_transcoder_init();
    transcoderInitialized = true;
}
transcoderInitMutex.unlock();

The improvement idea is: when a thread requesting transcoderInitMutex, first check if the basisu transcoder has been initialized. If it has been initialized, there is no need to request transcoderInitMutex. This can completely avoid the contention state that occurs when calling ktxTexture2_TranscodeBasis in a multi-threaded environment. Of course, the type of transcoderInitialized also needs to be changed to std::atomic<bool> to ensure that the transcoderInitialized value read by each thread is the same after modifying transcoderInitialized. The improved code is as follows, and this solution has not encountered any problems after more than 1,000 tests.

static std::atomic<bool> transcoderInitialized(false);
static std::mutex init_mutex;
if (!transcoderInitialized.load(std::memory_order_acquire)) {
    std::lock_guard<std::mutex> lock(init_mutex);
    if (!transcoderInitialized.load(std::memory_order_relaxed)) {
        basisu_transcoder_init();
        transcoderInitialized.store(true, std::memory_order_release);
    }
}

vmwalker avatar Dec 12 '25 09:12 vmwalker

@vmwalker, thank you very much for your testing effort and report.

Thanks for the alternate barrier implementation but std::barrier is only used in the test program and I am happy for that to not require anything earlier than c++ 20 as it is available on all the platforms where we run our tests.

Doesn't changing transcoderInitialized to std::atomic<bool> effectively mean that an internal lock is required to access it? If so, there is no advantage over the original code which is simpler.

MarkCallow avatar Dec 16 '25 08:12 MarkCallow

Please answer the question I raised about using an atomic.

I am trying to decide whether it is worth including the test I wrote, that is evolved from yours, in our test suite. It only very occasionally triggers the bug even when I run it 1000 times. As part of our CI we can't afford a test that takes many minutes to run so, if it is included, I'd have to limit it to 100 runs. Even at that level it takes a long time compared to other tests. What to you think?

MarkCallow avatar Dec 17 '25 06:12 MarkCallow

@MarkCallow The implementation of std::atomic does not require internal lock. Compared with ordinary type variables, after a std::atomic type variable is modified in one thread, the modified value can be observed in other threads faster. Before holding the mutex, judging the variable transcoderInitialized can avoid the introduction of new unnecessary multi-thread competition by each thread still trying to acquire the mutex after the initialization of basisu_transcoder_init is completed.

vmwalker avatar Dec 17 '25 10:12 vmwalker

This is a low-probability issue in a multi-threaded environment. I completely agree that we should avoid spending too much effort in test cases to completely avoid a particular issue that is about to be fixed. I think it's not worthwhile.

vmwalker avatar Dec 17 '25 10:12 vmwalker